A Novel Classification Approach for C2C E-Commerce Fraud Detection

*1 Haitao Xiong, 2 Yufeng Ren, 2 Pan Jia
*1 School of Computer and Information Engineering, Beijing Technology and Business University, Beijing, China, [email protected]
2 School of Computer and Information Engineering, Beijing Technology and Business University, Beijing, China

Abstract

Fraud in consumer-to-consumer (C2C) e-commerce is becoming increasingly serious. The purpose of this study is to develop an effective fraud detection model that helps customers identify potentially fraudulent transactions. We use Naive Bayes (NB), the decision tree C4.5 and AdaBoost to construct a model for classifying imbalanced transaction data, and majority voting is used to combine the classifiers. Several experiments are conducted on a Taobao data set to verify the classification performance of the proposed model using four popular performance metrics. The experimental results demonstrate that the model based on NB and AdaBoost&C4.5 can significantly increase the ability to locate potential fraud transactions in C2C e-commerce.

Keywords: Fraud Detection, Decision Tree, AdaBoost, Imbalanced Data, Classification

1. Introduction

The fast and wide development of the Internet has made C2C e-commerce more and more popular because of its low cost and high efficiency. During this rapid development, hidden problems have been exposed. In a virtual Internet transaction it is not easy to verify the identity of either party, and customers have difficulty buying products because information about product quality is asymmetric. Therefore, the lemon effect will occur [1], and it is hard to find a feasible solution to this problem [2-5]. Buyers take this into consideration and regard the quality of goods as uncertain.
Only goods of average quality will be considered, which has the side effect that goods of above-average quality are driven out of the market, ultimately leading to the destruction of the market.

Currently, the reputation systems chosen by most C2C e-commerce sites to prevent fraud mainly use a simple summation or average of ratings. Summation of ratings sums the positive ratings and the negative ratings separately and keeps a total score equal to the positive score minus the negative score. The reputation systems adopted by eBay and Taobao use summation of ratings [6]. Average of ratings, which is used by Amazon, computes the reputation score as the average of all ratings. Such reputation systems can hardly depict a trader's true reputation and are often attacked by fraudsters.

As C2C e-commerce develops, more and more buyers and sellers participate in it. Meanwhile, the number of fraud transactions also rises remarkably. Non-fraud transactions are represented by a large number of records, while fraud transactions are represented by only a few. This class imbalance makes it extremely difficult to extract fraud patterns in C2C e-commerce.

In this study, we propose an innovative C2C e-commerce fraud detection model based on NB and AdaBoost&C4.5 to classify imbalanced transaction data. To build the model, we examine its components and revise its architecture. We then evaluate its capability of discriminating abnormal transactions through experiments on a Taobao data set.

This paper is organized as follows: Section 2 describes the methodology used in the model. Section 3 describes the classification mechanism of the C2C e-commerce fraud detection model. Section 4 details our data and performance metrics, followed by the experiments and results. Finally, the conclusions of this study and future work are provided in Section 5.
International Journal of Digital Content Technology and its Applications (JDCTA), Volume 7, Number 1, January 2013, doi: /jdcta.vol7.issue

2. Research methodology

In the following section, we discuss the research methodology used in this study and the main components involved.

2.1 Learning algorithms

In fraud detection research, several classification algorithms are widely used, including naive Bayes, the C4.5 decision tree and AdaBoost [7-9].

1) Naive Bayes

Naive Bayes is a simple probabilistic classifier based on applying Bayes' theorem with naive independence assumptions. It assumes that the presence or absence of a particular feature of a class is unrelated to the presence or absence of any other feature [7]. In spite of their naive design and apparently over-simplified assumptions, naive Bayes classifiers often work much better in complex real-world situations than one might expect. NB can give better predictive accuracy than other algorithms such as C4.5 and BP when attributes are normally distributed and not redundant. When attributes are not normally distributed and are redundant, it shows lower predictive accuracy.

2) Decision tree C4.5

A decision tree is a decision support technique that uses a tree-like graph or model of decisions and their possible consequences. In machine learning, a decision tree is a predictive model that maps observations about an item to conclusions about its target value. Leaves represent classifications, and branches represent conjunctions of features that lead to those classifications [8]. Most decision tree inducers assume that the overall prediction decision can be made by dividing the decision into a sequence of small decisions. Different decision tree inducers mainly differ in the goodness measure used to select the splitting attribute at each intermediate tree node [10]. C4.5 is a decision tree algorithm that can not only make accurate predictions but also explain hidden patterns in data. It can deal with numeric attributes and missing values, estimate error rates and generate rules.
C4.5 is one of the most commonly used algorithms in the data mining and machine learning communities, and C4.5 combined with under-sampling or over-sampling is quickly becoming the accepted baseline to beat in class imbalance research. In prediction accuracy, C4.5 performs better than CART and ID3. However, C4.5 may cause scalability and over-fitting problems when it is applied to large data sets.

3) AdaBoost

Boosting is one of the most powerful machine learning approaches to emerge in the past decade. AdaBoost, a kind of boosting, is a meta-algorithm. It can be used to linearly combine many other learning algorithms in order to correct the misclassifications made by weak classifiers. It is sensitive to noisy data and outliers, but less susceptible to the over-fitting problem than most learning algorithms [11]. If simple weak classifiers are used, the AdaBoost algorithm is very fast [12]. AdaBoost is an ensemble construction technique that has become very popular for its simplicity and adaptability. There are two approaches implemented in AdaBoost: one is re-weighting, and the other is re-sampling. To fit in with imbalanced data processing, AdaBoost with re-weighting is used in our study. With the re-weighting approach, all weighted training samples are used in each round to train the classifier. AdaBoost calls a weak classifier repeatedly in a series of rounds t = 1, ..., T. For each call, a distribution of weights D_t is updated that indicates the importance of each example in the data set for the classification. The weights of incorrectly classified examples are increased, so that the new classifier focuses more on those examples.

All three classification algorithms have advantages and disadvantages. Based on the preceding introduction, the performance of the three algorithms is summarized in Table 1.
Therefore, by using the three algorithms together on the same data, their strengths can be combined and their weaknesses reduced.
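The re-weighting round described in Section 2.1 can be illustrated with a short sketch. It assumes the standard discrete AdaBoost update (classifier weight alpha = 0.5 ln((1 - eps) / eps), followed by normalization); the paper does not state which exact update rule it uses, and the function name `reweight` is hypothetical.

```python
import math


def reweight(weights, y_true, y_pred):
    """One round of discrete AdaBoost re-weighting (assumed standard update).

    weights: current distribution D_t over the training examples (sums to 1).
    y_true, y_pred: true labels and the weak classifier's predictions.
    Returns the updated distribution D_{t+1} and the classifier weight alpha.
    """
    # Weighted error of the weak classifier under the current distribution.
    eps = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    if not 0.0 < eps < 0.5:
        raise ValueError("weighted error must lie strictly between 0 and 0.5")

    # Weight of this weak classifier in the final linear combination.
    alpha = 0.5 * math.log((1.0 - eps) / eps)

    # Increase the weights of misclassified examples, decrease the others.
    updated = [
        w * math.exp(alpha if t != p else -alpha)
        for w, t, p in zip(weights, y_true, y_pred)
    ]

    # Normalize so that the weights form a distribution again.
    total = sum(updated)
    return [w / total for w in updated], alpha
```

With four equally weighted examples and a single mistake, the misclassified example ends up carrying half of the total weight after the update, which is what forces the next weak classifier to focus on it.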

Table 1. Comparison of three algorithms

Algorithm   Accuracy    Scalability  Speed      Over-fitting
NB          Good        Excellent    Excellent  Low
C4.5        Excellent   Poor         Good       High
AdaBoost    Excellent   Poor         Poor       Medium

2.2. Imbalanced data processing: Under-sampling

Classification algorithms usually perform poorly on imbalanced data, and their results are biased toward the majority class. When learning from imbalanced data, traditional classification algorithms tend to produce high prediction accuracy for the majority class but low accuracy for the minority class [9,13,14]. As a result, samples tend to be classified into the majority class, although the minority class is usually the meaningful class we want to identify. Hence, imbalanced data should be handled before being passed to classification algorithms.

Under-sampling is used in our model to handle imbalanced data, because majority under-sampling is preferable to minority over-sampling [13]. In under-sampling, the minority class remains intact while the majority class is under-sampled. Work that used cost curves to explore the interaction of over-sampling and under-sampling with the decision tree learner C4.5 found that under-sampling produces a reasonable sensitivity to changes in the distribution of misclassification costs. Over-sampling, on the other hand, is surprisingly ineffective, often producing little or no change in performance.

2.3. Combination technology: Majority voting

Majority voting is a combination technology. Among all the combination technologies, it is by far the simplest to implement. It does not assume prior knowledge of the behavior of the individual classifiers, and it does not require training on large quantities of representative recognition results from the classifiers [15].
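Under-sampling and majority voting work together in an ensemble: each sub-classifier is trained on the intact minority class plus one slice of the majority class, and the sub-classifiers' votes are then combined. The following is a minimal sketch of both pieces, assuming a non-empty minority class, a random partition of the majority class into slices of roughly the minority size, and a voting rule that rejects a sample when no class has more than half of the votes. The names `balanced_subsets` and `majority_vote` are hypothetical, and classifier training itself is left out.

```python
import random
from collections import Counter


def balanced_subsets(majority, minority, seed=0):
    """Pair the intact minority class with disjoint slices of the majority.

    The majority records are shuffled and dealt into about
    len(majority) / len(minority) slices, so every slice is roughly the
    size of the minority class and the slices together cover all
    majority records.
    """
    rng = random.Random(seed)
    shuffled = list(majority)
    rng.shuffle(shuffled)
    n_slices = max(1, round(len(shuffled) / len(minority)))
    slices = [shuffled[i::n_slices] for i in range(n_slices)]
    return [(part, list(minority)) for part in slices]


def majority_vote(votes):
    """Return the class backed by more than half of the votes, else None."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    # A strict majority is required; ties and scattered votes are rejected.
    return label if count > len(votes) / 2 else None
```

For example, 95 majority records and 5 minority records yield 19 balanced subsets, and `majority_vote(["fraud", "fraud", "non-fraud"])` returns `"fraud"`, while an even split returns `None`.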
When five combination technologies (majority voting, Bayesian, logistic regression, fuzzy integral and neural network) were applied to seven classifiers, majority voting proved just as effective as the more complicated technologies in improving the recognition rate. When the decisions of n classifiers are combined by majority voting, a sample is assigned a class when there is a consensus or when more than half of the classifiers agree on its identity. Otherwise, the sample is rejected.

3. Fraud detection mechanism design

Fraud in C2C e-commerce has some recognizable characteristics. Firstly, changes in the trading characteristics of a trader, such as the types of commodity, turnover and trading frequency, can be taken as evidence of abnormal trading. Secondly, fraud can be detected by finding similar traits in the account information of sellers and learning rules from them, because sellers with similar backgrounds are likely to behave in the same way. To detect fraud through such characteristics, it is necessary to gather the transaction information, account information and reputation information of sellers. Since fraud transactions have some special traits, a functional relationship between transaction attributes and transaction types (fraud or non-fraud) will help us detect fraud in a large amount of trading data.

The purpose of this research is to design an effective and efficient fraud detection model for C2C e-commerce. In fraud detection, there is always an imbalance between the positive sample, which represents fraud transactions, and the negative sample, which represents non-fraud transactions. Classification algorithms tend to ignore the minority class while classifying the majority class accurately. Conventional algorithms are therefore limited in classifying imbalanced data, whereas an efficient fraud detection model should focus on the fraud sample (the minority class).
Thus, the data in the two types of sample should be balanced first and then used to train the fraud detection model.

For a complicated detection and classification problem, a single method can hardly meet the detection requirements. Because classifiers are complementary, different classifiers need to be combined to reduce detection errors and improve detection robustness. In this research, we take full account of combined classifiers in fraud detection to avoid the imprudent decisions that result from using a single classifier.

3.1 Mechanism of C2C e-commerce fraud detection model

Figure 1. Mechanism of C2C e-commerce fraud detection model

The C2C e-commerce fraud detection model is designed to detect abnormal transactions in transaction data and to help users make decisions when selecting commodities. The model has several steps, beginning with the preprocessing of raw data. After that, the data is ready to be used for classification. The next step is the NB classifier, which is the first classifier. The NB classifier can denote the distribution of the training set in the classification and refine the input of the next classifier. The third step is the second classifier, a combined classifier based on AdaBoost&C4.5, which is generated from sub-training sets whose sampling approach is described below. It uses the majority voting method to combine all of its sub-classifiers. The results of the NB classifier are the inputs of the second classifier, and the classification prediction results are obtained from the second classifier. Finally, several performance metrics are computed on the prediction results. The complete mechanism of the C2C e-commerce fraud detection model is shown in Fig. 1.

3.2 Sampling process in C2C e-commerce fraud detection model

Figure 2. Generation of samples for classification

Figure 2 depicts the sampling model used in C2C e-commerce fraud detection to tackle the class imbalance problem. The model is a combination of random sampling and under-sampling. As introduced qualitatively in 3.1, the NB classifier is generated to denote the distribution of the sample data, and the C4.5 classifier is generated to obtain more accurate decision tree rules. The training set for the NB classifier is therefore a part of the transaction data obtained by random sampling, and the sub-training sets for the combined C4.5 classifier are obtained by under-sampling that training set. In under-sampling, the majority sample is randomly under-sampled, and the union of the sub-samples accounts for the whole majority sample.

Initially, the data set is separated by random sampling into two sub data sets: a training set and a testing set. The testing set is used to test the classification performance of the model, and the training set is input to the NB classifier to generate the NB classification model. Then, the training set is split into two parts: the fraud sample and the non-fraud sample. After that, non-fraud sub-samples are randomly under-sampled from the majority sample, which is the non-fraud sample, in such a way that the ratio of fraud to non-fraud is approximately 1. Finally, each non-fraud sub-sample is combined with the fraud sample to generate the sub-training sets for the combined classifier based on AdaBoost&C4.5.

4. Experiment and analysis

4.1 Experimental Data Sets and Metrics

We choose cell phones as the study object because fraud involving cell phones is more serious than fraud involving other commodities in China, and we collect cell phone data from Taobao from December 2011 to March. The fraud behaviors considered in this study are mainly misrepresentation of items, fee stacking and non-delivery of items. During data filtering and cleaning, blank and incorrect data was deleted. The final data is composed by transaction records in which are fraud transactions, and the number of non-fraud transactions is much larger than that of fraud cases. The data set is transformed into a simpler data set that is more suitable for learning, so that the data is more meaningful and more easily handled.

The under-sampling approach is then applied to the data. In this research, the ratio of non-fraud to fraud is approximately 19. We require the ratio of non-fraud to fraud to be similar in all sub-samples and ensure that it approximately equals 1. Through the under-sampling approach, 19 sub-samples are generated. After under-sampling the majority sample, each sub-sample is joined with the minority sample, forming the 19 sub-training sets of our study.

The experiments in this paper adopt ten-fold cross-validation. Each data set is divided into ten equal parts, using nine folds as the training set and the remaining fold as an independent test set. Following [9,16], Positive Accuracy, Negative Accuracy, F-measure and G-mean are used to evaluate the performance of the C2C e-commerce fraud detection model in this paper. Performance metrics are commonly calculated using the confusion matrix.
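The paper names the four metrics without giving their formulas. The sketch below assumes the definitions commonly used in the imbalanced-learning literature, with fraud as the positive class: Positive Accuracy is the true positive rate, Negative Accuracy is the true negative rate, F-measure is the harmonic mean of precision and the true positive rate, and G-mean is the geometric mean of the two accuracies. The function name `imbalance_metrics` is hypothetical, and whether the authors used exactly these variants is an assumption.

```python
import math


def imbalance_metrics(tp, fn, tn, fp):
    """Compute four imbalance-aware metrics from confusion-matrix counts.

    tp: fraud correctly flagged        fn: fraud missed
    tn: non-fraud correctly passed     fp: non-fraud wrongly flagged
    """
    def ratio(num, den):
        # Return 0.0 instead of failing when a class or prediction is absent.
        return num / den if den else 0.0

    positive_accuracy = ratio(tp, tp + fn)   # true positive rate (recall)
    negative_accuracy = ratio(tn, tn + fp)   # true negative rate
    precision = ratio(tp, tp + fp)
    # Assumed variant: F-measure with equal weight on precision and recall.
    f_measure = ratio(2 * precision * positive_accuracy,
                      precision + positive_accuracy)
    g_mean = math.sqrt(positive_accuracy * negative_accuracy)
    return {
        "positive_accuracy": positive_accuracy,
        "negative_accuracy": negative_accuracy,
        "f_measure": f_measure,
        "g_mean": g_mean,
    }
```

G-mean is useful for imbalanced data because it collapses to zero whenever one class is never recognized, so a classifier that labels every transaction as non-fraud cannot score well on it despite a high Negative Accuracy.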
In this study, True Positives denote correctly identified fraud transactions, and True Negatives denote correctly identified non-fraud transactions. Similarly, False Positives denote non-fraud transactions incorrectly identified as fraud, and False Negatives denote fraud transactions incorrectly identified as non-fraud [10].

4.2 Experimental Results

The objective of the C2C e-commerce fraud detection model is to identify whether a transaction is fraud or non-fraud. In our study, the model is trained in five different ways: the NB classifier (NB), the combined classifier based on C4.5 (cc45), the combined classifier based on AdaBoost&C4.5 (cadac45), the combined classifier based on NB and C4.5 (cnbc45) and the combined classifier based on NB and AdaBoost&C4.5 (cnbadac45).

Table 2. Classification performances of five classification algorithms
Performance metrics (%): Positive Accuracy, Negative Accuracy, F-measure, G-mean
Classifiers: NB, cc45, cadac45, cnbc45, cnbadac45

By passing the testing set to the trained classifiers, we obtain the classification performance of each experiment, which is shown in Table 2. From this table, we can see that all five classifiers can indeed detect fraud transactions. We can also compare the accuracy of the different classifiers in the C2C e-commerce fraud detection model.

First, the C2C e-commerce fraud detection model can indeed greatly improve the accuracies and enhance the classification performance: cnbadac45 has the best value on every metric except Negative Accuracy. Second, the imbalance problem in C2C e-commerce fraud detection must be resolved. As one can see, the NB classifier has the lowest Positive Accuracy, F-measure and G-mean, whereas the accuracies of the four C4.5-based classifiers trained on balanced data are very high and exceed those of NB. Third, we want to determine whether adding AdaBoost to a classifier affects the classification performance. After AdaBoost is added to cnbc45, the resulting cnbadac45 performs much better than cnbc45. Finally, the relation between NB and the classification accuracies needs to be examined. NB alone has the worst performance. However, the combined classifiers based on NB and other algorithms show higher accuracies than NB. If NB is added to cc45, all accuracies fall significantly. When cnbadac45 is chosen for the model, it has the best performance and clearly improves all the accuracies compared to cadac45, but it does not clearly improve the accuracies compared to cc45.

Out of the five classifiers examined, all except NB result in an efficient classifier for detecting fraud transactions. From the classification results of NB, we find that NB is good at identifying the minority class, which is fraud transactions, and bad at identifying the majority class, which is non-fraud transactions. To eliminate the bias of the C4.5 classifier toward the majority class in imbalanced data, the sampling approach described in Section 3.2 is used to build the training sets. The training sets for these C4.5 classifiers are balanced data sets generated by under-sampling. All four classifiers based on C4.5 have high classification accuracies. This sampling technique not only resolves the imbalance problem, but also generates several sub-classifiers whose results are combined into the final classification through majority voting. Given a balanced training set, the performance of a classifier based on C4.5 is relatively good. We also find that adding the AdaBoost algorithm to a classifier can either improve or worsen the classification performance.
In our research, cnbadac45 has better overall performance than cnbc45, while cadac45 has worse performance than cc45, which suggests that AdaBoost helps a classifier improve its accuracies only when it is used in the right place. Comparing all classifiers, cnbadac45 exhibits the best classification performance. Removing NB and AdaBoost from it yields cc45, which performs slightly worse, but the difference between cnbadac45 and cc45 is very small. Thus cc45 and cnbadac45 are both good at fraud detection in C2C e-commerce, and cnbadac45 is the best-performing classifier.

The capability of detecting potential fraud transactions is the most needed capability in C2C e-commerce fraud detection. Another important finding of the research is that NB can detect the most undetected fraud transactions, because of its capability to reveal potential undetected abnormalities in large-scale data. After combining NB and cadac45, we obtain the best overall classification performance among the five classifiers studied for identifying detected and undetected fraud transactions in C2C e-commerce.

5. Conclusion and future work

This study proposes a C2C e-commerce fraud detection model for classifying imbalanced data. In C2C e-commerce, the advantages of the model using cnbadac45 include good accuracies and the best performance in identifying undetected fraud transactions among all five examined classifiers. It not only improves the classification performance, but also helps to identify potential fraud transactions. After combining the C2C e-commerce fraud detection model with C2C e-commerce systems, we can detect fraud transactions when their patterns do not match normal patterns. Through this prevention, fraud transactions can be stopped before more losses occur, and customer satisfaction will improve.
In the end, more and more legitimate customers will take part in e-commerce because of the safe and steady market, and fraudsters will be excluded from the market. Along with the development of e-commerce, more and more new customers and transactions are entering the C2C market. Old identified patterns will become ineffective, so C2C e-commerce systems should carry out model self-learning to adjust patterns for continuous fraud detection.

Our study also opens up several directions for future research in fraud detection. This study conducts experiments only for the selected algorithms in the C2C e-commerce fraud detection model. In the future, more studies are needed that use different classification algorithms to detect fraud in C2C e-commerce. For instance, future studies could consider genetic algorithms, BP neural networks and so on. They could also improve the architecture of the C2C e-commerce fraud detection model and evaluate the resulting performance. In this study, we only collect data on cell phones; the model can be applied to other commodities to examine its feasibility and effectiveness. Furthermore, we focus only on binary classification in C2C e-commerce, but several types of fraud exist, and future studies could examine whether the results of this study still hold for them.

6. Acknowledgement

The research is supported by the National Natural Science Foundation of China under Grant No , the College Students Scientific Research and Undertaking Starting Action Project under Grant No. PXM2012_014213_000067, Research Foundation for Youth Scholars of Beijing Technology and Business University No. QNJ and Scientific Research Common Program of Beijing Municipal Commission of Education No. KM

References

[1] D. Teeni, D.R. Young, The changing role of nonprofits in the network economy, Nonprofit and Voluntary Sector Quarterly, vol. 32, no. 3, pp ,
[2] Z.H. Zhou, X.Y. Liu, Training cost-sensitive neural networks with methods addressing the class imbalance problem, IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 1, pp. 63-77,
[3] M. Weatherford, Mining for Fraud, IEEE Intelligent Systems, vol. 17, no. 4, pp. 4-6,
[4] J.T.S. Quah, M. Sriganesh, Real-time credit card fraud detection using computational intelligence, Expert Systems with Applications, vol. 35, no. 4, pp , 2008.
[5] Seokjoo Andrew Chang, Forensic Data Pattern Analysis using Information Entropy, International Journal on Data Mining and Intelligent Information Technology Applications, vol. 2, no. 2, pp. 12-20,
[6] A. Lin, J. Foster, S. Wang, Understanding the factors that influence acceptance of online auction platforms: a comparative study of Taobao and eBay EachNet, International Journal of Business and Systems Research, vol. 3, no. 2, pp ,
[7] P.N. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining, Addison Wesley, USA, 2005
[8] J.J. Wu, H. Xiong, J. Chen, COG: Local Decomposition for Rare Class Analysis, Data Mining and Knowledge Discovery, vol. 20, no. 2, pp ,
[9] H. He, E.A. Garcia, Learning from Imbalanced Data, IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 9, pp ,
[10] A.R. Sinha, H.M. Zhao, Incorporating domain knowledge into data mining classifiers: An application in indirect lending, Decision Support Systems, vol. 46, no. 1, pp ,
[11] X. Luo, Q. Zhu, Cost-sensitive ensemble via Adaptive Weighted Cost Proportionate Sampling, International Journal of Digital Content Technology and its Applications, vol. 5, no. 7, pp ,
[12] H.T. Xiong, Y. Yang, S.X. Zhao, Local Clustering Ensemble Learning Method based on Improved AdaBoost for Rare Class Analysis, Journal of Computational Information Systems, vol. 8, no. 4, p ,
[13] G. Batista, R.C. Prati, M.C. Monard, A study of the behavior of several methods for balancing machine learning training data, SIGKDD Explorations, vol. 6, no. 1, pp. 20-29,
[14] M.C. Chen, L.S. Chen, C.C. Hsu, W.R. Zeng, An information granulation based data mining approach for classifying imbalanced data, Information Sciences, vol. 178, no. 16, pp ,
[15] X.Y. Liu, J. Wu, Z.H. Zhou, Exploratory Undersampling for Class-Imbalance Learning, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 39, no. 2, pp ,
[16] X. Li, Research on online trading customer classification based on customer characteristics and behaviors, International Journal of Digital Content Technology and its Applications, vol. 6, no. 10, pp ,