A Novel Classification Approach for C2C E-Commerce Fraud Detection
*1 Haitao Xiong, 2 Yufeng Ren, 2 Pan Jia
*1 School of Computer and Information Engineering, Beijing Technology and Business University, Beijing , China, [email protected]
2 School of Computer and Information Engineering, Beijing Technology and Business University, Beijing , China

Abstract
Fraud in consumer-to-consumer (C2C) e-commerce is becoming increasingly serious. The purpose of this study is to develop an effective fraud detection model that helps customers identify potentially fraudulent transactions. We use Naive Bayes (NB), the C4.5 decision tree and AdaBoost to construct a model for classifying imbalanced transaction data, and majority voting is used to combine the sub-classifiers. Several experiments are conducted on a Taobao data set to verify the classification performance of the proposed model using four popular performance metrics. The experimental results demonstrate that the model based on NB and AdaBoost&C4.5 can significantly increase the ability to locate potential fraud transactions in C2C e-commerce.

Keywords: Fraud Detection, Decision Tree, AdaBoost, Imbalanced Data, Classification

1. Introduction
The rapid and widespread development of the Internet has made C2C e-commerce increasingly popular because of its low cost and high efficiency. This rapid growth has also exposed hidden problems. Because transactions are virtual, it is hard to verify the identity of either party, and customers have difficulty buying products because information about product quality is asymmetric. Therefore, the lemon effect will occur [1], and it is hard to find a feasible solution for this problem [2-5]. Aware that sellers have an incentive to overstate quality, buyers deem the quality of goods to be uncertain.
Only goods of average quality are then considered, which has the side effect of driving goods of above-average quality out of the market and ultimately leads to the destruction of the market.

Currently, the reputation systems chosen by most C2C e-commerce sites to prevent fraud mainly use a simple summation or average of ratings. Summation of ratings counts positive and negative ratings separately and keeps a total score equal to the positive count minus the negative count. The reputation systems adopted by eBay and Taobao use summation of ratings [6]. Average of ratings, which is used by Amazon, computes the reputation score as the mean of all ratings. Such reputation systems can hardly depict a trader's true reputation and are often attacked by fraudsters.

As C2C e-commerce develops, more and more buyers and sellers participate in it, and the number of fraud transactions rises markedly. Even so, non-fraud transactions are represented by a large number of records while fraud transactions are represented by only a few. This causes the class imbalance problem and makes it extremely difficult to extract fraud patterns in C2C e-commerce.

In this study, we propose an innovative C2C e-commerce fraud detection model based on NB and AdaBoost&C4.5 to classify imbalanced transaction data. To build the model, we examine its components and revise its architecture. We then evaluate its capability to discriminate abnormal transactions through experiments on a Taobao data set.

This paper is organized as follows: Section 2 describes the methodology used in the model. Section 3 describes the classification mechanism of the C2C e-commerce fraud detection model. Section 4 details our data and performance metrics, followed by the experiments and results. Finally, the conclusions of this study and future work are provided in Section 5.
International Journal of Digital Content Technology and its Applications (JDCTA), Volume 7, Number 1, January 2013, doi: /jdcta.vol7.issue

2. Research methodology
In this section, we discuss the research methodology used in this study and the main components involved.

2.1 Learning algorithms
Several classification algorithms are widely used in fraud detection research, including naive Bayes, the C4.5 decision tree and AdaBoost [7-9].

1) Naive Bayes
Naive Bayes is a simple probabilistic classifier based on applying Bayes' theorem with naive independence assumptions. It assumes that, within a class, the presence or absence of a particular feature is unrelated to the presence or absence of any other feature [7]. In spite of their naive design and apparently over-simplified assumptions, naive Bayes classifiers often work much better in complex real-world situations than one might expect. NB can give better predictive accuracy than other algorithms such as C4.5 and BP when attributes are normally distributed and not redundant. When attributes are not normally distributed and are redundant, it shows lower predictive accuracy.

2) Decision tree C4.5
A decision tree is a decision support technique that uses a tree-like graph or model of decisions and their possible consequences. In machine learning, a decision tree is a predictive model that maps observations about an item to conclusions about its target value. Leaves represent classifications and branches represent conjunctions of features that lead to those classifications [8]. Most decision tree inducers assume that the overall prediction can be made by dividing the decision into a sequence of small decisions. Different decision tree inducers mainly differ in the goodness measure used to select the splitting attribute at each intermediate tree node [10]. C4.5 is a decision tree algorithm that can not only make accurate predictions but also explain hidden patterns in data. It can handle numeric attributes and missing values, estimate error rates and generate rules.
C4.5 is one of the most commonly used algorithms in the data mining and machine learning communities, and C4.5 combined with under-sampling or over-sampling is quickly becoming the accepted baseline to beat in research on class imbalance. In prediction accuracy, C4.5 performs better than CART and ID3. However, C4.5 may cause scalability and over-fitting problems when it is applied to large data sets.

3) AdaBoost
Boosting is one of the most powerful machine learning approaches to emerge in the past decade. AdaBoost, a kind of boosting, is a meta-algorithm. It can be used to linearly combine many other learning algorithms so as to correct the misclassifications made by weak classifiers. It is sensitive to noisy data and outliers, but it is less susceptible to over-fitting than most learning algorithms [11]. If simple weak classifiers are used, the AdaBoost algorithm is very fast [12]. AdaBoost is an ensemble construction technique that has become very popular for its simplicity and adaptability. AdaBoost can be implemented in two ways: re-weighting and re-sampling. To accommodate imbalanced data processing, AdaBoost with re-weighting is used in our study. With the re-weighting approach, all training samples, together with their weights, are used in every round to train the final classifier. The algorithm calls a weak classifier repeatedly in a series of rounds t = 1, ..., T. In each round, a weight distribution D_t is updated to indicate the importance of each example in the data set for classification. The weights of incorrectly classified examples are increased, so that the next classifier focuses more on those examples.

All three classification algorithms have advantages and disadvantages. Based on the preceding introduction, the performance of the three algorithms is summarized in Table 1.
Therefore, by using the three algorithms together on the same data, their strengths can be combined and their weaknesses reduced.
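
The re-weighting loop described for AdaBoost above can be illustrated with a small sketch. It is not the authors' implementation: it swaps in one-dimensional decision stumps as the weak classifier instead of C4.5, it assumes labels of +1 and -1, and it uses the standard AdaBoost update (alpha = 0.5 * ln((1 - err) / err) with exponential re-weighting), which the text does not spell out. The names `stump_predict`, `best_stump`, `adaboost` and `predict` are hypothetical.

```python
import math


def stump_predict(x, threshold, polarity):
    """Weak classifier (assumed): `polarity` if x >= threshold, else -polarity."""
    return polarity if x >= threshold else -polarity


def best_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best stump under D_t."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best


def adaboost(xs, ys, rounds):
    """Train `rounds` stumps with re-weighting; labels must be +1 or -1."""
    n = len(xs)
    weights = [1.0 / n] * n  # D_1: every example starts equally important
    ensemble = []
    for _ in range(rounds):
        err, threshold, polarity = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, threshold, polarity))
        # Misclassified examples (y * h(x) = -1) are multiplied by exp(+alpha),
        # correctly classified ones by exp(-alpha); then D_{t+1} is normalised.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Sign of the alpha-weighted linear combination of the weak classifiers."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1
```

On a toy data set that no single stump can separate, a few rounds of re-weighting are enough for the combined classifier to fit every training example, which is the "focus on the misclassified examples" effect described above.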

Table 1. Comparison of three algorithms

Algorithm   Accuracy    Scalability   Speed       Over-fitting
NB          Good        Excellent     Excellent   Low
C4.5        Excellent   Poor          Good        High
AdaBoost    Excellent   Poor          Poor        Medium

2.2 Imbalanced data processing: Under-sampling
Classification algorithms usually perform poorly on imbalanced data, and their results are biased toward the majority class. When learning from imbalanced data, traditional classification algorithms tend to produce high prediction accuracy for the majority class but low accuracy for the minority class [9,13,14]. As a result, data tend to be assigned to the majority class, even though the minority class is usually the meaningful class we want to find. Traditional classifiers therefore cannot perform well on imbalanced learning tasks, and imbalanced data should be handled before it is passed to a classification algorithm.

Under-sampling is used in our model to handle imbalanced data. Majority under-sampling is preferable to minority over-sampling [13]. In under-sampling, the minority class remains intact while the majority class is under-sampled. When cost curves are used to explore the interaction of over- and under-sampling with the decision tree learner C4.5, under-sampling shows a reasonable sensitivity to changes in the misclassification cost distribution. Over-sampling, on the other hand, is surprisingly ineffective, often producing little or no change in performance.

2.3 Combination technology: Majority voting
Majority voting is a combination technology. Among all combination technologies, it is by far the simplest to implement. It does not assume prior knowledge of the behavior of the individual classifiers, and it does not require training on large quantities of representative recognition results from the classifiers [15].
In a comparison of five combination technologies (majority vote, Bayesian, logistic regression, fuzzy integral and neural network) applied to seven classifiers, majority vote was just as effective as the more complicated technologies in improving the recognition rate. When the decisions of n classifiers are combined by majority voting, a sample is assigned a class when there is a consensus or when more than half of the classifiers agree on it. Otherwise, the sample is rejected.

3. Fraud detection mechanism design
Fraud in C2C e-commerce has some characteristic signs. First, changes in the trading characteristics of a trader, such as the types of commodity, turnover and trading frequency, can be detected as evidence of abnormal trading. Second, fraud can be detected by learning rules from similar traits in sellers' account information, because sellers with similar backgrounds are likely to behave in the same way. To detect fraud through such characteristics, it is necessary to gather the transaction information, account information and reputation information of sellers. Since fraud transactions have some special traits, a functional relationship between transaction attributes and transaction types (fraud or non-fraud) will help us detect fraud in a large amount of trading data.

The purpose of this research is to design an effective and efficient fraud detection model for C2C e-commerce. In fraud detection, there is always an imbalance between the positive samples, which represent fraud, and the negative samples, which represent non-fraud. Classification algorithms, however, tend to ignore the minority class while classifying the majority class accurately. Conventional algorithms are therefore limited in classifying imbalanced data, whereas an efficient fraud detection model should focus on the fraud samples (the minority class).
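
As a minimal sketch of how the two ingredients introduced so far could fit together, the code below builds balanced sub-training sets by under-sampling the majority class and applies the majority voting rule described earlier (assign a class on consensus or when more than half of the votes agree, otherwise reject). Everything here is illustrative: `balanced_subsets`, `majority_vote` and the `seed` parameter are hypothetical names, records are assumed to be plain Python objects in lists, and splitting the majority class into disjoint random chunks is one assumed way of under-sampling it. Training the sub-classifiers themselves is left out.

```python
import random


def balanced_subsets(fraud, non_fraud, seed=0):
    """Pair every minority (fraud) record with one disjoint random chunk of the
    majority (non-fraud) class, so each sub-training set is roughly 1:1."""
    if not fraud:
        raise ValueError("the minority class must not be empty")
    rng = random.Random(seed)
    shuffled = list(non_fraud)
    rng.shuffle(shuffled)
    # Number of chunks is about the majority/minority ratio (at least one).
    k = max(1, round(len(shuffled) / len(fraud)))
    chunks = [shuffled[i::k] for i in range(k)]  # disjoint; union = majority class
    return [list(fraud) + chunk for chunk in chunks]


def majority_vote(votes):
    """Return the label backed by more than half of the votes, or None (reject)."""
    if not votes:
        return None
    counts = {}
    for vote in votes:
        counts[vote] = counts.get(vote, 0) + 1
    label, top = max(counts.items(), key=lambda item: item[1])
    return label if top > len(votes) / 2 else None
```

In use, one sub-classifier would be trained per sub-training set, and `majority_vote` would be called on the list of labels they predict for a transaction; a `None` result corresponds to the rejected case.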
The data in the two classes should therefore be balanced before the fraud detection model is trained. A single method can hardly meet the requirements of a complicated detection problem. Because classifiers complement one another, different classifiers need to be combined to reduce detection errors and improve detection robustness. In this research, we take
full consideration of combined classifiers in fraud detection to avoid imprudent decisions resulting from the use of a single classifier.

3.1 Mechanism of C2C e-commerce fraud detection model
Figure 1. Mechanism of C2C e-commerce fraud detection model

The C2C e-commerce fraud detection model is designed to detect abnormal records in transaction data and to help users make decisions when selecting commodities. The model consists of several steps, beginning with the preprocessing of raw data, after which the data is ready to be used for classification. The next step is the NB classifier, which is the first classifier. The NB classifier can capture the distribution of the training set and refine the input of the next classifier. The third step is the second classifier, a combined classifier based on AdaBoost&C4.5 that is generated from sub-training sets whose sampling approach is described later. It uses majority voting to combine all of its sub-classifiers. The results of the NB classifier are the inputs of the second classifier, and the classification predictions are obtained from the second classifier. Finally, several performance
metrics are computed on the prediction results. The complete mechanism of the C2C e-commerce fraud detection model is shown in Fig. 1.

3.2 Sampling process in C2C e-commerce fraud detection model
Figure 2. Generation of samples for classification

Figure 2 depicts the sampling model used in C2C e-commerce fraud detection to tackle the class imbalance problem. The model is a combination of random sampling and under-sampling. As introduced in Section 3.1, the NB classifier is generated to capture the distribution of the sample data, and the C4.5 classifier is generated to obtain more accurate decision tree rules. The training set for the NB classifier is therefore drawn from the transaction data by random sampling, and the sub-training sets for the combined C4.5 classifier are obtained by under-sampling that training set. In under-sampling, the majority sample is randomly under-sampled, and the union of the sub-samples accounts for the whole majority sample.

Initially, the data set is separated by random sampling into two sub data sets: a training set and a testing set. The testing set is used to test the classification performance of the model, and the training set is input to the NB classifier to generate the NB classification model. The training set is then split into two parts: the fraud sample and the non-fraud sample. After that, non-fraud sub-samples are randomly under-sampled from the majority (non-fraud) sample in such
a way that the ratio of fraud to non-fraud is approximately 1. Each non-fraud sub-sample is then combined with the fraud sample to generate a sub-training set for the combined classifier based on AdaBoost&C4.5.

4. Experiment and analysis
4.1 Experimental Data Sets and Metrics
We choose cell phones as the study object because fraud involving cell phones is more serious than fraud involving other commodities in China, and we collect cell phone data from Taobao from December 2011 to March. The fraud behaviors considered in this study are mainly misrepresentation of items, fee stacking and non-delivery of items. During data filtering and cleaning, blank and incorrect records were deleted. The final data set is composed of transaction records, among which are fraud transactions, and the number of non-fraud transactions is much larger than that of fraud cases. The data set is transformed into a simpler data set that is more suitable for learning, so that the data is more meaningful and more easily handled. The under-sampling approach is then applied to the data. In this research the ratio of non-fraud to fraud is approximately 19. We assume that the ratio of non-fraud to fraud is similar in every sub-sample and ensure that it approximately equals 1. Through under-sampling, 19 sub-samples are generated. Each sub-sample of the majority class, together with the minority sample, forms one of the 19 sub-training sets of our study.

The experiments in this paper adopt ten-fold cross-validation. Each data set is divided into ten equal parts, with nine folds used as the training set and the remaining fold as an independent test set. Following [9,16], Positive Accuracy, Negative Accuracy, F-measure and G-mean are used to evaluate the performance of the C2C e-commerce fraud detection model in this paper. Performance metrics are commonly calculated using the confusion matrix.
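
The paper does not spell out the formulas behind the four metrics, so the following sketch assumes the definitions commonly used in the imbalanced-learning literature, with fraud as the positive class: Positive Accuracy = TP / (TP + FN), Negative Accuracy = TN / (TN + FP), F-measure as the harmonic mean of precision and Positive Accuracy, and G-mean as the geometric mean of Positive and Negative Accuracy. The function name `imbalance_metrics` and the choice to return 0.0 for an undefined ratio are assumptions made for illustration.

```python
import math


def _ratio(numerator, denominator):
    """Division that returns 0.0 when the denominator is zero (assumed convention)."""
    return numerator / denominator if denominator else 0.0


def imbalance_metrics(tp, fp, tn, fn):
    """Compute the four metrics from confusion-matrix counts (fraud = positive)."""
    positive_accuracy = _ratio(tp, tp + fn)   # share of fraud cases caught
    negative_accuracy = _ratio(tn, tn + fp)   # share of non-fraud cases kept
    precision = _ratio(tp, tp + fp)           # share of fraud alarms that are right
    f_measure = _ratio(2 * precision * positive_accuracy,
                       precision + positive_accuracy)
    g_mean = math.sqrt(positive_accuracy * negative_accuracy)
    return {
        "positive_accuracy": positive_accuracy,
        "negative_accuracy": negative_accuracy,
        "f_measure": f_measure,
        "g_mean": g_mean,
    }
```

Because G-mean multiplies the two per-class accuracies, a classifier that labels every transaction as non-fraud scores 0 on it even though its overall accuracy on data with a 19:1 ratio would be about 95%, which is why such metrics are preferred over plain accuracy for imbalanced data.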
In this study, True Positives denote fraud transactions that are correctly identified, and True Negatives denote non-fraud transactions that are correctly identified. Similarly, False Positives denote non-fraud transactions incorrectly identified as fraud, and False Negatives denote fraud transactions incorrectly identified as non-fraud [10].

4.2 Experimental Results
The objective of the C2C e-commerce fraud detection model is to identify whether a transaction is fraud or non-fraud. In our study, the model is trained in five different ways: NB classifier (NB), combined classifier based on C4.5 (cc45), combined classifier based on AdaBoost&C4.5 (cadac45), combined classifier based on NB and C4.5 (cnbc45) and combined classifier based on NB and AdaBoost&C4.5 (cnbadac45).

Table 2. Classification performances of five classification algorithms
Performance metrics (%): Positive Accuracy, Negative Accuracy, F-measure, G-mean
Classifiers: NB, cc45, cadac45, cnbc45, cnbadac45

By passing the testing set to the classifiers trained before, we obtain the classification performance of each experiment, which is shown in Table 2. First, the table shows that all five classifiers can indeed detect fraud transactions. Comparing the accuracy of the different classifiers, we see that the C2C e-commerce fraud detection model greatly improves all the accuracies and enhances classification performance. cnbadac45 achieves the best value on every metric except Negative Accuracy. Second, the imbalance problem in C2C e-commerce fraud detection must be resolved. As one can see, the NB classifier has the lowest Positive Accuracy, F-measure and G-mean. But the accuracies of the four C4.5-based classifiers trained on balanced data
are very high and exceed those of NB. Third, we want to determine whether adding AdaBoost to a classifier affects classification performance. After AdaBoost is added to cnbc45, the resulting cnbadac45 performs much better than cnbc45. Finally, the relation between NB and classification accuracy needs to be examined. NB has the worst performance. However, the combined classifiers based on NB and other algorithms show higher accuracies than NB alone. If NB is added to cc45, all accuracies fall significantly. When we choose cnbadac45 in the model, it has the best performance and clearly improves all accuracies compared to cadac45, but it does not clearly improve accuracies compared to cc45.

Of the five classifiers examined, all except NB result in an efficient classifier for detecting fraud transactions. From the classification results of NB, we find that NB is good at identifying the minority class (fraud transactions) and bad at identifying the majority class (non-fraud transactions). To eliminate the bias of the C4.5 classifier toward the majority class in imbalanced data, the sampling approach described in Section 3.2 is used to build the training sets. The training sets for these C4.5 classifiers are balanced data sets generated by under-sampling. All four classifiers based on C4.5 achieve high classification accuracies. This sampling technique not only resolves the imbalance problem but also generates several sub-classifiers whose results are combined into the final classification by majority voting. Given a balanced training set, the performance of a classifier based on C4.5 is relatively good. We also find that adding the AdaBoost algorithm to a classifier can either improve or worsen classification performance.
In our research, cnbadac45 has better overall performance than cnbc45, whereas cadac45 has worse performance than cc45, which indicates that AdaBoost helps a classifier improve its accuracies only when it is used in the right place. Across all classifiers, cnbadac45 exhibits the best classification performance. Removing NB and AdaBoost from it, which gives cc45, results in slightly worse performance, but the difference between the two is very small. Thus cc45 and cnbadac45 are both good at fraud detection in C2C e-commerce, with cnbadac45 the best-performing classifier.

The capability of detecting potential fraud transactions is the most needed capability in C2C e-commerce fraud detection. Another important finding of this research is that NB detects the most previously undetected fraud transactions, because NB can reveal potential undetected abnormalities in large-scale data. By combining NB and cadac45, we obtain the best overall classification performance among the five classifiers studied for identifying both detected and undetected fraud transactions in C2C e-commerce.

5. Conclusion and future work
This study proposes a C2C e-commerce fraud detection model for classifying imbalanced data. Among the five classifiers examined, the model using cnbadac45 offers good accuracies and the best performance in identifying undetected fraud transactions. It not only improves classification performance but also helps to identify potential fraud transactions. Once the fraud detection model is combined with C2C e-commerce systems, fraud transactions can be detected when their patterns do not match normal patterns. Through such prevention, fraud transactions can be stopped before more losses occur, and customer satisfaction will improve.
In the end, more and more legitimate customers take part in e-commerce because the market is safe and stable, and fraudsters are excluded from the market. Along with the development of e-commerce, more and more new customers and transactions enter the C2C market. Previously identified patterns will become ineffective, so C2C e-commerce systems should carry out model self-learning to adjust the patterns for continuous fraud detection.

Our study also opens up several directions for future research in fraud detection. This study conducts experiments only for the algorithms selected in the C2C e-commerce fraud detection model. In the future, more studies are needed that use different classification algorithms to detect fraud in C2C e-commerce. For instance, future studies could consider genetic algorithms, BP neural networks and so on. They could also improve the architecture of the C2C e-commerce fraud detection model and evaluate the resulting performance. In this study, we only collect data on cell phones; the model can be applied to other commodities to examine its feasibility and effectiveness. Furthermore, we just
focus on binary classification in C2C e-commerce, whereas there exist several types of fraud, and future studies could examine whether the results of this study still hold for them.

6. Acknowledgement
The research is supported by the National Natural Science Foundation of China under Grant No , the College Students Scientific Research and Undertaking Starting Action Project under Grant No. PXM2012_014213_000067, Research Foundation for Youth Scholars of Beijing Technology and Business University No. QNJ and Scientific Research Common Program of Beijing Municipal Commission of Education No. KM

References
[1] D. Teeni, D.R. Young, The changing role of nonprofits in the network economy, Nonprofit and Voluntary Sector Quarterly, vol. 32, no. 3, pp ,
[2] Z.H. Zhou, X.Y. Liu, Training cost-sensitive neural networks with methods addressing the class imbalance problem, IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 1, pp. 63-77,
[3] M. Weatherford, Mining for Fraud, IEEE Intelligent Systems, vol. 17, no. 4, pp. 4-6,
[4] J.T.S. Quah, M. Sriganesh, Real-time credit card fraud detection using computational intelligence, Expert Systems with Applications, vol. 35, no. 4, pp , 2008.
[5] Seokjoo Andrew Chang, Forensic Data Pattern Analysis using Information Entropy, International Journal on Data Mining and Intelligent Information Technology Applications, vol. 2, no. 2, pp. 12-20,
[6] A. Lin, J. Foster, S. Wang, Understanding the factors that influence acceptance of online auction platforms: a comparative study of Taobao and eBay Eachnet, International Journal of Business and Systems Research, vol. 3, no. 2, pp ,
[7] P.N. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining, Addison Wesley, USA, 2005.
[8] J.J. Wu, H. Xiong, J. Chen, COG: Local Decomposition for Rare Class Analysis, Data Mining and Knowledge Discovery, vol. 20, no. 2, pp ,
[9] H. He, E. A. Garcia, Learning from Imbalanced Data, IEEE Transactions on Knowledge and Data Engineering, vol. 21, no.
9, pp ,
[10] A.R. Sinha, H.M. Zhao, Incorporating domain knowledge into data mining classifiers: An application in indirect lending, Decision Support Systems, vol. 46, no. 1, pp ,
[11] X. Luo, Q. Zhu, Cost-sensitive ensemble via Adaptive Weighted Cost Proportionate Sampling, International Journal of Digital Content Technology and its Applications, vol. 5, no. 7, pp ,
[12] H.T. Xiong, Y. Yang, S.X. Zhao, Local Clustering Ensemble Learning Method based on Improved AdaBoost for Rare Class Analysis, Journal of Computational Information Systems, vol. 8, no. 4, pp ,
[13] G. Batista, R.C. Prati, M.C. Monard, A study of the behavior of several methods for balancing machine learning training data, SIGKDD Explorations, vol. 6, no. 1, pp. 20-29,
[14] M.C. Chen, L.S. Chen, C.C. Hsu, W.R. Zeng, An information granulation based data mining approach for classifying imbalanced data, Information Sciences, vol. 178, no. 16, pp ,
[15] X.Y. Liu, J. Wu, Z.H. Zhou, Exploratory Undersampling for Class-Imbalance Learning, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 39, no. 2, pp ,
[16] X. Li, Research on online trading customer classification based on customer characteristics and behaviors, International Journal of Digital Content Technology and its Applications, vol. 6, no. 10, pp ,
Random forest algorithm in big data environment
Random forest algorithm in big data environment Yingchun Liu * School of Economics and Management, Beihang University, Beijing 100191, China Received 1 September 2014, www.cmnt.lv Abstract Random forest
Comparison of Data Mining Techniques used for Financial Data Analysis
Comparison of Data Mining Techniques used for Financial Data Analysis Abhijit A. Sawant 1, P. M. Chawan 2 1 Student, 2 Associate Professor, Department of Computer Technology, VJTI, Mumbai, INDIA Abstract
Constrained Classification of Large Imbalanced Data by Logistic Regression and Genetic Algorithm
Constrained Classification of Large Imbalanced Data by Logistic Regression and Genetic Algorithm Martin Hlosta, Rostislav Stríž, Jan Kupčík, Jaroslav Zendulka, and Tomáš Hruška A. Imbalanced Data Classification
Financial Statement Fraud Detection: An Analysis of Statistical and Machine Learning Algorithms
Financial Statement Fraud Detection: An Analysis of Statistical and Machine Learning Algorithms Johan Perols Assistant Professor University of San Diego, San Diego, CA 92110 [email protected] April
Data Mining - Evaluation of Classifiers
Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010
Social Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
Data Mining. Nonlinear Classification
Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15
Using Random Forest to Learn Imbalanced Data
Using Random Forest to Learn Imbalanced Data Chao Chen, [email protected] Department of Statistics,UC Berkeley Andy Liaw, andy [email protected] Biometrics Research,Merck Research Labs Leo Breiman,
Analyzing PETs on Imbalanced Datasets When Training and Testing Class Distributions Differ
Analyzing PETs on Imbalanced Datasets When Training and Testing Class Distributions Differ David Cieslak and Nitesh Chawla University of Notre Dame, Notre Dame IN 46556, USA {dcieslak,nchawla}@cse.nd.edu
Chapter 6. The stacking ensemble approach
82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described
Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification
Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification Tina R. Patil, Mrs. S. S. Sherekar Sant Gadgebaba Amravati University, Amravati [email protected], [email protected]
A Hybrid Approach to Learn with Imbalanced Classes using Evolutionary Algorithms
Proceedings of the International Conference on Computational and Mathematical Methods in Science and Engineering, CMMSE 2009 30 June, 1 3 July 2009. A Hybrid Approach to Learn with Imbalanced Classes using
AUTO CLAIM FRAUD DETECTION USING MULTI CLASSIFIER SYSTEM
AUTO CLAIM FRAUD DETECTION USING MULTI CLASSIFIER SYSTEM ABSTRACT Luis Alexandre Rodrigues and Nizam Omar Department of Electrical Engineering, Mackenzie Presbiterian University, Brazil, São Paulo [email protected],[email protected]
Data Mining Practical Machine Learning Tools and Techniques
Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea
CLASS imbalance learning refers to a type of classification
IEEE TRANSACTIONS ON SYSTEMS, MAN AND CYBERNETICS, PART B Multi-Class Imbalance Problems: Analysis and Potential Solutions Shuo Wang, Member, IEEE, and Xin Yao, Fellow, IEEE Abstract Class imbalance problems
Identifying At-Risk Students Using Machine Learning Techniques: A Case Study with IS 100
Identifying At-Risk Students Using Machine Learning Techniques: A Case Study with IS 100 Erkan Er Abstract In this paper, a model for predicting students performance levels is proposed which employs three
Knowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right
DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES
DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES Vijayalakshmi Mahanra Rao 1, Yashwant Prasad Singh 2 Multimedia University, Cyberjaya, MALAYSIA 1 [email protected]
Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05
Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification
ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA
ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA D.Lavanya 1 and Dr.K.Usha Rani 2 1 Research Scholar, Department of Computer Science, Sree Padmavathi Mahila Visvavidyalayam, Tirupati, Andhra Pradesh,
On the application of multi-class classification in physical therapy recommendation
RESEARCH Open Access On the application of multi-class classification in physical therapy recommendation Jing Zhang 1, Peng Cao 1, Douglas P Gross 2 and Osmar R Zaiane 1* Abstract Recommending optimal rehabilitation
Direct Marketing When There Are Voluntary Buyers
Direct Marketing When There Are Voluntary Buyers Yi-Ting Lai and Ke Wang Simon Fraser University {llai2, wangk}@cs.sfu.ca Daymond Ling, Hua Shi, and Jason Zhang Canadian Imperial Bank of Commerce {Daymond.Ling,
Random Forest Based Imbalanced Data Cleaning and Classification
Random Forest Based Imbalanced Data Cleaning and Classification Jie Gu Software School of Tsinghua University, China Abstract. The given task of PAKDD 2007 data mining competition is a typical problem
An Introduction to Data Mining
An Introduction to Intel Beijing [email protected] January 17, 2014 Outline 1 DW Overview What is Notable Application of Conference, Software and Applications Major Process in 2 Major Tasks in Detail
E-commerce Transaction Anomaly Classification
E-commerce Transaction Anomaly Classification Minyong Lee [email protected] Seunghee Ham [email protected] Qiyi Jiang [email protected] I. INTRODUCTION Due to the increasing popularity of e-commerce
Class Imbalance Learning in Software Defect Prediction
Class Imbalance Learning in Software Defect Prediction Dr. Shuo Wang [email protected] University of Birmingham Research keywords: ensemble learning, class imbalance learning, online learning Shuo Wang
CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.
CI6227: Data Mining Lesson 11b: Ensemble Learning Sinno Jialin PAN Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore Acknowledgements: slides are adapted from the lecture notes
Fraud Detection for Online Retail using Random Forests
Fraud Detection for Online Retail using Random Forests Eric Altendorf, Peter Brende, Josh Daniel, Laurent Lessard Abstract As online commerce becomes more common, fraud is an increasingly important concern.
Applied Mathematical Sciences, Vol. 7, 2013, no. 112, 5591-5597 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ams.2013.
Applied Mathematical Sciences, Vol. 7, 2013, no. 112, 5591-5597 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ams.2013.38457 Accuracy Rate of Predictive Models in Credit Screening Anirut Suebsing
III. DATA SETS. Training the Matching Model
A Machine-Learning Approach to Discovering Company Home Pages Wojciech Gryc Oxford Internet Institute University of Oxford Oxford, UK OX1 3JS Email: [email protected] Prem Melville IBM T.J. Watson
REVIEW OF ENSEMBLE CLASSIFICATION
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320-088X IJCSMC, Vol. 2, Issue.
Data Mining Algorithms Part 1. Dejan Sarka
Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka ([email protected]) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses
Enhanced Boosted Trees Technique for Customer Churn Prediction Model
IOSR Journal of Engineering (IOSRJEN) ISSN (e): 2250-3021, ISSN (p): 2278-8719 Vol. 04, Issue 03 (March. 2014), V5 PP 41-45 www.iosrjen.org Enhanced Boosted Trees Technique for Customer Churn Prediction
How To Solve The Kd Cup 2010 Challenge
A Lightweight Solution to the Educational Data Mining Challenge Kun Liu Yan Xing Faculty of Automation Guangdong University of Technology Guangzhou, 510090, China [email protected] [email protected]
Comparison of K-means and Backpropagation Data Mining Algorithms
Comparison of K-means and Backpropagation Data Mining Algorithms Nitu Mathuriya, Dr. Ashish Bansal Abstract Data mining has got more and more mature as a field of basic research in computer science and
Evaluation of Feature Selection Methods for Predictive Modeling Using Neural Networks in Credits Scoring
714 Evaluation of Feature Selection Methods for Predictive Modeling Using Neural Networks in Credits Scoring Raghavendra B. K. Dr. M.G.R. Educational and Research Institute, Chennai-95 Email: [email protected]
Distributed forests for MapReduce-based machine learning
Distributed forests for MapReduce-based machine learning Ryoji Wakayama, Ryuei Murata, Akisato Kimura, Takayoshi Yamashita, Yuji Yamauchi, Hironobu Fujiyoshi Chubu University, Japan. NTT Communication
Impact of Feature Selection on the Performance of Wireless Intrusion Detection Systems
2009 International Conference on Computer Engineering and Applications IPCSIT vol.2 (2011) © (2011) IACSIT Press, Singapore Impact of Feature Selection on the Performance of Wireless Intrusion Detection Systems
IDENTIFICATION OF SOFTWARE EROSION USING LOGISTIC REGRESSION
http:// IDENTIFICATION OF SOFTWARE EROSION USING LOGISTIC REGRESSION Harinder Kaur 1, Raveen Bajwa 2 1 PG Student., CSE., Baba Banda Singh Bahadur Engg. College, Fatehgarh Sahib, (India) 2 Asstt. Prof.,
Data Mining Framework for Direct Marketing: A Case Study of Bank Marketing
www.ijcsi.org 198 Data Mining Framework for Direct Marketing: A Case Study of Bank Marketing Lilian Sing'oei 1 and Jiayang Wang 2 1 School of Information Science and Engineering, Central South University
CS570 Data Mining Classification: Ensemble Methods
CS570 Data Mining Classification: Ensemble Methods Cengiz Günay Dept. Math & CS, Emory University Fall 2013 Some slides courtesy of Han-Kamber-Pei, Tan et al., and Li Xiong Günay (Emory) Classification:
Data Mining Methods: Applications for Institutional Research
Data Mining Methods: Applications for Institutional Research Nora Galambos, PhD Office of Institutional Research, Planning & Effectiveness Stony Brook University NEAIR Annual Conference Philadelphia 2014
Sanjeev Kumar. contribute
RESEARCH ISSUES IN DATA MINING Sanjeev Kumar I.A.S.R.I., Library Avenue, Pusa, New Delhi-110012 [email protected] 1. Introduction The field of data mining and knowledge discovery is emerging as a
SVM Ensemble Model for Investment Prediction
19 SVM Ensemble Model for Investment Prediction Chandra J, Assistant Professor, Department of Computer Science, Christ University, Bangalore Siji T. Mathew, Research Scholar, Christ University, Dept of
A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE
A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE Kasra Madadipouya 1 1 Department of Computing and Science, Asia Pacific University of Technology & Innovation ABSTRACT Today, enormous amount of data
Welcome. Data Mining: Updates in Technologies. Xindong Wu. Colorado School of Mines Golden, Colorado 80401, USA
Welcome Xindong Wu Data Mining: Updates in Technologies Dept of Math and Computer Science Colorado School of Mines Golden, Colorado 80401, USA Email: xwu@mines.edu Home Page: http://kais.mines.edu/~xwu/
Introduction to Data Mining
Introduction to Data Mining Jay Urbain Credits: Nazli Goharian & David Grossman @ IIT Outline Introduction Data Pre-processing Data Mining Algorithms Naïve Bayes Decision Tree Neural Network Association
INTERNATIONAL JOURNAL FOR ENGINEERING APPLICATIONS AND TECHNOLOGY DATA MINING IN HEALTHCARE SECTOR. [email protected]
IJFEAT INTERNATIONAL JOURNAL FOR ENGINEERING APPLICATIONS AND TECHNOLOGY DATA MINING IN HEALTHCARE SECTOR Bharti S. Takey 1, Ankita N. Nandurkar 2, Ashwini A. Khobragade 3, Pooja G. Jaiswal 4, Swapnil R.
Predicting the Risk of Heart Attacks using Neural Network and Decision Tree
Predicting the Risk of Heart Attacks using Neural Network and Decision Tree S.Florence 1, N.G.Bhuvaneswari Amma 2, G.Annapoorani 3, K.Malathi 4 PG Scholar, Indian Institute of Information Technology, Srirangam,
On the effect of data set size on bias and variance in classification learning
On the effect of data set size on bias and variance in classification learning Abstract Damien Brain Geoffrey I Webb School of Computing and Mathematics Deakin University Geelong Vic 3217 With the advent
Predicting Student Performance by Using Data Mining Methods for Classification
BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 13, No 1 Sofia 2013 Print ISSN: 1311-9702; Online ISSN: 1314-4081 DOI: 10.2478/cait-2013-0006 Predicting Student Performance
Operations Research and Knowledge Modeling in Data Mining
Operations Research and Knowledge Modeling in Data Mining Masato KODA Graduate School of Systems and Information Engineering University of Tsukuba, Tsukuba Science City, Japan 305-8573 [email protected]
An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015
An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content
EFFICIENT DATA PRE-PROCESSING FOR DATA MINING
EFFICIENT DATA PRE-PROCESSING FOR DATA MINING USING NEURAL NETWORKS JothiKumar.R 1, Sivabalan.R.V 2 1 Research scholar, Noorul Islam University, Nagercoil, India Assistant Professor, Adhiparasakthi College
DATA PREPARATION FOR DATA MINING
Applied Artificial Intelligence, 17:375-381, 2003 Copyright © 2003 Taylor & Francis 0883-9514/03 $12.00 +.00 DOI: 10.1080/08839510390219264 DATA PREPARATION FOR DATA MINING SHICHAO ZHANG and CHENGQI
ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION
ISSN 9 X INFORMATION TECHNOLOGY AND CONTROL, 00, Vol., No.A ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION Danuta Zakrzewska Institute of Computer Science, Technical
HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION
HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION Chihli Hung 1, Jing Hong Chen 2, Stefan Wermter 3, 1,2 Department of Management Information Systems, Chung Yuan Christian University, Taiwan
Choosing the Best Classification Performance Metric for Wrapper-based Software Metric Selection for Defect Prediction
Choosing the Best Classification Performance Metric for Wrapper-based Software Metric Selection for Defect Prediction Huanjing Wang Western Kentucky University [email protected] Taghi M. Khoshgoftaar
Benchmarking of different classes of models used for credit scoring
Benchmarking of different classes of models used for credit scoring We use this competition as an opportunity to compare the performance of different classes of predictive models. In particular we want
Customer Classification And Prediction Based On Data Mining Technique
Customer Classification And Prediction Based On Data Mining Technique Ms. Neethu Baby 1, Mrs. Priyanka L.T 2 1 M.E CSE, Sri Shakthi Institute of Engineering and Technology, Coimbatore 2 Assistant Professor
International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014
RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer
Advanced Ensemble Strategies for Polynomial Models
Advanced Ensemble Strategies for Polynomial Models Pavel Kordík 1, Jan Černý 2 1 Dept. of Computer Science, Faculty of Information Technology, Czech Technical University in Prague, 2 Dept. of Computer
Electronic Payment Fraud Detection Techniques
World of Computer Science and Information Technology Journal (WCSIT) ISSN: 2221-0741 Vol. 2, No. 4, 137-141, 2012 Electronic Payment Fraud Detection Techniques Adnan M. Al-Khatib CIS Dept. Faculty of Information
EFFICIENCY OF DECISION TREES IN PREDICTING STUDENT'S ACADEMIC PERFORMANCE
EFFICIENCY OF DECISION TREES IN PREDICTING STUDENT'S ACADEMIC PERFORMANCE S. Anupama Kumar 1 and Dr. Vijayalakshmi M.N 2 1 Research Scholar, PRIST University, 1 Assistant Professor, Dept of M.C.A. 2 Associate
Addressing the Class Imbalance Problem in Medical Datasets
Addressing the Class Imbalance Problem in Medical Datasets M. Mostafizur Rahman and D. N. Davis the size of the training set is significantly increased [5]. If the time taken to resample is not considered,
An Overview of Knowledge Discovery Database and Data mining Techniques
An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,
Data quality in Accounting Information Systems
Data quality in Accounting Information Systems Comparing Several Data Mining Techniques Erjon Zoto Department of Statistics and Applied Informatics Faculty of Economy, University of Tirana Tirana, Albania
Decision Support Systems
Decision Support Systems 50 (2011) 602-613 Contents lists available at ScienceDirect Decision Support Systems journal homepage: www.elsevier.com/locate/dss Data mining for credit card fraud: A comparative
A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier
A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier G.T. Prasanna Kumari Associate Professor, Dept of Computer Science and Engineering, Gokula Krishna College of Engg, Sullurpet-524121,
DATA MINING TECHNIQUES AND APPLICATIONS
DATA MINING TECHNIQUES AND APPLICATIONS Mrs. Bharati M. Ramageri, Lecturer Modern Institute of Information Technology and Research, Department of Computer Application, Yamunanagar, Nigdi Pune, Maharashtra,
To improve the problems mentioned above, Chen et al. [2-5] proposed and employed a novel type of approach, i.e., PA, to prevent fraud.
Proceedings of the 5th WSEAS Int. Conference on Information Security and Privacy, Venice, Italy, November 20-22, 2006 46 Back Propagation Networks for Credit Card Fraud Prediction Using Stratified Personalized
Expert Systems with Applications
Expert Systems with Applications 36 (2009) 5445-5449 Contents lists available at ScienceDirect Expert Systems with Applications journal homepage: www.elsevier.com/locate/eswa Customer churn prediction
Customer Relationship Management using Adaptive Resonance Theory
Customer Relationship Management using Adaptive Resonance Theory Manjari Anand M.Tech.Scholar Zubair Khan Associate Professor Ravi S. Shukla Associate Professor ABSTRACT CRM is a kind of implemented model
Network Anomaly Detection: A Machine Learning Perspective
Network Anomaly Detection A Machine Learning Perspective Dhruba Kumar Bhattacharyya Jugal Kumar Kalita CRC Press Taylor & Francis Group Boca Raton London New York CRC Press is an imprint of the Taylor
Experiments in Web Page Classification for Semantic Web
Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address
Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing and Developing E-mail Classifier
International Journal of Recent Technology and Engineering (IJRTE) ISSN: 2277-3878, Volume-1, Issue-6, January 2013 Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing
Using One-Versus-All classification ensembles to support modeling decisions in data stream mining
Using One-Versus-All classification ensembles to support modeling decisions in data stream mining Patricia E.N. Lutu Department of Computer Science, University of Pretoria, South Africa [email protected]
An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset
An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset Peng Liu 1, Elia El-Darzi 2, Lei Lei 1, Christos Vasilakis 2, Panagiotis Chountas 2, and Wei Huang
Database Marketing, Business Intelligence and Knowledge Discovery
Database Marketing, Business Intelligence and Knowledge Discovery Note: Using material from Tan / Steinbach / Kumar (2005) Introduction to Data Mining,, Addison Wesley; and Cios / Pedrycz / Swiniarski
Data Mining for Knowledge Management. Classification
1 Data Mining for Knowledge Management Classification Themis Palpanas University of Trento http://disi.unitn.eu/~themis Data Mining for Knowledge Management 1 Thanks for slides to: Jiawei Han Eamonn Keogh
A Comparative Study on Sentiment Classification and Ranking on Product Reviews
A Comparative Study on Sentiment Classification and Ranking on Product Reviews C.EMELDA Research Scholar, PG and Research Department of Computer Science, Nehru Memorial College, Putthanampatti, Bharathidasan
Machine Learning. Chapter 18, 21. Some material adopted from notes by Chuck Dyer
Machine Learning Chapter 18, 21 Some material adopted from notes by Chuck Dyer What is learning? Learning denotes changes in a system that... enable a system to do the same task more efficiently the next
Introduction to Data Mining Techniques
Introduction to Data Mining Techniques Dr. Rajni Jain 1 Introduction The last decade has experienced a revolution in information availability and exchange via the internet. In the same spirit, more and
How To Identify A Churner
2012 45th Hawaii International Conference on System Sciences A New Ensemble Model for Efficient Churn Prediction in Mobile Telecommunication Namhyoung Kim, Jaewook Lee Department of Industrial and Management
A Secured Approach to Credit Card Fraud Detection Using Hidden Markov Model
A Secured Approach to Credit Card Fraud Detection Using Hidden Markov Model Twinkle Patel, Ms. Ompriya Kale Abstract: - As the usage of credit card has increased the credit card fraud has also increased
Insurance Analytics - data analysis and predictive modeling in insurance. Pavel Kříž. Actuarial Sciences Seminar, MFF, 4.
Insurance Analytics - data analysis and predictive modeling in insurance Pavel Kříž Actuarial Sciences Seminar, MFF, 4 April 2014 Summary 1. Application areas of Insurance Analytics 2. Insurance Analytics
COPYRIGHTED MATERIAL. Contents. List of Figures. Acknowledgments
Contents List of Figures Foreword Preface Acknowledgments Chapter 1 Fraud: Detection, Prevention, and Analytics! Introduction Fraud! Fraud Detection and Prevention Big Data for
Beating the NCAA Football Point Spread
Beating the NCAA Football Point Spread Brian Liu Mathematical & Computational Sciences Stanford University Patrick Lai Computer Science Department Stanford University December 10, 2010 1 Introduction Over
Chapter 20: Data Analysis
Chapter 20: Data Analysis Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 20: Data Analysis Decision Support Systems Data Warehousing Data Mining Classification
Learning with Skewed Class Distributions
CADERNOS DE COMPUTAÇÃO XX (2003) Learning with Skewed Class Distributions Maria Carolina Monard and Gustavo E.A.P.A. Batista Laboratory of Computational Intelligence LABIC Department of Computer Science
New Ensemble Combination Scheme
New Ensemble Combination Scheme Namhyoung Kim, Youngdoo Son, and Jaewook Lee, Member, IEEE Abstract Recently many statistical learning techniques are successfully developed and used in several areas. However,
Decision Trees from large Databases: SLIQ
Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values
Research on the UHF RFID Channel Coding Technology based on Simulink
Vol. 6, No. 7, 015 Research on the UHF RFID Channel Coding Technology based on Simulink Changzhi Wang Shanghai 0160, China Zhicai Shi* Shanghai 0160, China Dai Jian Shanghai 0160, China Li Meng Shanghai
Data Mining Classification: Decision Trees
Data Mining Classification: Decision Trees Classification Decision Trees: what they are and how they work Hunt's (TDIDT) algorithm How to select the best split How to handle Inconsistent data Continuous
Roulette Sampling for Cost-Sensitive Learning
Roulette Sampling for Cost-Sensitive Learning Victor S. Sheng and Charles X. Ling Department of Computer Science, University of Western Ontario, London, Ontario, Canada N6A 5B7 {ssheng,cling}@csd.uwo.ca
On the application of multi-class classification in physical therapy recommendation
On the application of multi-class classification in physical therapy recommendation Jing Zhang 1, Douglas Gross 2, and Osmar R. Zaiane 1 1 Department of Computing Science, 2 Department of Physical Therapy,
