Data Mining Techniques in Fraud Detection
Rekha Bhowmik, University of Texas at Dallas

ABSTRACT

The paper presents the application of data mining techniques to fraud analysis. We present some classification and prediction data mining techniques which we consider important for handling fraud detection. There exist a number of data mining algorithms; we present a statistics-based algorithm, a decision tree-based algorithm, and a rule-based algorithm. We present a Bayesian classification model to detect fraud in automobile insurance. Naïve Bayesian visualization is selected to analyze and interpret the classifier predictions. We illustrate how ROC curves can be deployed for model assessment in order to provide a more intuitive analysis of the models.

Keywords: Data Mining, Decision Tree, Bayesian Network, ROC Curve, Confusion Matrix

1. INTRODUCTION

Data mining refers to extracting or mining knowledge from large amounts of data. There are a number of data mining techniques, such as clustering, neural networks, regression, and multiple predictive models. Here, we discuss only a few techniques of data mining that we consider important for handling fraud detection. They are i) the Bayesian network, for classifying risk groups, and ii) the decision tree, for creating a descriptive model of each risk group. Data mining is associated with (a) supervised learning based on training data of known fraud and legitimate cases and (b) unsupervised learning with data that are not labeled as fraud or legitimate. Benford's law can be interpreted as an example of unsupervised learning (Bolton et al. 2002). The direct application of these methods to forensic accounting is limited due to the almost complete nonexistence of large sets of fraud training data (Bolton et al. 2002; Jensen, 1997). Insurance fraud, credit card fraud, telecommunications fraud, and check forgery are some of the main types of fraud. Insurance fraud is common in automobile and travel insurance. The Uniform Suspected Insurance Fraud Reporting Form, adopted by the NAIC Antifraud Task Force in 2003, replaced the prior Task Force form. This form standardizes insurance fraud data for the insurance industry and makes it easier to report and track. Fraud detection involves three types of offenders (Baldock, 1997): i) criminal offenders, ii) organized criminal offenders who are responsible for major fraud, and iii) offenders who commit fraud (called soft fraud) when suffering from financial hardship.
Soft fraud is the hardest to lessen because the cost of investigating each suspected incident is usually higher than the cost of the fraud itself (National White Collar Crime Center, 2003). Offenders of types i) and ii), who commit what is called hard fraud, actively evade anti-fraud measures (Sparrow, 2002).

We present the data mining techniques which are most appropriate for fraud analysis, using an automobile insurance example. The three data mining techniques used for fraud analysis are: i) Bayesian network, ii) decision tree, and iii) backpropagation. The Bayesian network is the technique used for the classification task. Classification, given a set of predefined categorical classes, determines which of these classes a specific data instance belongs to. Decision trees are used to create descriptive models; descriptive models are created to describe the characteristics of fraud.

The remainder of this paper is organized as follows. In Section 2, we present the existing fraud detection systems and techniques. Section 3 presents the three classification algorithms and their application. Section 4 evaluates model performance. Finally, in Section 5, we discuss the important features of our work and directions for further work.

2. EXISTING FRAUD DETECTION SYSTEMS

A fuzzy logic system (Altrock et al. 1995) incorporated the actual fraud evaluation policy using optimum threshold values. The result showed the likelihood of fraud and the reasons why an insurance claim is fraudulent. This system predicted slightly better results than the auditors. Another fuzzy logic system (Cox et al. 1995) used two approaches to imitate the reasoning of fraud experts: i) the discovery model, which uses an unsupervised neural network to find the relationships in data and to find clusters, after which patterns within the clusters are identified, and ii) the fuzzy anomaly detection model, which used the Wang-Mendel algorithm to find how health care providers committed fraud against insurance companies. The EFD system (Major et al. 1995) integrated expert knowledge with statistical information to identify providers whose behavior did not fit the rules. The hot spots methodology (Williams et al. 1997) performed a three-step process: i) the k-means clustering algorithm is used for cluster detection, because other clustering algorithms tend to be computationally expensive when the datasets are very large, ii) the C4.5 algorithm is applied, and the resulting decision tree can be converted to a rule set and pruned, and iii) visualization tools are used for rule evaluation, building statistical summaries of the entities associated with each rule. (Williams, 1999) extended the hot spots methodology to use genetic algorithms to generate and explore the rules. The credit fraud model (Groth et al. 1998) suggested a classification technique with a fraud/legal attribute, and clustering followed by a classification technique with no fraud/legal attribute. Kohonen's Self-Organizing Feature Map (Brockett et al. 1998) was used to categorize automobile injury claims by the degree of fraud suspicion. The validity of the Feature Map was then evaluated using a backpropagation algorithm and feed-forward neural networks. Results showed that the method was more reliable and consistent than the existing fraud assessment.
Classification techniques have proved to be very effective in fraud detection (He et al. 1998; Chen et al. 1999) and can therefore be applied to categorize crime data. The distributed data mining model (Chan et al. 1999) uses a realistic cost model to evaluate C4.5, CART, and naïve Bayesian classification models; the method was applied to credit card transactions. The neural data mining approach (Brause et al. 1999) uses rule-based association rules to mine symbolic data and a Radial Basis Function neural network to mine analog data. The approach shows the importance of using non-numeric data in fraud detection; it was found that the use of association rules increased the predictive accuracy. SAS Enterprise Miner software (SAS e-intelligence, 2000) depends on association rules, cluster detection and classification techniques to detect fraudulent claims. The Bayesian Belief Network (BBN) and Artificial Neural Network (ANN) study used the STAGE algorithm for the BBN and backpropagation for the ANN in fraud detection (Maes et al. 2002). STAGE repeatedly alternates between two stages of search: running the original search method on the objective function, and running hill-climbing to optimize the value function. The results show that BBNs were much faster to train, but were slower when applied to new instances. FraudFocus software (Magnify, 2002) automatically scores all claims; the scores are sorted in descending order of fraud potential, and descriptive rules are generated for fraudulent claims. FairIsaac (Weatherford et al. 2002) recommended backpropagation neural networks for detecting fraudulent credit card use. The ASPECT group (Weatherford et al. 2002) focused on neural networks trained on current user profiles and user profile histories; a caller's current profile and profile history are compared to find probable fraud. (Cahill et al. 2002) build on the adaptive fraud detection framework (Fawcett et al. 1997) by applying an event-driven approach that assigns fraud scores to detect fraud. The (Cahill et al. 2002) framework can also detect types of fraud using rules, and it has been used in both wireless and wired fraud detection systems. (Ormerod et al. 2003) used dynamic BBNs, called the Mass Detection Tool, to detect fraudulent claims, which then fed a rule generator called the Suspicion Building Tool.

The different types of fraud detection are: internal, insurance, credit card, and telecommunications fraud detection. Internal fraud detection consists of determining fraudulent financial reporting by management (Lin et al. 2003; Bell et al. 2000) and abnormal retail transactions by employees (Kim et al. 2003). There are four types of insurance fraud detection: home insurance (Bentley, 2000; Von Altrock, 1997), crop insurance (Little et al. 2002), automobile insurance (Phua et al. 2004; Viaene et al. 2004; Brockett et al. 2002; Stefano et al. 2001; Belhadji et al. 2000), and health insurance (Yamanishi et al. 2004; Riedinger et al. 2002). A single meta-classifier (Phua et al. 2004) is used to select the best base classifiers and is then combined with these base classifiers' predictions to improve cost savings (stacking-bagging); an automobile insurance fraud detection data set was used to demonstrate the stacking-bagging approach. Credit card fraud detection refers to screening credit applications (Wheeler et al. 2000) and/or logged credit card transactions (Foster et al. 2004; Fan, 2004; Chen et al. 2004; Chiu et al. 2004; Kim et al. 2002; Maes et al. 2002; Syeda et al. 2002).
In telecommunications fraud detection, subscription data (Cortes et al. 2003; Cahill et al. 2002; Rosset et al. 1999; Moreau et al. 1997) and/or wired and wireless phone calls (Kim et al. 2003; Burge et al. 2001) are monitored. Credit transactional fraud detection has been presented by (Foster et al. 2004), and bad debt prediction by (Ezawa et al. 1996). Other monitored domains include employee/retail fraud (Kim et al. 2003), national crop insurance fraud (Little et al. 2002), and credit application fraud (Wheeler et al. 2000). The literature also covers video-on-demand websites (Barse et al. 2003) and IP-based telecommunication services (McGibney et al. 2003). Online sellers (Bhargava et al. 2003) and online buyers (Sherman, 2002) can be monitored by automated systems. Fraud detection in government organisations such as tax (Bonchi et al. 1999) and customs (Shao et al. 2002) has also been reported. We discuss below supervised data mining techniques to detect crime using Bayesian Belief Networks, decision trees, and Artificial Neural Networks.

2.1 Bayesian Belief Networks

Bayesian Belief Networks provide a graphical model of causal relationships on which class membership probabilities (Han et al. 2000) are predicted, so that a given instance can be judged legal or fraudulent (Prodromidis, 1999). Naïve Bayesian classification assumes that the attributes of an instance are independent, given the target attribute (Feelders et al. 2003). The aim is to assign a new instance to the class that has the highest posterior probability. The algorithm is very effective and can give better predictive accuracy when compared to C4.5 decision trees and backpropagation (Domingos et al. 1996; Elkan et al. 2001). However, when the attributes are redundant, the predictive accuracy is reduced (Witten et al. 1999).

2.2 Decision Trees

Decision trees are machine learning techniques that express independent attributes and a dependent attribute in a tree-shaped structure that represents a set of decisions (Witten et al. 1999). Classification rules, extracted from decision trees, are IF-THEN expressions in which the preconditions are logically ANDed and all the tests have to succeed for a rule to fire. The related applications range from the analysis of instances of drug smuggling, governmental financial transactions (Mena et al. 2003), and customs declaration fraud (Shao et al. 2002) to more serious crimes such as drug-related homicides, serial sex crimes (SPSS, 2003), and homeland security (James et al. 2002; Mena et al. 2003). Data mining methods have been applied to security and criminal detection problems: [Mena, 2003] reviewed the subject (intelligent agents, link analysis, text mining, decision trees, self-organizing maps, machine learning, and neural networks) for security managers, law enforcement investigators, counter-intelligence agents, fraud specialists, and information security analysts. C4.5 (Quinlan et al. 1993) is used to divide data into segments and to generate descriptive classification rules that can be used to classify a new instance.
C4.5 can help to make predictions and to extract crime patterns. It generates rules from trees (Witten et al. 1999) and handles numeric attributes, missing values, pruning, and error-rate estimation. C4.5 performs slightly better than CART and ID3 (Prodromidis, 1999) in terms of predictive accuracy. The learning and classification steps are generally fast (Han et al. 2000). However, performance can decrease when C4.5 is applied to large datasets. C5.0 shows marginal improvements to decision tree induction.

2.3 Artificial Neural Networks

Artificial Neural Networks represent complex mathematical equations, with summations, exponentials, and parameters, that mimic biological neurons (Berry et al. 2000). They have been applied to classify crime instances such as burglary, sexual offences, and known criminals' facial characteristics (Mena et al. 2003b). Backpropagation neural networks can process a large number of instances, tolerate noisy data, and classify patterns on which they have not been trained (Han et al. 2000). They are appropriate where the results of the model are more important than understanding how the model works (Berry et al. 2000). However, backpropagation requires long training times, extensive testing, and tuning of parameters such as the number of hidden neurons and the learning rate (Bigus, 1996).

3. APPLICATION

The steps in crime detection are: i) build classifiers, ii) integrate multiple classifiers, iii) apply an ANN approach to clustering, and iv) use visualization techniques to describe the patterns.

3.1 Bayesian Network

A Bayesian network is a Directed Acyclic Graph in which each node represents a random variable and is associated with the conditional probability of the node given its parents. The model shows each variable in a given domain as a node in the graph and the dependencies between these variables as arcs connecting the respective nodes. That is, all the edges in the graphical model are directed and there are no cycles. For the purpose of fraud detection, we construct two Bayesian networks to describe the behavior of auto insurance drivers. First, a Bayesian network is constructed to model behavior under the assumption that the driver is fraudulent (F), and another under the assumption that the driver is a legitimate user (NF); see Figure 3. The fraud net is set up by using expert knowledge. The user net is set up by using data from non-fraudulent drivers. During operation, the user net is adapted to a specific user based on emerging data. By inserting evidence in these networks (the observed user behavior x derived from his toll tickets) and propagating it through the network, we can get the probability of the measurement x under the two above-mentioned hypotheses. This means we obtain judgments of the degree to which an observed user behavior matches typical fraudulent or non-fraudulent behavior. We call these quantities p(x|NF) and p(x|F). By postulating the prior probability of fraud P(F), with P(NF) = 1 - P(F), and by applying Bayes' rule, we get the probability of fraud given
the measurement x:

P(F|x) = P(F) p(x|F) / p(x)

where the denominator p(x) can be calculated as

p(x) = P(F) p(x|F) + P(NF) p(x|NF)

The chain rule of probabilities is:

P(X1, X2, ..., Xn) = P(X1) P(X2|X1) ... P(Xn|X1, ..., Xn-1)

Suppose there are two classes, C1 and C2, for fraud and legal respectively. Given an instance X = (x1, x2, ..., xn), where each row is represented by an attribute vector A = (A1, A2, ..., An), the classification task is to derive the maximum P(Ci|X), which can be derived from Bayes' theorem as given in the following steps:

i) P(fraud|X) = [P(X|fraud) P(fraud)] / P(X) and P(legal|X) = [P(X|legal) P(legal)] / P(X). As P(X) is constant for all classes, only P(X|fraud) P(fraud) and P(X|legal) P(legal) need to be maximized.

ii) The class prior probabilities may be estimated by P(fraud) = si / s, where s is the total number of training examples and si is the number of training examples of class fraud.

iii) A simplifying assumption of no dependence relations between the attributes is made. Thus,

P(X|fraud) = Π(k = 1..n) P(xk|fraud)  and  P(X|legal) = Π(k = 1..n) P(xk|legal)

The probabilities P(x1|fraud), P(x2|fraud), ... can be estimated from the training samples as P(xk|fraud) = sik / si, where si is the number of training examples of class fraud and sik is the number of training examples of class fraud with value xk for attribute Ak.
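To make the procedure in steps i)-iii) concrete, the following is a minimal sketch, not the paper's implementation, of a naïve Bayesian fraud/legal classifier in Python. The attribute names, the toy training records, and the unsmoothed counting scheme are illustrative assumptions; the code estimates the class priors and the per-attribute conditional probabilities from counts and multiplies them. Attributes whose value is missing (None) are simply skipped, mirroring the missing-value handling discussed later in Section 3.1.1.

```python
from collections import defaultdict

def train_naive_bayes(records, labels):
    """Estimate class priors and per-attribute value counts from training data."""
    class_counts = defaultdict(int)                 # s_i for each class
    value_counts = defaultdict(int)                 # s_ik for (class, attribute, value)
    for attrs, label in zip(records, labels):
        class_counts[label] += 1
        for attr, value in attrs.items():
            value_counts[(label, attr, value)] += 1
    total = len(labels)
    priors = {c: n / total for c, n in class_counts.items()}   # P(class) = s_i / s
    return priors, class_counts, value_counts

def classify(instance, priors, class_counts, value_counts):
    """Return P(class | X) for each class, skipping attributes with missing values."""
    scores = {}
    for c, prior in priors.items():
        likelihood = prior
        for attr, value in instance.items():
            if value is None:                       # missing value: omit this factor
                continue
            likelihood *= value_counts[(c, attr, value)] / class_counts[c]  # P(x_k | class)
        scores[c] = likelihood
    evidence = sum(scores.values()) or 1.0          # P(X), used only for normalization
    return {c: s / evidence for c, s in scores.items()}

# Illustrative toy data (hypothetical attribute values, not the paper's Table 1).
records = [{"gender": "M", "driver_rating": 1}, {"gender": "F", "driver_rating": 2},
           {"gender": "M", "driver_rating": 1}, {"gender": "M", "driver_rating": 3}]
labels = ["fraud", "legal", "legal", "legal"]
model = train_naive_bayes(records, labels)
print(classify({"gender": "M", "driver_rating": 1}, *model))
```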
3.1.1 Application

We present a Bayesian learning algorithm to predict the occurrence of fraud. Using the Output classification results in Table 1, there are 17 tuples classified as legal and 3 classified as fraud. To facilitate classification, we divide the age_driver attribute into ranges.

Table 1: Training set. Each instance records Name, Gender, Age_driver, fault, Driver_rating, Vehicle_age, and the Output class.

Instance  Name             Gender  Output
1         David Okere      M       legal
2         Beau Jackson     M       fraud
3         Jeremy Dejean    M       legal
4         Robert Howard    M       legal
5         Crystal Smith    F       legal
6         Chibuike Penson  M       legal
7         Collin Pyle      M       legal
8         Eric Penson      M       fraud
9         Kristina Green   F       legal
10        Jerry Smith      M       legal
11        Maggie Frazier   F       legal
12        Justin Howard    M       fraud
13        Michael Vasconi  M       legal
14        Bryan Thompson   M       legal
15        Chris Wilson     M       legal
16        Michael Pullen   M       legal
17        Aaron Dusek      M       legal
18        Bryan Sanders    M       legal
19        Derek Garrett    M       legal
20        Jasmine Jackson  F       legal
X         Crystal Smith    F       ?

Table 2 shows the counts and the subsequent probabilities associated with the attributes. With these simulated training data, we estimate the prior probabilities as follows.
The classifier has to predict the class of the new instance as fraud or legal. The estimated prior probabilities are:

P(fraud) = si / s = 3/20 = 0.15
P(legal) = si / s = 17/20 = 0.85

Table 2: Probabilities associated with attributes

Attribute      Value      Count (legal, fraud)   Probability (legal, fraud)
Gender         M          13, 3                  13/17, 3/3
               F          4, 0                   4/17, 0/3
age_driver     (20, 25)   3, 0                   3/18, 0
               (25, 30)   4, 0                   4/18, 0
               (30, 35)   3, 1                   3/18, 1/2
               (35, 40)   3, 1                   3/18, 1/2
               (40, 45)   3, 0                   3/18, 0
               (45, 50)   2, 0                   2/18, 0
fault / /17 3/17
driver_rating /17 1/ / / /17 2/3

We use these values to classify a new tuple. Suppose we wish to classify X = (Crystal Smith, F, 31). Using these values and the associated probabilities of gender and driver age, we obtain the following estimates:

P(X|legal) = 4/17 * 3/18 = 0.039
P(X|fraud) = 3/3 * 1/2 = 0.500

Thus, the likelihood of being legal = 0.039 * 0.9 = 0.0351, and the likelihood of being fraud = 0.500 * 0.1 = 0.0500. We estimate P(X) by summing these individual likelihood values, since X will be either legal or fraud:

P(X) = 0.0351 + 0.0500 = 0.0851
Finally, we obtain the actual probabilities of each event:

P(legal|X) = (0.039 * 0.9) / 0.0851 = 0.41
P(fraud|X) = (0.500 * 0.1) / 0.0851 = 0.59

Therefore, based on these probabilities, we classify the new tuple as fraud because that class has the higher probability.

Since attributes are treated as independent, the addition of redundant attributes reduces the classifier's predictive power. One way to relax this conditional independence assumption is to add derived attributes created from combinations of existing attributes. Missing data cause problems during the classification process. The naïve Bayesian classifier can handle missing values in training datasets; to demonstrate this, seven missing values appear in the dataset. The naïve Bayes approach is easy to use and only one scan of the training data is required. The approach can handle missing values by simply omitting the corresponding probability when calculating the likelihoods of membership in each class. Although the approach is straightforward, it does not always yield satisfactory results, because the attributes usually are not independent. We could use a subset of the attributes by ignoring any that are dependent on others. The technique does not handle continuous data. Dividing the continuous values into ranges can solve this problem, but the division of the continuous values is a tedious task, and how it is done can impact the results.
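As a small illustration of the range-based discretization just described, the helper below is a sketch whose bin edges are hypothetical, chosen to match the five-year ranges used in Table 2; it maps a continuous driver age onto a categorical range label that naïve Bayes can count.

```python
def age_range(age, edges=(20, 25, 30, 35, 40, 45, 50)):
    """Map a continuous driver age to a categorical bin such as '(30, 35)'."""
    for low, high in zip(edges, edges[1:]):
        if low < age <= high:
            return f"({low}, {high})"
    return "other"          # ages outside the configured ranges

print(age_range(31))        # -> "(30, 35)", the bin used for instance X above
```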
3.2 Decision Tree-Based Algorithm

A decision tree (DT) is a tree associated with a database in which each internal node is labeled with an attribute, each arc is labeled with a predicate that can be applied to that attribute, and each leaf node is labeled with a class. Solving the classification problem is a two-step process: i) decision tree induction, that is, constructing a DT, and ii) applying the DT to a tuple to determine its class. Rules that are easy to interpret can be generated from the tree, and decision trees scale well for large databases because the tree size is independent of the database size. However, DT algorithms do not easily handle continuous data; the attribute domains must be divided into categories. Handling missing data is difficult. Since the DT is constructed from the training data, overfitting may occur; this can be overcome via tree pruning.

3.2.1 C4.5 Algorithm

The basic algorithm for decision tree induction is as follows:

i) Suppose there are two classes, fraud and legal. The tree starts as a single node N representing the training samples.

ii) If the samples are all of the same class, say fraud, then the node becomes a leaf and is labeled fraud.

iii) Otherwise, the algorithm uses an entropy-based measure as a heuristic for selecting the attribute that will best separate the samples into individual classes. The entropy, or expected information needed to classify a given sample, is:

I(fraud, legal) = - (NumberFraudSamples / NumberSamples) log2 (NumberFraudSamples / NumberSamples) - (NumberLegalSamples / NumberSamples) log2 (NumberLegalSamples / NumberSamples)

iv) The expected information, or entropy, required to classify the samples into subsets by a test attribute A is the weighted sum over the values v of A:

E(A) = Σ over values v of A of [(NumberTestAttributeFraudValues_v / NumberSamples) + (NumberTestAttributeLegalValues_v / NumberSamples)] * I(TestAttributeFraudValues_v, TestAttributeLegalValues_v)

v) The expected reduction in entropy is:

Gain(A) = I(fraud, legal) - E(A)

The algorithm computes the information gain of each attribute. The attribute with the highest information gain is selected as the test attribute (see the sketch at the end of this subsection).

vi) A branch is created for each known value of the test attribute, and the algorithm applies the same process iteratively to form a decision tree at each partition. Once an attribute has occurred at a node, it need not be considered in any of the node's descendants. The iterative partitioning stops only when one of the following conditions is true: a) all examples for a given node belong to the same class; b) there are no remaining attributes on which the samples may be further partitioned, in which case a leaf is created with the majority class among the samples; or c) there are no samples for the branch test attribute, in which case a leaf is created with the majority class among the samples.
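The following is a minimal sketch, under the two-class fraud/legal setting above, of the entropy and information-gain computation that C4.5-style induction uses to pick a test attribute. The sample records and attribute names are illustrative assumptions, not the paper's data; the code simply evaluates I(fraud, legal), E(A) and Gain(A) as defined in steps iii)-v).

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """I(fraud, legal): expected information needed to classify a sample set."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values() if n)

def information_gain(records, labels, attribute):
    """Gain(A) = I(fraud, legal) - E(A) for one candidate test attribute."""
    partitions = defaultdict(list)                  # labels of the samples per attribute value
    for attrs, label in zip(records, labels):
        partitions[attrs[attribute]].append(label)
    expected = sum((len(part) / len(labels)) * entropy(part)   # E(A): weighted sum over values
                   for part in partitions.values())
    return entropy(labels) - expected

# Illustrative toy data (hypothetical values).
records = [{"driver_rating": 1, "fault": 1}, {"driver_rating": 1, "fault": 0},
           {"driver_rating": 2, "fault": 1}, {"driver_rating": 3, "fault": 0}]
labels = ["fraud", "legal", "legal", "legal"]
for attr in ("driver_rating", "fault"):
    print(attr, round(information_gain(records, labels, attr), 3))
```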
3.3 Rule-Based Algorithm

One way to perform classification is to generate if-then rules. There are algorithms that generate rules from trees as well as algorithms that generate rules directly, without first creating DTs.

3.3.1 Generating Rules from a Decision Tree

The following rules are generated from the decision tree (DT); as illustrated in the sketch below, such rules can be applied directly to classify new instances:

If driver_age = 25, then class = legal
If (driver_age = 40) AND (vehicle_age = 7), then class = legal
If (driver_age = 32) AND (driver_rating = 1), then class = fraud
If (driver_age <= 40) AND (driver_rating = 1) AND (vehicle_age = 2), then class = fraud
If (driver_age > 40) AND (driver_age <= 50) AND (driver_rating = 0.33), then class = legal
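As a quick illustration of how extracted IF-THEN rules are used at classification time, the sketch below encodes rules of the shape listed above as predicate/label pairs and returns the class of the first rule whose ANDed preconditions all hold. The specific thresholds and operators are placeholders rather than the paper's exact tree, and the fallback class is an assumption.

```python
# Each rule is (list of predicates over the instance, predicted class).
# Thresholds and operators below are placeholders standing in for the rules above.
RULES = [
    ([lambda x: x["driver_age"] <= 25], "legal"),
    ([lambda x: x["driver_age"] <= 40, lambda x: x["vehicle_age"] >= 7], "legal"),
    ([lambda x: x["driver_age"] > 40, lambda x: x["driver_age"] <= 50,
      lambda x: x["driver_rating"] == 0.33], "legal"),
]

def classify_by_rules(instance, rules=RULES, default="fraud"):
    """Return the class of the first rule whose ANDed preconditions all succeed."""
    for predicates, label in rules:
        if all(pred(instance) for pred in predicates):
            return label
    return default   # assumed fallback when no rule fires

print(classify_by_rules({"driver_age": 23, "vehicle_age": 3, "driver_rating": 1}))  # -> "legal"
```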
4. MODEL PERFORMANCE

4.1 Confusion Matrix

There are two ways to examine the performance of classifiers: i) a confusion matrix, and ii) a ROC graph. Given a class Cj and a tuple ti, that tuple may or may not be assigned to that class, while its actual membership may or may not be in that class. With two classes, there are four possible classification outcomes: i) true positives (hits), ii) false positives (false alarms), iii) true negatives (correct rejections), and iv) false negatives (misses). A false positive occurs if the actual outcome is legal but is incorrectly predicted as fraud. A false negative occurs when the actual outcome is fraud but is incorrectly predicted as legal. A confusion matrix (Kohavi and Provost, 1998), Table 3a, contains information about actual and predicted classifications, and performance is evaluated using the data in the matrix. Table 3b shows a confusion matrix built on simulated data. It shows the classification model applied to test data consisting of 7000 instances roughly split evenly between the two classes; the model commits some errors and has an accuracy of 78%. We also applied the model to the same data but with the negative class sampled down, changing the class skew in the data. The quality of a model thus depends highly on the choice of the test data. We also note that ROC curves are not so dependent on the choice of test data, at least with respect to class skew.

Table 3a: Confusion matrix

                  Observed legal            Observed fraud
Predicted legal   TN (correct rejection)    FN (miss)
Predicted fraud   FP (false alarm)          TP (hit)

Table 3b: Confusion matrix of a model applied to the test dataset. The counts yield accuracy = 0.78, recall = 0.86, and precision = 0.70.

A number of model performance metrics, listed in Table 3c, can be derived from the confusion matrix.

Table 3c: Performance metrics

Accuracy (AC)                    (TP + TN) / (TP + FP + TN + FN)
Recall, or true positive rate    TP / (TP + FN)
False positive rate              FP / (FP + TN)
True negative rate               TN / (TN + FP)
False negative rate              FN / (FN + TP)
Precision (P)                    TP / (TP + FP)
Geometric mean (g-mean)          g-mean1 = sqrt(TP rate * P); g-mean2 = sqrt(TP rate * TN rate)
F-measure                        ((β^2 + 1) * P * TP rate) / (β^2 * P + TP rate)
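A minimal sketch of how these metrics can be computed from the four confusion-matrix counts follows; the counts passed at the bottom are made up for illustration, and β = 1 is assumed for the F-measure.

```python
import math

def confusion_metrics(tp, fp, tn, fn, beta=1.0):
    """Derive the Table 3c style metrics from raw confusion-matrix counts."""
    tp_rate = tp / (tp + fn)                      # recall / true positive rate
    fp_rate = fp / (fp + tn)
    tn_rate = tn / (tn + fp)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "recall": tp_rate,
        "fp_rate": fp_rate,
        "precision": precision,
        "g_mean1": math.sqrt(tp_rate * precision),
        "g_mean2": math.sqrt(tp_rate * tn_rate),
        "f_measure": ((beta**2 + 1) * precision * tp_rate) / (beta**2 * precision + tp_rate),
    }

# Hypothetical counts for a fraud classifier on a skewed test set.
print(confusion_metrics(tp=30, fp=40, tn=1420, fn=10))
```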
The accuracy determined in Table 3b may not be an adequate performance measure when the number of negative cases is much greater than the number of positive cases (Kubat et al., 1998). Suppose there are 1500 cases, 1460 of which are negative and 40 of which are positive. If the system classifies them all as negative, the accuracy would be 97.3%, even though the classifier missed all positive cases. Other performance measures are the geometric mean (g-mean) (Kubat et al., 1998) and the F-measure (Lewis and Gale, 1994). For the F-measure, β takes a value from 0 to infinity and is used to control the relative weight assigned to the TP rate and precision. Any classifier evaluated using g-mean or F-measure will have a value of 0 if all positive cases are classified incorrectly.

To easily view and understand the output, visualization of the results is helpful. Naïve Bayesian visualization provides an interactive view of the prediction results: the attributes can be sorted by the predictor, and evidence items can be sorted by the number of items in their storage bins. Attribute column graphs help to find the significant attributes in neural networks. Decision tree visualization builds trees by splitting attributes from C4.5 classifiers. Cumulative gains and lift charts are visual aids for measuring model performance. Lift is a measure of a predictive model calculated as the ratio between the results obtained with and without the predictive model. For instance, if 5% of all samples are actually fraud and a naïve Bayesian classifier correctly predicts 20 fraud samples per 100 samples it flags, that corresponds to a lift of 4.

Table 4: Costs of predictions

True Positive (hit): cost = number of hits * average cost per investigation
False Positive (false alarm): cost = number of false alarms * (average cost per investigation + average cost per claim)
False Negative (miss): cost = number of misses * average cost per claim
True Negative (correct rejection): cost = number of correct rejections * average cost per claim

Table 4 shows that True Positives (hits) and False Positives (false alarms) incur the cost of an investigation. False alarms are the most expensive because they incur both investigation and claim costs. False Negatives (misses) and True Negatives (correct rejections) incur the cost of the claim; a small sketch of this cost computation follows.
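A minimal sketch of the Table 4 cost model, assuming hypothetical per-investigation and per-claim averages (the dollar figures are placeholders, not values from the paper):

```python
def total_cost(tp, fp, tn, fn, avg_investigation=50.0, avg_claim=2000.0):
    """Total cost of a classifier's decisions under the Table 4 cost model."""
    return (tp * avg_investigation                      # hits: investigation only
            + fp * (avg_investigation + avg_claim)      # false alarms: investigation + claim
            + fn * avg_claim                            # misses: fraudulent claim paid out
            + tn * avg_claim)                           # correct rejections: legitimate claim paid

# Comparing two hypothetical classifiers on the same 1500-claim test set.
print(total_cost(tp=30, fp=40, tn=1420, fn=10))
print(total_cost(tp=20, fp=10, tn=1450, fn=20))
```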
4.2 Relative Operating Characteristic Curve

Another way to examine the performance of classifiers is to use a Relative Operating Characteristic (ROC) curve (Swets, 1988). A ROC graph is a curve that depicts the performance and performance tradeoffs of a classification model (Fawcett, 2004; Flach, 2004), with the false positive rate along the X-axis and the true positive rate along the Y-axis. The point (0, 1) is the perfect classifier: it classifies all positive and negative cases correctly. It is (0, 1) because the FP rate is 0 and the TP rate is 1. The point (0, 0) represents a classifier that predicts all cases to be negative, while the point (1, 1) corresponds to a classifier that predicts every case to be positive. The point (1, 0) is the classifier that is incorrect for all classifications. An ROC curve or point is independent of class distribution or error costs (Provost et al., 1998). It summarizes all the information contained in the confusion matrix, since FN is the complement of TP and TN is the complement of FP (Swets, 1988). It provides a visual tool for examining the tradeoff between the ability of a classifier to correctly identify positive cases and the number of negative cases that are incorrectly classified. We introduce two performance metrics used to construct ROC curves (in confusion matrix terms), the TP rate (TPR) and the FP rate (FPR):

TPR (recall) = TP / (TP + FN)
FPR = FP / (TN + FP)

The classifier is mapped to the same point in the ROC graph regardless of whether the original test set or the test set with the sampled-down negative class is used, illustrating that ROC graphs are not sensitive to class skew. One way of comparing ROC points is by using an equation that equates accuracy with the Euclidean distance from the perfect classifier, the point (0, 1). We include a weight factor that allows relative misclassification costs to be defined. We define ACd as a distance-based performance measure:

ACd = sqrt( 2 * W * (1 - TPR)^2 + 2 * (1 - W) * FPR^2 )

where W ranges from 0 to 1 and is used to assign relative importance to false positives and false negatives. ACd ranges from 0 for the perfect classifier to sqrt(2) for a classifier that classifies all cases incorrectly. ACd differs from g-mean1, g-mean2 and F-measure in that it is equal to 0 only if all cases are classified correctly. In other words, a classifier evaluated using ACd gets some credit for correct classification of negative cases, regardless of its accuracy in correctly identifying positive cases.
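The following is a minimal sketch of how a classifier's ROC point and the ACd measure above can be computed from confusion-matrix counts; the example counts and the choice of W = 0.5 are illustrative assumptions.

```python
import math

def roc_point(tp, fp, tn, fn):
    """Return (FPR, TPR), the classifier's point in ROC space."""
    return fp / (tn + fp), tp / (tp + fn)

def ac_d(tp, fp, tn, fn, w=0.5):
    """Weighted distance from the perfect classifier at ROC point (0, 1)."""
    fpr, tpr = roc_point(tp, fp, tn, fn)
    return math.sqrt(2 * w * (1 - tpr) ** 2 + 2 * (1 - w) * fpr ** 2)

# Two hypothetical classifiers on the same test set: the one closer to (0, 1) wins.
for counts in ({"tp": 30, "fp": 40, "tn": 1420, "fn": 10},
               {"tp": 20, "fp": 10, "tn": 1450, "fn": 20}):
    print(roc_point(**counts), round(ac_d(**counts), 3))
```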
5. CONCLUSIONS

We studied the existing fraud detection systems. To predict and present fraud we used a naïve Bayesian classifier. We looked at model performance metrics derived from the confusion matrix, and we illustrated how ROC curves can be deployed for model assessment. Performance metrics such as accuracy, recall, and precision are derived from the confusion matrix. ROC analysis provides a highly visual account of a model's performance. It is robust with respect to class skew, making it a reliable performance metric in many important fraud detection application areas.

REFERENCES

Barse, E., Kvarnstrom, H. & Jonsson, E. (2003). Synthesizing Test Data for Fraud Detection Systems. Proc. of the 19th Annual Computer Security Applications Conference.
Bell, T. & Carcello, J. (2000). A Decision Aid for Assessing the Likelihood of Fraudulent Financial Reporting. Auditing: A Journal of Practice and Theory 10(1).
Belhadji, E., Dionne, G. & Tarkhani, F. (2000). A Model for the Detection of Insurance Fraud. The Geneva Papers on Risk and Insurance 25(4).
Bentley, P. (2000). Evolutionary, my dear Watson: Investigating Committee-based Evolution of Fuzzy Rules for the Detection of Suspicious Insurance Claims. Proc. of GECCO2000.
Bentley, P., Kim, J., Jung, G. & Choi, J. (2000). Fuzzy Darwinian Detection of Credit Card Fraud. Proc. of 14th Annual Fall Symposium of the Korean Information Processing Society.
Bhargava, B., Zhong, Y. & Lu, Y. (2003). Fraud Formalization and Detection. Proc. of DaWaK2003.
Bolton, R. & Hand, D. (2002). Statistical Fraud Detection: A Review (With Discussion). Statistical Science 17(3).
Bolton, R. & Hand, D. (2001). Unsupervised Profiling Methods for Fraud Detection. Credit Scoring and Credit Control VII.
Bonchi, F., Giannotti, F., Mainetto, G. & Pedreschi, D. (1999). A Classification-based Methodology for Planning Auditing Strategies in Fraud Detection. Proc. of SIGKDD99.
Brause, R., Langsdorf, T. & Hepp, M. (1999). Neural Data Mining for Credit Card Fraud Detection. Proc. of 11th IEEE International Conference on Tools with Artificial Intelligence.
Brockett, P., Derrig, R., Golden, L., Levine, A. & Alpert, M. (2002). Fraud Classification using Principal Component Analysis of RIDITs. Journal of Risk and Insurance 69(3).
Burge, P. & Shawe-Taylor, J. (2001). An Unsupervised Neural Network Approach to Profiling the Behavior of Mobile Phone Users for Use in Fraud Detection. Journal of Parallel and Distributed Computing 61.
Cahill, M., Chen, F., Lambert, D., Pinheiro, J. & Sun, D. (2002). Detecting Fraud in the Real World. Handbook of Massive Datasets.
Chan, P., Fan, W., Prodromidis, A. & Stolfo, S. (1999). Distributed Data Mining in Credit Card Fraud Detection. IEEE Intelligent Systems 14.
Chen, R., Chiu, M., Huang, Y. & Chen, L. (2004). Detecting Credit Card Fraud by Using Questionnaire-Responded Transaction Model Based on Support Vector Machines. Proc. of IDEAL2004.
Cortes, C., Pregibon, D. & Volinsky, C. (2003). Computational Methods for Dynamic Graphs. Journal of Computational and Graphical Statistics 12.
Cox, E. (1995). A Fuzzy System for Detecting Anomalous Behaviors in Healthcare Provider Claims. In Goonatilake, S. & Treleaven, P. (eds.) Intelligent Systems for Finance and Business, John Wiley.
Elkan, C. (2001). Magical Thinking in Data Mining: Lessons from CoIL Challenge 2000. Proc. of SIGKDD01.
Ezawa, K. & Norton, S. (1996). Constructing Bayesian Networks to Predict Uncollectible Telecommunications Accounts. IEEE Expert, October.
Fan, W. (2004). Systematic Data Selection to Mine Concept-Drifting Data Streams. Proc. of SIGKDD04.
Fawcett, T. (2004). ROC graphs: Notes and practical considerations for researchers. Machine Learning, 3.
Fawcett, T. & Flach, P. A. (2005). A response to Webb and Ting's "On the application of ROC analysis to predict classification performance under varying class distributions". Machine Learning, 58(1).
Flach, P. (2004). Tutorial at ICML 2004: The many faces of ROC analysis in machine learning. Unpublished manuscript.
Flach, P., Blockeel, H., Ferri, C., Hernandez-Orallo, J. & Struyf, J. (2003). Decision support for data mining: Introduction to ROC analysis and its applications. Data mining and decision support: Aspects of integration and collaboration.
Flach, P. A. (2003). The geometry of ROC space: Understanding machine learning metrics through ROC isometrics. Proceedings of the Twentieth International Conference on Machine Learning.
Foster, D. & Stine, R. (2004). Variable Selection in Data Mining: Building a Predictive Model for Bankruptcy. Journal of the American Statistical Association 99.
He, H., Wang, J., Graco, W. & Hawkins, S. (1997). Application of Neural Networks to Detection of Medical Fraud. Expert Systems with Applications, 13.
James, F. (2002). FBI has eye on business databases. Chicago Tribune, Knight Ridder/Tribune Business News.
Kim, H., Pang, S., Je, H., Kim, D. & Bang, S. (2003). Constructing Support Vector Machine Ensemble. Pattern Recognition 36.
Kim, J., Ong, A. & Overill, R. (2003). Design of an Artificial Immune System as a Novel Anomaly Detector for Combating Financial Fraud in Retail Sector. Congress on Evolutionary Computation.
Lin, J., Hwang, M. & Becker, J. (2003). A Fuzzy Neural Network for Assessing the Risk of Fraudulent Financial Reporting. Managerial Auditing Journal 18(8).
Little, B., Johnston, W., Lovell, A., Rejesus, R. & Steed, S. (2002). Collusion in the US Crop Insurance Program: Applied Data Mining. Proc. of SIGKDD02.
Maes, S., Tuyls, K., Vanschoenwinkel, B. & Manderick, B. (2002). Credit Card Fraud Detection using Bayesian and Neural Networks. Proc. of the 1st International NAISO Congress on Neuro Fuzzy Technologies.
Magnify (2002). FraudFocus Advanced Fraud Detection. White Paper, Chicago.
Magnify (2002). The Evolution of Insurance Fraud Detection: Lessons Learnt from Other Industries. White Paper, Chicago.
Major, J. & Riedinger, D. (2002). EFD: A Hybrid Knowledge/Statistical-based System for the Detection of Fraud. Journal of Risk and Insurance 69(3).
Mena, J. (2003). Data Mining for Homeland Security. Executive Briefing, VA.
Mena, J. (2003). Investigative Data Mining for Security and Criminal Detection. Butterworth Heinemann, MA.
McGibney, J. & Hearne, S. (2003). An Approach to Rules-based Fraud Management in Emerging Converged Networks. Proc. of IEI/IEEE ITSRS.
Moreau, Y. & Vandewalle, J. (1997). Detection of Mobile Phone Fraud Using Supervised Neural Networks: A First Prototype. Proc. of 1997 International Conference on Artificial Neural Networks.
Ormerod, T., Morley, N., Ball, L., Langley, C. & Spenser, C. (2003). Using Ethnography to Design a Mass Detection Tool (MDT) for the Early Discovery of Insurance Fraud. Computer Human Interaction, April 5-10, Ft. Lauderdale, Florida.
Phua, C., Alahakoon, D. & Lee, V. (2004). Minority Report in Fraud Detection: Classification of Skewed Data. SIGKDD Explorations 6(1).
Provost, F., Fawcett, T. & Kohavi, R. (1998). The case against accuracy estimation for comparing induction algorithms. Proceedings of the Fifteenth International Conference on Machine Learning.
Rosset, S., Murad, U., Neumann, E., Idan, Y. & Pinkas, G. (1999). Discovery of Fraud Rules for Telecommunications - Challenges and Solutions. Proc. of SIGKDD99.
SAS e-intelligence (2000). Data Mining in the Insurance Industry: Solving Business Problems using SAS Enterprise Miner Software. White Paper.
Shao, H., Zhao, H. & Chang, G. (2002). Applying Data Mining to Detect Fraud Behavior in Customs Declaration. Proc. of 1st International Conference on Machine Learning and Cybernetics.
Sherman, E. (2002). Fighting Web Fraud. Newsweek, June 10.
SPSS (2003). Data Mining and Crime Analysis in the Richmond Police Department. White Paper, Virginia.
Stefano, B. & Gisella, F. (2001). Insurance Fraud Evaluation: A Fuzzy Expert System. Proc. of IEEE International Fuzzy Systems Conference.
Syeda, M., Zhang, Y. & Pan, Y. (2002). Parallel Granular Neural Networks for Fast Credit Card Fraud Detection. Proc. of the 2002 IEEE International Conference on Fuzzy Systems.
Swets, J. A., Dawes, R. M. & Monahan, J. (2000). Better decisions through science. Scientific American, 283(4).
Von Altrock, C. (1997). Fuzzy Logic and Neurofuzzy Applications in Business and Finance. Prentice Hall.
Viaene, S., Derrig, R. & Dedene, G. (2004). A Case Study of Applying Boosting Naive Bayes to Claim Fraud Diagnosis. IEEE Transactions on Knowledge and Data Engineering 16(5).
Yamanishi, K., Takeuchi, J., Williams, G. & Milne, P. (2004). On-Line Unsupervised Outlier Detection Using Finite Mixtures with Discounting Learning Algorithms. Data Mining and Knowledge Discovery 8.
Weatherford, M. (2002). Mining for Fraud. IEEE Intelligent Systems, July/August.
Wheeler, R. & Aitken, S. (2000). Multiple Algorithms for Fraud Detection. Knowledge-Based Systems 13(3).
Williams, G. J. & Huang, Z. (1997). Mining the Knowledge Mine: The Hot Spots Methodology for Mining Large Real World Databases. 10th Australian Joint Conference on Artificial Intelligence, Lecture Notes in Artificial Intelligence, Springer-Verlag, December, Perth, Western Australia.
Williams, G. (1999). Evolutionary Hot Spots Data Mining: An Architecture for Exploring for Interesting Discoveries. Proceedings of the 3rd Pacific-Asia Conference on Knowledge Discovery and Data Mining, Beijing, China.
Course Syllabus. Purposes of Course:
Course Syllabus Eco 5385.701 Predictive Analytics for Economists Summer 2014 TTh 6:00 8:50 pm and Sat. 12:00 2:50 pm First Day of Class: Tuesday, June 3 Last Day of Class: Tuesday, July 1 251 Maguire Building
W6.B.1. FAQs CS535 BIG DATA W6.B.3. 4. If the distance of the point is additionally less than the tight distance T 2, remove it from the original set
http://wwwcscolostateedu/~cs535 W6B W6B2 CS535 BIG DAA FAQs Please prepare for the last minute rush Store your output files safely Partial score will be given for the output from less than 50GB input Computer
ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA
ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA D.Lavanya 1 and Dr.K.Usha Rani 2 1 Research Scholar, Department of Computer Science, Sree Padmavathi Mahila Visvavidyalayam, Tirupati, Andhra Pradesh,
Experiments in Web Page Classification for Semantic Web
Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address
Distributed forests for MapReduce-based machine learning
Distributed forests for MapReduce-based machine learning Ryoji Wakayama, Ryuei Murata, Akisato Kimura, Takayoshi Yamashita, Yuji Yamauchi, Hironobu Fujiyoshi Chubu University, Japan. NTT Communication
An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015
An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content
Data Mining Framework for Direct Marketing: A Case Study of Bank Marketing
www.ijcsi.org 198 Data Mining Framework for Direct Marketing: A Case Study of Bank Marketing Lilian Sing oei 1 and Jiayang Wang 2 1 School of Information Science and Engineering, Central South University
E-commerce Transaction Anomaly Classification
E-commerce Transaction Anomaly Classification Minyong Lee [email protected] Seunghee Ham [email protected] Qiyi Jiang [email protected] I. INTRODUCTION Due to the increasing popularity of e-commerce
Equity forecast: Predicting long term stock price movement using machine learning
Equity forecast: Predicting long term stock price movement using machine learning Nikola Milosevic School of Computer Science, University of Manchester, UK [email protected] Abstract Long
CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.
CI6227: Data Mining Lesson 11b: Ensemble Learning Sinno Jialin PAN Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore Acknowledgements: slides are adapted from the lecture notes
Selecting Data Mining Model for Web Advertising in Virtual Communities
Selecting Data Mining for Web Advertising in Virtual Communities Jerzy Surma Faculty of Business Administration Warsaw School of Economics Warsaw, Poland e-mail: [email protected] Mariusz Łapczyński
HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION
HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION Chihli Hung 1, Jing Hong Chen 2, Stefan Wermter 3, 1,2 Department of Management Information Systems, Chung Yuan Christian University, Taiwan
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be
A Content based Spam Filtering Using Optical Back Propagation Technique
A Content based Spam Filtering Using Optical Back Propagation Technique Sarab M. Hameed 1, Noor Alhuda J. Mohammed 2 Department of Computer Science, College of Science, University of Baghdad - Iraq ABSTRACT
IDENTIFIC ATION OF SOFTWARE EROSION USING LOGISTIC REGRESSION
http:// IDENTIFIC ATION OF SOFTWARE EROSION USING LOGISTIC REGRESSION Harinder Kaur 1, Raveen Bajwa 2 1 PG Student., CSE., Baba Banda Singh Bahadur Engg. College, Fatehgarh Sahib, (India) 2 Asstt. Prof.,
Analecta Vol. 8, No. 2 ISSN 2064-7964
EXPERIMENTAL APPLICATIONS OF ARTIFICIAL NEURAL NETWORKS IN ENGINEERING PROCESSING SYSTEM S. Dadvandipour Institute of Information Engineering, University of Miskolc, Egyetemváros, 3515, Miskolc, Hungary,
1. Classification problems
Neural and Evolutionary Computing. Lab 1: Classification problems Machine Learning test data repository Weka data mining platform Introduction Scilab 1. Classification problems The main aim of a classification
A COMPARATIVE STUDY OF DECISION TREE ALGORITHMS FOR CLASS IMBALANCED LEARNING IN CREDIT CARD FRAUD DETECTION
International Journal of Economics, Commerce and Management United Kingdom Vol. III, Issue 12, December 2015 http://ijecm.co.uk/ ISSN 2348 0386 A COMPARATIVE STUDY OF DECISION TREE ALGORITHMS FOR CLASS
The Data Mining Process
Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data
