Choosing Data-Mining Methods for Multiple Classification: Representational and Performance Measurement Implications for Decision Support


WILLIAM E. SPANGLER, JERROLD H. MAY, AND LUIS G. VARGAS

WILLIAM E. SPANGLER is an Assistant Professor in the College of Business and Economics at West Virginia University. After several years in private industry, he earned his Ph.D. in 1995 from the Katz Graduate School of Business, University of Pittsburgh, specializing in artificial intelligence. His current research interests focus on data mining and computational modeling for decision support. His work has been published in various journals, including Information and Management, Interfaces, Expert Systems with Applications, and IEEE Transactions on Knowledge and Data Engineering.

JERROLD H. MAY is a Professor of Artificial Intelligence at the Katz Graduate School of Business, University of Pittsburgh, and is also the Director of the Artificial Intelligence in Management (AIM) Laboratory there. He has more than sixty refereed publications in a variety of outlets, ranging from management journals such as Operations Research and Information Systems Research to medical ones such as Anesthesiology and Journal of the American Medical Informatics Association. Professor May's current work focuses on modeling, planning, and control problems, the solutions to which combine management science, statistical analysis, and artificial intelligence, particularly for operational tasks in health-related applications.

LUIS G. VARGAS is a Professor of Decision Sciences and Artificial Intelligence at the Katz Graduate School of Business, University of Pittsburgh, and Co-Director of the AIM Laboratory. He has published over forty papers in refereed journals such as Management Science, Operations Research, Anesthesiology, and Journal of the American Medical Informatics Association, and three books on applications of the Analytic Hierarchy Process with Thomas L. Saaty.
Professor Vargas's current work focuses on the use of operations research and artificial intelligence methods in health care environments.

Journal of Management Information Systems, Summer 1999, Vol. 16, No. 1. © 1999 M.E. Sharpe, Inc.

ABSTRACT: Data-mining techniques are designed for classification problems in which each observation is a member of one and only one category. We formulate ten data representations that could be used to extend those methods to problems in which observations may be full members of multiple categories. We propose an audit matrix methodology for evaluating the performance of three popular data-mining techniques (linear discriminant analysis, neural networks, and decision tree induction)

using the representations that each technique can accommodate. We then empirically test our approach on an actual surgical data set. Tree induction gives the lowest rate of false positive predictions, and a version of discriminant analysis yields the lowest rate of false negatives for multiple category problems, but neural networks give the best overall results for the largest multiple classification cases. There is substantial room for improvement in overall performance for all techniques.

KEY WORDS AND PHRASES: data mining, decision support systems, decision tree induction, neural networks, statistical classification.

DATA MINING IS THE SEARCH THROUGH REAL-WORLD DATA for general patterns that are useful in classifying individual observations and in making reasoned predictions about outcomes [11]. That generally entails the use of statistical methods to link a series of independent variables that collectively describe a case or observation to the dependent variable(s) that classifies the case. The set of classification problems typically includes patterns containing a single, well-defined dependent variable or category; that is, an observation is assigned to one and only one category. This research explores the less-tractable problems of multiple classification in data mining, wherein a single observation may be classified into more than one category. Because multiple classification is a significant aspect of numerous managerial tasks (including, among others, diagnosis, auditing, and scheduling), understanding the effectiveness of data-mining methods in multiple classification situations has important implications for the use of information systems for knowledge-based decision support. By multiple classification, we mean that the categories are well defined and mutually exclusive, but the observations themselves transcend categorical boundaries.
This contrasts with fuzzy clustering, in which the categories themselves are not necessarily either well defined or mutually exclusive. Consider, for example, the universe of categories that includes men, researchers, and jazz singers. In this situation, a single person can belong to any combination of these categories simultaneously, in effect having multiple membership across categories. Classifying someone in such a context requires recognizing the potential for multiple membership, while also identifying the correct categories themselves. Figures 1 and 2 show two alternative ways of pictorially representing multiple classification problems. In Figure 1, the categories are shown as distinct and mutually exclusive, with individual cases or observations transcending categories. Figure 2 shows the categories in a type of Venn diagram, with multiply classified cases appearing in the intersections between categories. In contrast to the situation we consider, when the categories themselves are poorly defined or understood, classification of observations within those categories may be correspondingly uncertain. Fuzzy clustering assigns observations with likelihoods to multiple categories, where the likelihoods sum to one.

Multiple Classification Problems: An Example

DATA MINING IN A DIAGNOSTIC SETTING IS THE SEARCH for patterns linking state descriptions with associated outcomes, with the objective of predicting an outcome

Figure 1. Observations Classified into Multiple Categories: Observation O1 is in Category C1, O2 is in C1 and C2, while O3 is in C1, C2, and C3.

Figure 2. A Multiple Classification Domain Pictured as a Venn Diagram, with Multiply Classified Observations Appearing in the Intersections

given new data showing a similar pattern. It can be characterized as an attempt to extract knowledge from data. In the medical domain, numerical codes are used to describe both what is wrong with a patient and what was done to treat the patient. The dominant patient state taxonomy is the International Classification of Diseases (ICD-9) coding system, which indicates a patient's disease or condition within a hierarchical, numerical classification scheme. The corresponding procedural taxonomy is the Current Procedural Terminology (CPT) system, which indicates the procedure(s) performed on a patient, partly in response to the ICD-9 code(s) assigned to the patient. Our empirical results are derived from 59,864 patient records, each of which includes the patient's diagnoses (ICD-9s), the procedures performed on the patient (CPTs), and patient demographic and case-specific information.
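To make the record structure concrete, a single case can be sketched as follows. This is our own hypothetical Python rendering, not the hospital system's actual schema; the field names and example values are illustrative only.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SurgicalCase:
    """One patient record: diagnoses and demographics (independent variables)
    linked to the procedures actually performed (dependent variables)."""
    icd9_codes: List[str]                # one to three diagnosis codes, e.g. "180.9"
    age: int                             # patient demographic information
    sex: str
    cpt_codes: List[str] = field(default_factory=list)  # one to three procedure codes

# An illustrative case with a single diagnosis and two procedures performed:
case = SurgicalCase(icd9_codes=["180.9"], age=52, sex="F",
                    cpt_codes=["58210", "77760"])
```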

The data-mining task is to find patterns linking patient ICD-9 and demographic information to CPT outcomes. The task is important for surgical scheduling because patient diagnoses and demographics are known before the patient enters the operating room, but the procedures that will be performed there often are not known with certainty. The surgery performed is a function of the information discovered by the physicians during the course of the operation. Data mining in this situation is a multiple-classification task because several procedures may be performed on a patient during a single operation. The patient records to which we had access contain as many as three CPTs each. The ICD-9(s) and patient demographics are the independent variables in the analysis. The surgical procedures performed on the patient (CPTs) are the dependent variables. Each set of independent variables is linked to one, two, or three CPTs. The identification of the procedures to be performed on a patient is important for a number of managerial tasks in this domain, including medical auditing, operating room scheduling, and materials management. If a surgical scheduler knew in advance the most likely sets of procedures and had models for estimating times for sets of procedures, the scheduler could plan the operating room schedule more effectively. The problem for a data-mining method in this multiple classification domain is twofold: First, the method must be able to identify the proper number of CPTs associated with a specific pattern. That is, if a set of patient factors is normally associated with two CPTs, the method should construct a pattern linking the factors to two and only two CPTs. Second, the method should identify the specific CPTs. Scoring the performance of a data-mining tool requires consideration of both of these aspects of the problem.
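Both aspects of that scoring can be sketched by comparing the predicted set of CPT codes against the set actually performed. This is a minimal illustration of the set comparison involved, not the audit matrix methodology itself; the codes used are arbitrary examples.

```python
def score_prediction(predicted, actual):
    """Compare a predicted set of CPT codes against the codes actually performed.

    False positives: codes predicted but not performed.
    False negatives: codes performed but not predicted.
    A method that predicts the wrong number of CPTs necessarily incurs
    at least one of the two error types."""
    predicted, actual = set(predicted), set(actual)
    return {
        "false_positives": len(predicted - actual),
        "false_negatives": len(actual - predicted),
        "correct": len(predicted & actual),
    }

# The method predicted two CPTs; the operation actually involved a different pair:
result = score_prediction({"58210", "77760"}, {"58210", "38562"})
# one correct, one false positive, one false negative
```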
Comparison of Data-Mining Methods

THE EXAMPLE ABOVE HINTS AT THE CHARACTERISTICS of the multiple classification problem that make it interesting as well as difficult. Our goal is to find and to propose solutions to the following associated problems of multiple classification:

Problem representation: How should a decision maker structure a model both to recognize and to identify multiple classes? Should the dependent variables be treated individually, as a series of yes/no questions related to the presence or absence of each variable, even when they occur as a group in a single record? Alternatively, should a group of dependent variables be treated as a separate entity, distinct from the individual variables that comprise the group?

Performance measurement: In multiple classification, there is potential for both false positives (i.e., assigning an observation to an incorrect class) and false negatives (i.e., not assigning an observation to a correct class). A decision maker requires a strategy for scoring the performance of various data-mining methods, based on the relative number of both types of errors and their associated costs.

Those research issues could be investigated using mathematical arguments or numerically. While a mathematical comparison would be the most definitive one, we are not aware of a methodology that would permit it to be done. Numerical research

can be done with artificially generated data or with real data. Conclusions based on empirical research are useful if the characteristics of the samples on which they are based are sufficiently similar to those of problem instances others are likely to encounter in practice. We preferred using a real data set to the use of generated data because it is representative of an important class of managerial problems, and because the noise in the data set provides us with information on the sensitivity of the approaches to dirt in a data set. Artificial data would have allowed us to carefully control the population parameters from which the data are drawn, but would have required that we first define and estimate values for all such critical population parameters. Using our real data set, we empirically compare the performance of tree and rule induction (TRI), artificial neural networks (ANN), and linear discriminant analysis (LDA) in modeling the multiple classification patterns in our data set. We chose the three methods because each is a popular data-mining method sharing a number of common characteristics while also exhibiting some notable differences (see Table 1). Weiss and Indurkhya divide data-mining algorithms into three groups: math-based methods, distance-based methods, and logic-based methods [29]. LDA is the most common math-based method, as well as the most common classification technique in general, while TRI is the most common logic-based method.

Table 1. Comparison of Data-Mining Methods

                          Decision tree induction       Neural networks          Discriminant analysis
Type of method            Logic-based                   Math-based               Math-based
Learning approach         Supervised                    Supervised               Supervised
Linearity                 Linear                        Nonlinear                Linear
Representational scheme   Set of decision nodes and     Functional relationship  Functional relationship
                          branches; production system   between attributes       between attributes
                                                        and classes              and classes
Neural networks are an increasingly popular nonlinear math-based method. Tools employing these methods are commonly available in commercial computer-based applications. All three methods are supervised learning techniques. That is, they induce rules for assigning observations to predefined classes from a set of examples, as opposed to unsupervised techniques, which both define classes and determine classification rules [20, 25]. Supervised learning techniques are appropriate for our decision problems because the classes (CPT codes) are defined exogenously and cannot be modified by the decision maker. Each of the methods we compare engages in discrete classification through a process of selection and combination of case attributes, and each employs similar validation techniques, described below. Cluster analysis and knowledge discovery methods are examples of unsupervised learning algorithms. The methods also differ, particularly in the way they model the relationships among

attributes and classes. The classification structures of LDA and ANN are expressed mathematically as a functional relationship between weighted attributes and resulting classes. TRI represents relationships as a set of decision nodes and branches, which, in turn, can be represented as a production system, or set of rules. LDA and TRI are linear approaches; ANN is nonlinear. The effectiveness of the representation generally depends on the orientation and mathematical sophistication of the user. Liang argues that, because the choice of a learning method for an application is an important problem, research into the comparison of alternative (or perhaps even complementary) methods is likewise important [19]. That is especially true for data mining, where the costs and potential benefits involved strongly motivate the proper choice of tool and method as well as the proper analysis of the results.

Tree and Rule Induction

TRI is attractive because its explicit representation of classification as a series of binary splits makes the induced knowledge structure easy to understand and validate. TRI constructs a tree, but the tree can be translated into an equivalent set of rules. We used Quinlan's See5 package, the most recent version of his ID3 algorithm [22]. ID3 induces a decision tree from a table of individual cases, each of which describes identified attributes as well as the class to which the case belongs. At each node, the algorithm builds the tree by assessing the conditional probabilities linking attributes and outcomes, and divides the subset of cases under consideration into two further subsets so as to minimize entropy, a measure of the information content of the data. The user specifies parameters that control the stopping behavior of the method.
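The entropy-minimizing split can be sketched as follows. This is a generic illustration of the criterion, not See5's implementation; the attribute name and toy cases are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_entropy(cases, attribute):
    """Weighted average entropy of the subsets induced by splitting on an attribute."""
    branches = {}
    for attrs, label in cases:
        branches.setdefault(attrs[attribute], []).append(label)
    n = len(cases)
    return sum(len(b) / n * entropy(b) for b in branches.values())

# Toy cases: ({attribute: value}, class). The tree builder would choose the
# attribute whose split yields the lowest weighted entropy.
cases = [({"general_anesthesia": True}, "A"), ({"general_anesthesia": True}, "A"),
         ({"general_anesthesia": False}, "B"), ({"general_anesthesia": False}, "B")]
# A perfect split drives the weighted entropy to zero.
```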
If the training set contains no contradictory cases (that is, cases with identical attributes that are members of different classes), a fully grown tree will produce an error rate of zero on the training set. Weiss and Kulikowski [31] show that as a tree becomes more complex, measured by the number of decision nodes it contains, the danger of overfitting the data increases, and the predictive power of the tree declines commensurately. That is, the true predictive error rate, measured by the performance of the tree on test cases, becomes much higher than the apparent error rate reflected in the performance of the tree against the training cases alone. To minimize the true error rate, See5 first grows the tree completely, and then prunes it based on a prespecified certainty factor at each node. Performance evaluation of classification methodologies is discussed in the statistical literature (for example, see [15], ch. 11). The two most common approaches are dividing the data set into training and holdout subsets before estimating the model and jackknifing. The former avoids the bias of using the same information for both creating and judging the model. However, it requires large data sets, and there is no simple way to provide a definitive rule for determining either the size or composition of the two subsets. Worse, separating a holdout sample results in the creation of a model that is not the desired one, because the removal of the holdout sample reduces the information content of the training set and may exclude cases that are critical to the estimation of an accurate or robust model. The alternative to separating the data into two groups is

jackknifing, a one-at-a-time holdout procedure due to Lachenbruch [18]. Jackknifing temporarily ignores the first observation, estimates the model using observations two through n, classifies the first, held-out observation, and notes whether it was correctly or incorrectly classified. It then puts the first observation back into the data set, ignores the second, and repeats the process. Repeating the procedure n times, jackknifing creates n different models and tallies up overall modeling performance based on the behavior of each of those models on a single omitted observation. Omitting only a single observation minimizes the loss of information to which the modeling process is exposed, but it can require a lot of computer time and does not produce a single model as its result. How do you combine n potentially very different models, and how do you interpret their evaluation when each holdout sample was tested on a different model? The TRI software package See5 includes k-fold cross-validation, an approach less extreme than either a fixed holdout sample or jackknifing. K-fold cross-validation divides the data set into k equal-sized partitions, ignores them one at a time, estimates a model, and computes its error rate on the ignored partition.

Neural Networks

Artificial neural networks simulate human cognition by modeling the inherent parallelism of neural circuits found in the brain using mathematical models of how the circuits function. The models typically are composed of a layer of input nodes (independent variables), one or more layers of intermediate (or hidden) nodes, and a layer of output nodes (dependent variables). Nodes in a layer are each connected by one-way arcs to nodes in subsequent layers, and signals are sent over those arcs. Behavior propagates from values set in the input nodes, sent over arcs through the hidden layer(s), and results in the establishment of values in the output layer.
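A minimal sketch of that feedforward propagation follows, using a logistic node function; node thresholds are omitted for brevity, and the weights are arbitrary illustrative values, not an estimated model.

```python
import math

def logistic(x):
    """Standard logistic squashing function, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, weights):
    """Propagate input values layer by layer.

    `weights` is one matrix per layer; weights[k][j][i] connects node i of the
    previous layer to node j of layer k. Each node's value is the logistic
    function of the weighted sum of its incoming signals."""
    values = inputs
    for layer in weights:
        values = [logistic(sum(w * v for w, v in zip(row, values))) for row in layer]
    return values

# Two inputs, one hidden layer of two nodes, one output node (arbitrary weights):
out = forward([1.0, 0.0], weights=[[[0.5, -0.5], [1.0, 1.0]],   # input -> hidden
                                   [[2.0, -2.0]]])              # hidden -> output
```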
The value of a node is a nonlinear, usually logistic, function of the weighted sum of the values sent to it by nodes that are connected to it. A node forwards a signal to a subsequent node only if it exceeds a threshold value. An ANN model is specified by defining the number of layers it has, the number of nodes in each layer, the way in which the nodes are connected, and the nonlinear function used to compute node values. Estimation of the specified model involves determining the best set of weights for the arcs and threshold values for the nodes. An ANN is trained (that is, its parameters are estimated) using nonlinear optimization. In the backpropagation algorithm, the first-order gradient descent method implemented in the BrainMaker software we used, the network propagates inputs through the network, derives a set of output values, compares the computed output to the provided (corresponding) output, and calculates the difference between the two numbers (i.e., the error). If a difference exists, the algorithm proceeds backward through the hidden layer(s) to the input layer, adjusting the weights between connections based on their gradients to reduce the sum of squared output errors. The algorithm stops when the total error is acceptably small. Neural networks are frequently used in data mining because, in adjusting the number of layers, nodes, and connections, the user can make an ANN model almost any smooth

mathematical function. While inputs to an ANN might be integer or discrete, the weighted nonlinear transformations of the inputs as part of their being fed forward through the network result in continuous output level values. Continuous output levels result in a more tractable error measure for the backpropagation algorithm to optimize and also permit the interpretation of outputs as partial group membership. Partial group membership means an ANN is capable of representing inexact matching, if that is the way to find a best fit for some set of input data. It also can model classification tasks that are inherently fuzzy, that is, tasks that generally are simple for humans but traditionally difficult for computers. Because of their flexibility, ANNs may be difficult to specify. Adding too much structure to an ANN makes it prone to overfitting, but too little structure may prevent it from capturing the patterns in the data set. Those patterns are represented in the arc (connection) weights and the node thresholds, a form that is not transparent to humans. Computationally, if the training set is large, backpropagation and related algorithms may require a lot of time. ANNs may be good classifiers and predictors as compared with linear methods, but the mathematical representations of the various nodes, and the relative importance of the independent variables, tend to be somewhat less accessible to the end user than induced decision trees, rules, and even classification functions. Neural networks often are treated as a black box, with only the inputs and outputs visible to the decision maker. The classification chosen by the ANN is easily visible to the user, but the decision process that led to that classification is not.

Linear Discriminant Analysis

Linear discriminant analysis (LDA) is the most common classification method in use, and also one of the oldest [13], having been developed by Fisher in the 1930s.
Because of its popularity and long history, we provide only a brief overview of the method here. Like TRI and ANN, LDA partitions a data set into two or more classes, to which new observations or cases can then be assigned. Because it uses linear functions of the independent variables to define those partitions, LDA is similar to multiple regression. The primary distinction between LDA and multiple regression lies in the form of the dependent variable. Multiple regression uses a continuous dependent variable. The dependent variable in LDA is ordinal or nominal. For a data set of cases, each with m attributes and n categories, LDA constructs classification functions of the form

c_1 a_1 + c_2 a_2 + ... + c_m a_m + c_0,

where c_i is the coefficient for the case attribute a_i and c_0 is a constant, for each of the n categories. An observation is assigned to the class for which it has the highest classification function value.
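That assignment rule can be sketched directly. The two categories and their coefficients below are made-up illustrative values, not estimated from any data set.

```python
def classify(attributes, functions):
    """Assign an observation to the category whose classification function
    c_1*a_1 + ... + c_m*a_m + c_0 gives the highest value.

    `functions` maps each category name to (coefficients, constant)."""
    def score(coeffs, const):
        return sum(c * a for c, a in zip(coeffs, attributes)) + const
    return max(functions, key=lambda k: score(*functions[k]))

# Two hypothetical categories over two attributes (arbitrary coefficients):
functions = {"C1": ([2.0, -1.0], 0.5),
             "C2": ([0.5, 1.5], -0.2)}
label = classify([1.0, 0.0], functions)
```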

Table 2. Comparative Data-Mining Method Studies Across Domains. The table cross-tabulates study citations by method (neural networks, tree/rule induction, and regression) across the domains of asset writedowns, bankruptcy, bank failure, inventory accounting, lending/credit risk, corporate acquisitions, corporate earnings, management fraud, and mortgage choice; the bankruptcy studies, for example, include [2, 6, 8, 14, 19] and [5, 19, 21, 26].

Studies of Supervised Learning Approaches

PREVIOUS RESEARCH HAS INVESTIGATED AND COMPARED SUPERVISED, inductive-learning techniques in a number of domains (see Table 2), with mixed results. Some comparative studies suggest the superiority of neural networks over other techniques. For example, in bankruptcy prediction, Tam and Kiang [27] found that neural networks performed better than discriminant analysis, logit analysis, k-nearest neighbor, and tree induction (ID3). Fanning and Cogger [8] were somewhat more tentative in comparing neural-network models with both logistic regression and existing bankruptcy models. Although they found no particular technique to be superior across all comparisons, they argued that neural nets were competitive with, and often superior to, the logit and bankruptcy model results. In a subsequent study of fraudulent financial statements, Fanning and Cogger [9] reported that a neural network was better able than the traditional statistical methods to identify management fraud. By contrast, previous comparative studies had shown decision tree/rule induction to be superior to other methods. Messier and Hansen [21], for example, compared decision tree/rule induction with discriminant analysis, as well as with individual and group judgments. On the basis of the attributes selected and the percentage of correct predictions, they concluded that the induction technique outperformed the other approaches in the prediction of bankruptcies.
Weiss and Kapouleas [30] compared statistical pattern recognition (linear and quadratic discriminant analysis, nearest neighbor, and Bayesian classification), neural networks, and machine learning methods (rule/decision tree induction methods: ID3/C4.5 and Predictive Value Maximization). They concluded that the rule induction methods were superior to the other methods with respect to accuracy of classification, training time, and compatibility with human reasoning. Other multiple method studies have been less conclusive and suggest that performance is dependent on other factors such as the type of task and the nature of the data

set. Chung and Tam [6], for example, compared three inductive-learning models across five managerial tasks (in construction project assessment and bankruptcy prediction). They concluded that model performance generally was task-dependent, although neural networks tended to produce relatively consistent predictions across task domains. In assessing LIFO/FIFO classification methods, Liang et al. [20] reported that neural networks tended to perform best overall in holdout tests, and when the data contained dominant nominal variables. However, when nominal variables were not dominant, probit provided better performance. Sen and Gibbs [24] studied corporate takeover models, comparing six neural network models and logistic regression. They found little difference in predictive performance among them, indicating that they all performed poorly. Boritz et al. [2] tested the performance of neural networks with several regression techniques, as well as with well-known bankruptcy models. No approach was clearly superior, and the ability of an induced model to distinguish between bankrupt and nonbankrupt firms was dependent on the number of bankrupt firms in the training set.

Bases for Judging Performance

JUDGING THE PERFORMANCE OF ONE DATA-MINING METHOD over another requires consideration of several modeling objectives.

Predictive Accuracy

Most of the comparative studies we cited above measured the predictive accuracy and error rate of each method. Messier and Hansen [21], for example, compared the percentage of correct classifications produced by their induced rule system to the percentage drawn from discriminant analysis, as well as individual and group judgments. As suggested by the review above, it is difficult to make general claims about the relative predictive accuracy of the various methods.
Performance is highly dependent on the domain and setting, the size and nature of the data set, the presence of noise and outliers in the data, and the validation technique(s) used. Predictive accuracy tends to be an important and prevalent indication of a method's performance, but others also are important.

Comprehensibility

Henery [13] uses this term to indicate the need for a classification method to provide clearly understood and justifiable decision support to a human manager or operator. TRI systems, because they explicitly structure the reasoning underlying the classification process, tend to have an inherent advantage over both traditional statistical classification models and ANN. Tessmer et al. [28] argue that, while the traditional statistical methods provide efficient predictive accuracy, they do not provide an explicit description of the classification process. Weiss and Kulikowski [31] suggest that any explanation resident in mathematical inferencing techniques is buried in

computations that are inaccessible to the mathematically uninclined. The results of such techniques might be misunderstood and misused. Rules and decision trees, on the other hand, are more compatible with human reasoning and explanations.

Speed of Training and Classification

Speed can be an important consideration in some situations [31]. Henery [13] suggests that a number of real-time applications, for example, must sacrifice some accuracy in order to classify and process items in a timely fashion. Again, because of situational dependencies, it is difficult to make generalizations about the computational expense of each method. ANNs estimated using backpropagation may require an unacceptably large amount of time [31].

Modeling and Simulation of Human Decision Behavior

Using case descriptions and human judgments as input, data-mining methods also can be used for the automated modeling and acquisition of expert knowledge. Kim et al. [16] determined that the performance of a particular method in constructing an inductive model of a human decision strategy is dependent, in part, on conformance of the model with the strategy. Linear models tend to simulate linear (or compensatory) decision strategies more accurately, while nonlinear models are more appropriate for nonlinear (or noncompensatory) strategies. Kim et al. found ANN to be superior to decision tree induction (ID3), logistic regression, and discriminant analysis, even in simulations of linear decision processes. They note that the flexibility of neural networks in forming both linear and nonlinear decision models contributes to their superior performance relative to the other methods.

Selection of Attributes

The attributes selected for consideration and their relative influence on the outcome are an indication of the performance of a method. The concept of diagnostic validity of induction methods was proposed by Currim et al.
[7], and was used by Messier and Hansen [21] to compare the attributes selected by each of their induction methods.

Data Set and Tools

OUR DATA SET IS A CENSUS OF 59,864 CASES OF SURGERY PERFORMED between 1989 and 1995 at a large university teaching hospital (see [1] for a description of the computerized collection system). Each case is represented as a single record containing twenty-three attributes describing patient demographic and case-specific information. Of the twenty-three factors available, the following were chosen for analysis: diagnoses (one, two, or three ICD-9 codes), procedures (one, two, or three CPT codes), type of anesthesia (general, monitor, regional), patient demographic information (age, sex), the patient's overall condition (based on the six-value ASA ordinal coding

scheme), emergency/nonemergency status, inpatient/outpatient status, and the surgeon who performed the procedure (identified by number). The remaining fields are the time durations of surgical events. We chose 819 records dealing with ICD-9 code 180.9, malignant neoplasm of the cervix, unspecified, because there is a fairly large fanout from it to the associated CPTs across the records, presenting a challenge to any classification method. ICD-9 180.9 is associated with 139 different CPTs, although 107 of the CPTs appear in four or fewer records. Because the presence of outliers impedes the detection of general patterns, we followed the standard data-mining approach of removing them. Of the 819 records containing ICD-9 180.9, 160 records contained one of the 107 CPTs. Those records were judged to be outliers and were removed, leaving 659 records linked to a total of 32 CPTs remaining in the data set. Table 3 provides a detailed description of each of the 32 remaining CPTs. We used commercial software instead of programming the methods ourselves, to eliminate possible bias caused by our own computer skills. We used Statgraphics version 3.1 for LDA, BrainMaker version 3.1 for ANN, and See5 version 1.05 for TRI.

Methodology

WE IDENTIFIED TEN DISTINCT WAYS OF REPRESENTING THE MULTIPLE CLASSIFICATION problem. As shown in Table 4, not all methods are capable of estimating parameters for each of the representations. Our strategy was to evaluate each method from a decision support perspective. That is, how does a method fundamentally constrain the types of representation that can (and should) be employed by a person using the method?

Discriminant Analysis

Six LDA models were constructed, three basic representations with two variations on each. The three basic representations (multiple, replicated, and binary) differ in their treatment of the dependent variables; recall that each case can be a member of one, two, or three classes.
For each basic representation, two variations on the treatment of prior probabilities were included: (1) prior probabilities for each group are assumed to be equal, and (2) prior probabilities for each group are assumed to be proportional to the number of observations in each group. The basic representations and variations are described below.

Dependent Variables Represented as Multiple Values (LDAMult)

The dependent variable is a string containing all of a record's CPTs, space delimited. For example, a record containing only one CPT has that code alone as its dependent variable, while a record containing three CPTs (such as 58210 and 77760 together with a third code) has all three codes, space delimited, as its dependent variable. Because one dependent variable is used for all CPT codes present, a single linear discriminant analysis could be performed for each of the two variations:

equal prior probabilities (LDAMultE) and prior probabilities proportional to the sample (LDAMultP). The advantage of LDAMult is that each observation is represented exactly once. The disadvantage is its inability to represent class intersections. An observation that is a member of both categories a and b (i.e., dependent variable = "a b") is considered to be completely separate from observations that are members of only a or only b.

Table 3. Top 32 CPTs and Their Descriptions (CPT codes and frequencies omitted)

- Placement of central venous catheter (subclavian, jugular, or other vein), e.g., for central venous pressure, hyperalimentation, hemodialysis, or chemotherapy; percutaneous, over age 2
- Biopsy or excision of lymph nodes, superficial (separate procedure)
- Limited lymphadenectomy for staging (separate procedure), pelvic and para-aortic
- Limited lymphadenectomy for staging (separate procedure), retroperitoneal (aortic and/or splenic)
- Retroperitoneal transabdominal lymphadenectomy, extensive, including pelvic, aortic, and renal nodes (separate procedure)
- Enterectomy, resection of small intestine, single resection and anastomosis
- Proctosigmoidoscopy, rigid, diagnostic, with or without collection of specimens by brushing or washing (separate procedure)
- Cholecystectomy
- Exploratory laparotomy, exploratory celiotomy with or without biopsy (separate procedure)
- Exploration, retroperitoneal area with or without biopsy (separate procedure)
- Cystourethroscopy (separate procedure)
- Cystourethroscopy, with biopsy
- Cystourethroscopy, with insertion of indwelling ureteral stent (e.g., Gibbons or double-J type)
- Laparoscopy, surgical, with retroperitoneal lymph node sampling (biopsy), single or multiple
- Biopsy of vaginal mucosa, simple (separate procedure)
- Pelvic examination under anesthesia
- Biopsy, single or multiple, or local excision of lesion, with or without fulguration (separate procedure)
- Cauterization of cervix, laser ablation
- Conization of cervix, with or without fulguration, with or without dilation and curettage, with or without repair, cold knife or laser
- Total abdominal hysterectomy (corpus and cervix), with or without removal of tubes, with or without removal of ovaries
- Total abdominal hysterectomy, including partial vaginectomy, with para-aortic and pelvic lymph node sampling, with or without removal of tubes, with or without removal of ovaries

Table 3 (continued)

- Radical abdominal hysterectomy, with bilateral total pelvic lymphadenectomy and para-aortic lymph node sampling (biopsy), with or without removal of tubes, with or without removal of ovaries
- Pelvic exenteration for gynecologic malignancy, with total abdominal hysterectomy or cervicectomy, with or without removal of tubes, with or without removal of ovaries, with removal of bladder and ureteral transplantations, and/or abdominoperineal resection of rectum and colon and colostomy, or any combination thereof
- Vaginal hysterectomy
- Salpingo-oophorectomy, complete or partial, unilateral or bilateral (separate procedure)
- Laparotomy, for staging or restaging of ovarian malignancy (second look), with or without omentectomy, peritoneal washing, biopsy of abdominal and pelvic peritoneum, diaphragmatic assessment with pelvic and limited para-aortic lymphadenectomy
- Unlisted procedure, female genital system (nonobstetrical)
- Intracavitary radioelement application, simple
- Intracavitary radioelement application, intermediate
- Intracavitary radioelement application, complex
- Interstitial radioelement application, intermediate
- Interstitial radioelement application, complex

Dependent Variables Represented as Single Values, Multiple Values Replicated (LDARep)

A single record containing multiple values for the dependent variable is decomposed into multiple records, one for each value of the dependent variable. A record containing three CPTs (e.g., 58210, 77760, and a third code) is represented three times in the data set, once with each CPT as its dependent variable; all three records have the same independent variable values (see figure 3).
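As a concrete sketch of the two encodings described so far, the LDAMult string label and the LDARep replication might be implemented as follows. This is our illustration in Python (the study itself used commercial packages), and the sorted ordering of codes within the LDAMult string is our assumption; the feature values are invented.

```python
def ldamult_label(cpts):
    """LDAMult: collapse all of a record's CPT codes into a single
    space-delimited, string-valued dependent variable."""
    # Sorting is an assumption, so the same code set always yields
    # the same class label regardless of input order.
    return " ".join(str(c) for c in sorted(cpts))

def ldarep_records(features, cpts):
    """LDARep: decompose one multi-CPT record into one record per CPT,
    duplicating the independent-variable values."""
    return [(features, cpt) for cpt in cpts]

features = ("age=45", "sex=F", "anesthesia=general")  # illustrative values
cpts = [58210, 77760, 38562]

print(ldamult_label(cpts))                  # "38562 58210 77760": one record, one class
print(len(ldarep_records(features, cpts)))  # 3: one replicated record per CPT
```

Note that the string "38562 58210 77760" is a class unrelated to the class "58210", which is exactly the intersection problem described above.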
In this representation, because only the relative sizes of the classification function values are meaningful, a single-step estimation process provides a rank order for membership in each of the categories but does not provide any insight regarding the number of categories into which the observation is to be classified (unlike neural nets or logistic regression, for which 0.5 is a commonly accepted threshold). Therefore, a two-step process is required: first, use an LDA model to estimate the number of categories to which the observation belongs, and then use a separate LDA to determine what those categories are. LDARepE is that two-stage process with equal prior probabilities for both parts of the process, and LDARepP is the corresponding technique using proportional probabilities. The advantage of LDARep is that it recognizes an observation that is a member of

more than one class as being in each of those classes individually. The disadvantage is that the representation does not differentiate a single observation that is simultaneously in multiple classes, in which the replication of its independent variable values is a representational necessity, from multiple observations with identical independent variables that are, however, members of different classes. That is, a set intersection and a contradiction have the same representation.

Table 4. Model Representations Across Methods

- Discriminant analysis: multiple (equal, proportional); replicated (equal, proportional); binary (equal, proportional)
- Neural networks: binary, with no hidden layer; one hidden layer, 57 nodes; one hidden layer, 114 nodes
- Tree/rule induction: multiple

Figure 3. Dependent Variables Represented as Single Values, Multiple Values Replicated (LDARep)

Figure 4. Dependent Variables Represented as Single Binary Values (LDABin)

Dependent Variables Represented as Single Binary Values (LDABin)

The dependent variable is represented as a series of binary values, one for each possible value of the dependent variable (see figure 4). An observation is considered a member of a category if its classification function value for assigning membership is larger than its classification function value for not assigning membership. This representation requires a separate LDA model for each class, thirty-two in the case of our data set. LDABinE is this approach with equal probabilities, and LDABinP has proportional probabilities.
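A minimal sketch of the LDABin encoding, again our own Python illustration with an invented class list and record:

```python
def ldabin_labels(cpts, all_cpts):
    """LDABin: one 0/1 dependent variable per possible CPT; each
    indicator is then fitted by its own discriminant model
    (thirty-two separate models for this data set)."""
    present = set(cpts)
    return {c: int(c in present) for c in all_cpts}

# A record belonging to two of three possible classes:
labels = ldabin_labels([58210, 77760], [38562, 58210, 77760])
print(labels)  # {38562: 0, 58210: 1, 77760: 1}
```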

The advantages of this approach are that subset relationships are preserved and that each observation occurs only once in the data set, so that intersections are represented differently from contradictions. Its disadvantage is that an individual observation might be a member of no classes or of too many classes (for our data, more than three).

Neural Networks

The ANN representation of the dependent variable is the same as in LDABin, because of the way values are propagated through the network to the output nodes. The variations in the model are functions of the structure of the hidden layer(s). One hidden layer is all that needs to be considered, if structure between the input and output layers is desired, but the number of nodes in it is a matter of choice [31]. With a hidden layer and a sigmoid function for combining activations, the ANN performs logistic regression. Deferring to commercial software, we allowed BrainMaker to suggest the size of a hidden layer. With 25 input nodes and 32 output nodes, it recommended 57, their sum. We also considered a network with twice that many nodes in the hidden layer, a structure that might tend to overfit the data. The modeling alternatives are: (1) a neural network with no hidden layer (NN0), (2) a network with 57 nodes in the hidden layer (NN57), and (3) a network with 114 nodes in the hidden layer (NN114). Activation at an output node, interpreted as degree of membership, ranges from zero to one. An observation is considered a member of any group for which it generates an activation value above 0.5, and is considered not a member of any group for which it generates an activation level below 0.5. All three ANN models have the same advantages and disadvantages as LDABin. The ANN model representation is preferable to that of LDABin because LDABin requires, in our case, thirty-two separate binary models, whereas ANN simultaneously models all thirty-two binary alternatives in a single model.
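The sizing rule and the 0.5 membership threshold can be sketched as follows (the activation values are invented for illustration):

```python
def suggested_hidden_size(n_inputs, n_outputs):
    # BrainMaker's suggestion in this study: the sum of the input
    # and output node counts (25 + 32 = 57).
    return n_inputs + n_outputs

def memberships(activations, threshold=0.5):
    """Treat each output activation (0..1) as a degree of membership;
    any output node above the threshold is a predicted class."""
    return sorted(cls for cls, a in activations.items() if a > threshold)

assert suggested_hidden_size(25, 32) == 57
print(memberships({58210: 0.91, 77760: 0.48, 38562: 0.62}))  # [38562, 58210]
```

Unlike LDABin's thirty-two separate models, the thresholding here reads all class memberships off the output layer of one network.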
Decision Tree/Rule Induction

We used a single TRI representation, analogous to LDAMult. Multiple dependent variables in each record were represented collectively within a single string ("a b c"). A representation similar to that of LDARep is not possible because See5 trees do not rank-order the classification alternatives. See5 does allow for differential misclassification costs, but those costs cannot represent equal and proportional prior probabilities in a way equivalent to that in LDA. The advantages and disadvantages of LDAMult apply to our See5 representation. The model has the advantage of constraining the number of predicted categories to between 1 and 3, inclusive, and the disadvantage of not recognizing the intersections of classes.

Results and Analysis

COMPARISON OF THE METHODS REQUIRES THE CONSIDERATION OF a number of issues related to the measurement of performance in multiple classification problems. For

example, consider a single test case having two values for the dependent variable ("1 2"). If the method predicts the value "1 2", there is little doubt that the method has performed without error. If the method predicts the value "3", it is incorrect for three reasons. First, it failed to recognize the case as having multiple classes. Second, it failed to include either 1 or 2 in its predicted value for the dependent variable. Third, it included an incorrect value (i.e., "3") in its prediction. If the method predicts "1 3", it has identified the correct number of classes (2), while also correctly identifying one of the values but incorrectly identifying the other. If the method predicts "1 2 3", it has identified both of the correct classes, but it also has predicted the wrong number of classes, and in doing so has included a class that is incorrect. If it predicts "1", it has predicted the wrong number of classes. However, the class it has predicted is correct, and it refrained from predicting any incorrect classes.

The above list of error alternatives argues for what we call an audit matrix within which multiple classification results can be judged. We use the term audit to denote a situation assessment, performed by an auditor or decision maker, which attempts to reconcile the observed characteristics of a situation with a priori expectations of that situation (i.e., actual versus predicted characteristics; see figure 5). That is, an auditor, when initially encountering a situation, will expect to observe certain characteristics while also expecting not to observe others. Subsequently, if expected characteristics are observed and unexpected characteristics are not observed, the situation matches expectations and the classification is judged correct.
However, predictions can vary from the observed situation in two important ways: (1) the auditor might expect (or predict) a characteristic that is not present (i.e., a false positive), or (2) the auditor might observe a characteristic that had not been predicted (i.e., a false negative). An audit matrix has four cells, two of which indicate correct behavior of a method and two of which indicate incorrect behavior. The two correct cells are the number of classes predicted to be present and actually observed and the number predicted to be absent and actually absent. The incorrect cells are the number of classes predicted to be present but actually absent, and the number of classes predicted to be absent that actually were present. For example, figure 6 shows an audit matrix for an observation that was predicted to be a member of class 3, but that actually is a member of classes 1 and 2.

Reduction of an audit matrix to a single number could be done with a weighted linear function of the cell values, such as

classification score = W_m * M - (W_fp * FP + W_fn * FN),

where W_m, W_fp, and W_fn are weights assigned to the number of matches, false positives, and false negatives, respectively, and M is the number of matches, FP is the number of false positives, and FN is the number of false negatives. The weights would be application-specific and would relate to the relative costs of the two types of misclassification.

Table 5 illustrates our approach using the results from the ten models for a single observation that was a member of three CPT classes, including 38500 and 38562 (shown as value = 1 in the column labeled "actual"). The columns labeled 1 through 10 under the header "representation" contain the output of each model given the values of the

independent variables of the observation (1 = CPT is predicted; 0 = CPT is not predicted). For example, LDABinE predicts that seven CPTs would be associated with this particular observation, but is incorrect on all seven (7 false positives, 3 false negatives, and 0 matches). NN57 predicts 5 CPTs and is correct on two (3 false positives, 1 false negative, 2 matches).

Figure 5. An Audit Matrix for Structuring and Evaluating Multiple Classification Results

                          Actual: present     Actual: not present
Predicted: present        Match               False positive
Predicted: not present    False negative      Match

Figure 6. An Audit Matrix Example for Actual = "1 2" and Predicted = "3"

                          Actual: present     Actual: not present
Predicted: present        (none)              3
Predicted: not present    1, 2                (none)

Table 6 summarizes the results for all 659 cases in the data set. In the absence of a value function for relative misclassification costs, the performance of each of the representations can be measured by (1) the number of correct predictions relative to the number of misclassifications and (2) the relative number of false positives and false negatives. Table 7 compares the correct predictions (represented as a percentage: [Observed & Predicted] / Sum[misclassifications] * 100) across each of the representations, for each type of case (i.e., those cases containing 1, 2, and 3 CPTs) and for all cases. As shown, 5 of the 10 representations are able to classify accurately over 50 percent of the single-CPT cases. Although NN57 is the most accurate model overall (92.41 percent), all of the neural network models as a group outperform the other representations. However, the performance of all representations deteriorates dramatically when classifying the multiple dependent variable cases. In the 2-CPT cases, LDARepP has the highest accuracy, but is still exceptionally poor (6.25 percent). In the 3-CPT cases, LDABinP is highest with 7.62 percent. Notably, the neural network models were among the poorest performers.
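The audit matrix and the weighted classification score can be sketched as follows. This is our Python illustration; the unit weights are placeholders for the application-specific costs described in the text.

```python
def audit_cells(actual, predicted, n_classes=32):
    """Audit-matrix cells for one observation: matches, false
    positives, false negatives, and correct absences."""
    actual, predicted = set(actual), set(predicted)
    m = len(actual & predicted)    # predicted present, actually present
    fp = len(predicted - actual)   # predicted present, actually absent
    fn = len(actual - predicted)   # predicted absent, actually present
    tn = n_classes - m - fp - fn   # predicted absent, actually absent
    return m, fp, fn, tn

def classification_score(m, fp, fn, w_m=1.0, w_fp=1.0, w_fn=1.0):
    # Weighted linear reduction of the audit matrix to a single number.
    return w_m * m - (w_fp * fp + w_fn * fn)

# Figure 6's example: predicted class 3, actual classes 1 and 2.
m, fp, fn, tn = audit_cells({1, 2}, {3})
print(m, fp, fn)                        # 0 1 2
print(classification_score(m, fp, fn))  # -3.0
```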
The relative performance of the representations is also reflected in the number and nature of the classification errors. Those include the misclassification rate (the number of misclassifications divided by the number of observations in each group) and the proportion of false

negatives and false positives, which are important in considering the relative costs of misclassifying observations. Figures 7 through 10 graphically show the relative number of false negatives and false positives for all cases, and for cases with 3 CPTs, 2 CPTs, and 1 CPT, respectively. Figure 11 shows results for all multiple-CPT cases combined.

Table 5. Classification and Misclassification Results for an Observation Containing Three Dependent Variables*

(columns: representations 1 through 10; rows: CPT, actual, total predicted, total correct; cell values not preserved in this transcription)

* Each column corresponds to one of the ten representations: 1 = LDABinE; 2 = LDABinP; 3 = LDAMultE; 4 = LDAMultP; 5 = LDARepE; 6 = LDARepP; 7 = See5; 8 = NN0; 9 = NN57; 10 = NN114.
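The error measures just described, the misclassification rate and the split between false positives and false negatives, can be sketched as follows (the counts are invented for illustration):

```python
def misclassification_rate(n_misclassified, n_observations):
    # Misclassifications divided by the number of observations in the group.
    return n_misclassified / n_observations

def fp_fn_proportions(fp, fn):
    """Share of false positives vs. false negatives among all errors,
    relevant when the two kinds of error carry different costs."""
    total = fp + fn
    return fp / total, fn / total

print(misclassification_rate(25, 100))  # 0.25
print(fp_fn_proportions(3, 1))          # (0.75, 0.25)
```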

Table 6. Audit Matrix of Results from Each Method and Representation

(rows: LDARepE, LDARepP, LDABinE, LDABinP, LDAMultE, LDAMultP, NN0, NN57, NN114, See5, grouped by all CPTs, 3 CPTs, 2 CPTs, and 1 CPT; columns: misclassifications (observed & not predicted; not observed & predicted) and matches (observed & predicted; not observed & not predicted); cell values not preserved in this transcription)


More information

Lecture 6. Artificial Neural Networks

Lecture 6. Artificial Neural Networks Lecture 6 Artificial Neural Networks 1 1 Artificial Neural Networks In this note we provide an overview of the key concepts that have led to the emergence of Artificial Neural Networks as a major paradigm

More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

More information

A Property & Casualty Insurance Predictive Modeling Process in SAS

A Property & Casualty Insurance Predictive Modeling Process in SAS Paper AA-02-2015 A Property & Casualty Insurance Predictive Modeling Process in SAS 1.0 ABSTRACT Mei Najim, Sedgwick Claim Management Services, Chicago, Illinois Predictive analytics has been developing

More information

HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION

HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION Chihli Hung 1, Jing Hong Chen 2, Stefan Wermter 3, 1,2 Department of Management Information Systems, Chung Yuan Christian University, Taiwan

More information

Lecture 10: Regression Trees

Lecture 10: Regression Trees Lecture 10: Regression Trees 36-350: Data Mining October 11, 2006 Reading: Textbook, sections 5.2 and 10.5. The next three lectures are going to be about a particular kind of nonlinear predictive model,

More information

A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier

A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier G.T. Prasanna Kumari Associate Professor, Dept of Computer Science and Engineering, Gokula Krishna College of Engg, Sullurpet-524121,

More information

Role of Neural network in data mining

Role of Neural network in data mining Role of Neural network in data mining Chitranjanjit kaur Associate Prof Guru Nanak College, Sukhchainana Phagwara,(GNDU) Punjab, India Pooja kapoor Associate Prof Swami Sarvanand Group Of Institutes Dinanagar(PTU)

More information

DECISION TREE ANALYSIS: PREDICTION OF SERIOUS TRAFFIC OFFENDING

DECISION TREE ANALYSIS: PREDICTION OF SERIOUS TRAFFIC OFFENDING DECISION TREE ANALYSIS: PREDICTION OF SERIOUS TRAFFIC OFFENDING ABSTRACT The objective was to predict whether an offender would commit a traffic offence involving death, using decision tree analysis. Four

More information

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets http://info.salford-systems.com/jsm-2015-ctw August 2015 Salford Systems Course Outline Demonstration of two classification

More information

Electronic Payment Fraud Detection Techniques

Electronic Payment Fraud Detection Techniques World of Computer Science and Information Technology Journal (WCSIT) ISSN: 2221-0741 Vol. 2, No. 4, 137-141, 2012 Electronic Payment Fraud Detection Techniques Adnan M. Al-Khatib CIS Dept. Faculty of Information

More information

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM Thanh-Nghi Do College of Information Technology, Cantho University 1 Ly Tu Trong Street, Ninh Kieu District Cantho City, Vietnam

More information

Classification and Prediction

Classification and Prediction Classification and Prediction Slides for Data Mining: Concepts and Techniques Chapter 7 Jiawei Han and Micheline Kamber Intelligent Database Systems Research Lab School of Computing Science Simon Fraser

More information

On the effect of data set size on bias and variance in classification learning

On the effect of data set size on bias and variance in classification learning On the effect of data set size on bias and variance in classification learning Abstract Damien Brain Geoffrey I Webb School of Computing and Mathematics Deakin University Geelong Vic 3217 With the advent

More information

What is Data Mining, and How is it Useful for Power Plant Optimization? (and How is it Different from DOE, CFD, Statistical Modeling)

What is Data Mining, and How is it Useful for Power Plant Optimization? (and How is it Different from DOE, CFD, Statistical Modeling) data analysis data mining quality control web-based analytics What is Data Mining, and How is it Useful for Power Plant Optimization? (and How is it Different from DOE, CFD, Statistical Modeling) StatSoft

More information

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL The Fifth International Conference on e-learning (elearning-2014), 22-23 September 2014, Belgrade, Serbia BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL SNJEŽANA MILINKOVIĆ University

More information

Experiments in Web Page Classification for Semantic Web

Experiments in Web Page Classification for Semantic Web Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address

More information

Analecta Vol. 8, No. 2 ISSN 2064-7964

Analecta Vol. 8, No. 2 ISSN 2064-7964 EXPERIMENTAL APPLICATIONS OF ARTIFICIAL NEURAL NETWORKS IN ENGINEERING PROCESSING SYSTEM S. Dadvandipour Institute of Information Engineering, University of Miskolc, Egyetemváros, 3515, Miskolc, Hungary,

More information

Operations Research and Knowledge Modeling in Data Mining

Operations Research and Knowledge Modeling in Data Mining Operations Research and Knowledge Modeling in Data Mining Masato KODA Graduate School of Systems and Information Engineering University of Tsukuba, Tsukuba Science City, Japan 305-8573 koda@sk.tsukuba.ac.jp

More information

Meta-learning. Synonyms. Definition. Characteristics

Meta-learning. Synonyms. Definition. Characteristics Meta-learning Włodzisław Duch, Department of Informatics, Nicolaus Copernicus University, Poland, School of Computer Engineering, Nanyang Technological University, Singapore wduch@is.umk.pl (or search

More information

The Data Mining Process

The Data Mining Process Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data

More information

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland Data Mining and Knowledge Discovery in Databases (KDD) State of the Art Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland 1 Conference overview 1. Overview of KDD and data mining 2. Data

More information

Neural network models: Foundations and applications to an audit decision problem

Neural network models: Foundations and applications to an audit decision problem Annals of Operations Research 75(1997)291 301 291 Neural network models: Foundations and applications to an audit decision problem Rebecca C. Wu Department of Accounting, College of Management, National

More information

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore. CI6227: Data Mining Lesson 11b: Ensemble Learning Sinno Jialin PAN Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore Acknowledgements: slides are adapted from the lecture notes

More information

A New Approach for Evaluation of Data Mining Techniques

A New Approach for Evaluation of Data Mining Techniques 181 A New Approach for Evaluation of Data Mining s Moawia Elfaki Yahia 1, Murtada El-mukashfi El-taher 2 1 College of Computer Science and IT King Faisal University Saudi Arabia, Alhasa 31982 2 Faculty

More information

IBM SPSS Direct Marketing 23

IBM SPSS Direct Marketing 23 IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release

More information

ECLT5810 E-Commerce Data Mining Technique SAS Enterprise Miner -- Regression Model I. Regression Node

ECLT5810 E-Commerce Data Mining Technique SAS Enterprise Miner -- Regression Model I. Regression Node Enterprise Miner - Regression 1 ECLT5810 E-Commerce Data Mining Technique SAS Enterprise Miner -- Regression Model I. Regression Node 1. Some background: Linear attempts to predict the value of a continuous

More information

6.2.8 Neural networks for data mining

6.2.8 Neural networks for data mining 6.2.8 Neural networks for data mining Walter Kosters 1 In many application areas neural networks are known to be valuable tools. This also holds for data mining. In this chapter we discuss the use of neural

More information

An Overview of Knowledge Discovery Database and Data mining Techniques

An Overview of Knowledge Discovery Database and Data mining Techniques An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,

More information

SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING

SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING AAS 07-228 SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING INTRODUCTION James G. Miller * Two historical uncorrelated track (UCT) processing approaches have been employed using general perturbations

More information

COMBINING THE METHODS OF FORECASTING AND DECISION-MAKING TO OPTIMISE THE FINANCIAL PERFORMANCE OF SMALL ENTERPRISES

COMBINING THE METHODS OF FORECASTING AND DECISION-MAKING TO OPTIMISE THE FINANCIAL PERFORMANCE OF SMALL ENTERPRISES COMBINING THE METHODS OF FORECASTING AND DECISION-MAKING TO OPTIMISE THE FINANCIAL PERFORMANCE OF SMALL ENTERPRISES JULIA IGOREVNA LARIONOVA 1 ANNA NIKOLAEVNA TIKHOMIROVA 2 1, 2 The National Nuclear Research

More information

Numerical Algorithms Group

Numerical Algorithms Group Title: Summary: Using the Component Approach to Craft Customized Data Mining Solutions One definition of data mining is the non-trivial extraction of implicit, previously unknown and potentially useful

More information

Healthcare Measurement Analysis Using Data mining Techniques

Healthcare Measurement Analysis Using Data mining Techniques www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242 Volume 03 Issue 07 July, 2014 Page No. 7058-7064 Healthcare Measurement Analysis Using Data mining Techniques 1 Dr.A.Shaik

More information

IBM SPSS Direct Marketing 22

IBM SPSS Direct Marketing 22 IBM SPSS Direct Marketing 22 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 22, release

More information

NEURAL NETWORKS IN DATA MINING

NEURAL NETWORKS IN DATA MINING NEURAL NETWORKS IN DATA MINING 1 DR. YASHPAL SINGH, 2 ALOK SINGH CHAUHAN 1 Reader, Bundelkhand Institute of Engineering & Technology, Jhansi, India 2 Lecturer, United Institute of Management, Allahabad,

More information

IBM SPSS Neural Networks 22

IBM SPSS Neural Networks 22 IBM SPSS Neural Networks 22 Note Before using this information and the product it supports, read the information in Notices on page 21. Product Information This edition applies to version 22, release 0,

More information

Healthcare Data Mining: Prediction Inpatient Length of Stay

Healthcare Data Mining: Prediction Inpatient Length of Stay 3rd International IEEE Conference Intelligent Systems, September 2006 Healthcare Data Mining: Prediction Inpatient Length of Peng Liu, Lei Lei, Junjie Yin, Wei Zhang, Wu Naijun, Elia El-Darzi 1 Abstract

More information

Prediction of Stock Performance Using Analytical Techniques

Prediction of Stock Performance Using Analytical Techniques 136 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 5, NO. 2, MAY 2013 Prediction of Stock Performance Using Analytical Techniques Carol Hargreaves Institute of Systems Science National University

More information

IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH

IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH Kalinka Mihaylova Kaloyanova St. Kliment Ohridski University of Sofia, Faculty of Mathematics and Informatics Sofia 1164, Bulgaria

More information

A Content based Spam Filtering Using Optical Back Propagation Technique

A Content based Spam Filtering Using Optical Back Propagation Technique A Content based Spam Filtering Using Optical Back Propagation Technique Sarab M. Hameed 1, Noor Alhuda J. Mohammed 2 Department of Computer Science, College of Science, University of Baghdad - Iraq ABSTRACT

More information

Decision Trees for Mining Data Streams Based on the Gaussian Approximation

Decision Trees for Mining Data Streams Based on the Gaussian Approximation International Journal of Computer Sciences and Engineering Open Access Review Paper Volume-4, Issue-3 E-ISSN: 2347-2693 Decision Trees for Mining Data Streams Based on the Gaussian Approximation S.Babu

More information

Course Syllabus. Purposes of Course:

Course Syllabus. Purposes of Course: Course Syllabus Eco 5385.701 Predictive Analytics for Economists Summer 2014 TTh 6:00 8:50 pm and Sat. 12:00 2:50 pm First Day of Class: Tuesday, June 3 Last Day of Class: Tuesday, July 1 251 Maguire Building

More information

Assessing Data Mining: The State of the Practice

Assessing Data Mining: The State of the Practice Assessing Data Mining: The State of the Practice 2003 Herbert A. Edelstein Two Crows Corporation 10500 Falls Road Potomac, Maryland 20854 www.twocrows.com (301) 983-3555 Objectives Separate myth from reality

More information

Using Data Mining for Mobile Communication Clustering and Characterization

Using Data Mining for Mobile Communication Clustering and Characterization Using Data Mining for Mobile Communication Clustering and Characterization A. Bascacov *, C. Cernazanu ** and M. Marcu ** * Lasting Software, Timisoara, Romania ** Politehnica University of Timisoara/Computer

More information

ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION

ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION ISSN 9 X INFORMATION TECHNOLOGY AND CONTROL, 00, Vol., No.A ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION Danuta Zakrzewska Institute of Computer Science, Technical

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct

More information

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear

More information

Statistics in Retail Finance. Chapter 7: Fraud Detection in Retail Credit

Statistics in Retail Finance. Chapter 7: Fraud Detection in Retail Credit Statistics in Retail Finance Chapter 7: Fraud Detection in Retail Credit 1 Overview > Detection of fraud remains an important issue in retail credit. Methods similar to scorecard development may be employed,

More information

REVIEW OF ENSEMBLE CLASSIFICATION

REVIEW OF ENSEMBLE CLASSIFICATION Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IJCSMC, Vol. 2, Issue.

More information

Customer Classification And Prediction Based On Data Mining Technique

Customer Classification And Prediction Based On Data Mining Technique Customer Classification And Prediction Based On Data Mining Technique Ms. Neethu Baby 1, Mrs. Priyanka L.T 2 1 M.E CSE, Sri Shakthi Institute of Engineering and Technology, Coimbatore 2 Assistant Professor

More information

Predictive Modeling in Workers Compensation 2008 CAS Ratemaking Seminar

Predictive Modeling in Workers Compensation 2008 CAS Ratemaking Seminar Predictive Modeling in Workers Compensation 2008 CAS Ratemaking Seminar Prepared by Louise Francis, FCAS, MAAA Francis Analytics and Actuarial Data Mining, Inc. www.data-mines.com Louise.francis@data-mines.cm

More information

Data Mining. Nonlinear Classification

Data Mining. Nonlinear Classification Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15

More information

Learning is a very general term denoting the way in which agents:

Learning is a very general term denoting the way in which agents: What is learning? Learning is a very general term denoting the way in which agents: Acquire and organize knowledge (by building, modifying and organizing internal representations of some external reality);

More information