Choosing Data-Mining Methods for Multiple Classification: Representational and Performance Measurement Implications for Decision Support


WILLIAM E. SPANGLER, JERROLD H. MAY, AND LUIS G. VARGAS

WILLIAM E. SPANGLER is an Assistant Professor in the College of Business and Economics at West Virginia University. After several years in private industry, he earned his Ph.D. in 1995 from the Katz Graduate School of Business, University of Pittsburgh, specializing in artificial intelligence. His current research interests focus on data mining and computational modeling for decision support. His work has been published in various journals, including Information and Management, Interfaces, Expert Systems with Applications, and IEEE Transactions on Knowledge and Data Engineering.

JERROLD H. MAY is a Professor of Artificial Intelligence at the Katz Graduate School of Business, University of Pittsburgh, and is also the Director of the Artificial Intelligence in Management (AIM) Laboratory there. He has more than sixty refereed publications in a variety of outlets, ranging from management journals such as Operations Research and Information Systems Research to medical ones such as Anesthesiology and Journal of the American Medical Informatics Association. Professor May's current work focuses on modeling, planning, and control problems, the solutions to which combine management science, statistical analysis, and artificial intelligence, particularly for operational tasks in health-related applications.

LUIS G. VARGAS is a Professor of Decision Sciences and Artificial Intelligence at the Katz Graduate School of Business, University of Pittsburgh, and Co-Director of the AIM Laboratory. He has published over forty papers in refereed journals such as Management Science, Operations Research, Anesthesiology, and Journal of the American Medical Informatics Association, and three books on applications of the Analytic Hierarchy Process with Thomas L. Saaty.
Professor Vargas's current work focuses on the use of operations research and artificial intelligence methods in health care environments.

Journal of Management Information Systems, Summer 1999, Vol. 16, No. 1. © 1999 M.E. Sharpe, Inc.

ABSTRACT: Data-mining techniques are designed for classification problems in which each observation is a member of one and only one category. We formulate ten data representations that could be used to extend those methods to problems in which observations may be full members of multiple categories. We propose an audit matrix methodology for evaluating the performance of three popular data-mining techniques (linear discriminant analysis, neural networks, and decision tree induction)

using the representations that each technique can accommodate. We then empirically test our approach on an actual surgical data set. Tree induction gives the lowest rate of false positive predictions, and a version of discriminant analysis yields the lowest rate of false negatives for multiple category problems, but neural networks give the best overall results for the largest multiple classification cases. There is substantial room for improvement in overall performance for all techniques.

KEY WORDS AND PHRASES: data mining, decision support systems, decision tree induction, neural networks, statistical classification.

DATA MINING IS THE SEARCH THROUGH REAL-WORLD DATA for general patterns that are useful in classifying individual observations and in making reasoned predictions about outcomes [11]. That generally entails the use of statistical methods to link a series of independent variables that collectively describe a case or observation to the dependent variable(s) that classifies the case. The set of classification problems typically includes patterns containing a single, well-defined dependent variable or category; that is, an observation is assigned to one and only one category. This research explores the less-tractable problems of multiple classification in data mining, wherein a single observation may be classified into more than one category. Because multiple classification is a significant aspect of numerous managerial tasks (including, among others, diagnosis, auditing, and scheduling), understanding the effectiveness of data-mining methods in multiple classification situations has important implications for the use of information systems for knowledge-based decision support. By multiple classification, we mean that the categories are well defined and mutually exclusive, but the observations themselves transcend categorical boundaries.
This contrasts with fuzzy clustering, in which the categories themselves are not necessarily either well defined or mutually exclusive. Consider, for example, the universe of categories that includes men, researchers, and jazz singers. In this situation, a single person can belong to any combination of these categories simultaneously, in effect having multiple membership across categories. Classifying someone in such a context requires recognizing the potential for multiple membership, while also identifying the correct categories themselves. Figures 1 and 2 show two alternative ways of pictorially representing multiple classification problems. In Figure 1, the categories are shown as distinct and mutually exclusive, with individual cases or observations transcending categories. Figure 2 shows the categories in a type of Venn diagram, with multiply classified cases appearing in the intersections between categories. In contrast to the situation we consider, when the categories themselves are poorly defined or understood, classification of observations within those categories may be correspondingly uncertain. Fuzzy clustering assigns observations with likelihoods to multiple categories, where the likelihoods sum to one.

Multiple Classification Problems: An Example

DATA MINING IN A DIAGNOSTIC SETTING IS THE SEARCH for patterns linking state descriptions with associated outcomes, with the objective of predicting an outcome

Figure 1. Observations Classified into Multiple Categories: Observation O1 is in Category C1, O2 is in C1 and C2, while O3 is in C1, C2, and C3.

Figure 2. A Multiple Classification Domain Pictured as a Venn Diagram, with Multiply Classified Observations Appearing in the Intersections

given new data showing a similar pattern. It can be characterized as an attempt to extract knowledge from data. In the medical domain, numerical codes are used to describe both what is wrong with a patient and what was done to treat the patient. The dominant patient state taxonomy is the International Classification of Diseases (ICD-9) coding system, which indicates a patient's disease or condition within a hierarchical, numerical classification scheme. The corresponding procedural taxonomy is the Current Procedural Terminology (CPT) system, which indicates the procedure(s) performed on a patient, partly in response to the ICD-9 code(s) assigned to the patient. Our empirical results are derived from 59,864 patient records, each of which includes the patient's diagnoses (ICD-9s), the procedures performed on the patient (CPTs), and patient demographic and case-specific information.
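To make the record structure concrete, a single case can be sketched as follows. This is our own hypothetical Python rendering, not the hospital system's actual schema; the field names and example values are illustrative only.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SurgicalCase:
    """One patient record: diagnoses and demographics (independent variables)
    linked to the procedures actually performed (dependent variables)."""
    icd9_codes: List[str]                # one to three diagnosis codes, e.g. "180.9"
    age: int                             # patient demographic information
    sex: str
    cpt_codes: List[str] = field(default_factory=list)  # one to three procedure codes

# An illustrative case with a single diagnosis and two procedures performed:
case = SurgicalCase(icd9_codes=["180.9"], age=52, sex="F",
                    cpt_codes=["58210", "77760"])
```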

The data-mining task is to find patterns linking patient ICD-9 and demographic information to CPT outcomes. The task is important for surgical scheduling because patient diagnoses and demographics are known before the patient enters the operating room, but the procedures that will be performed there often are not known with certainty. The surgery performed is a function of the information discovered by the physicians during the course of the operation. Data mining in this situation is a multiple-classification task because several procedures may be performed on a patient during a single operation. The patient records to which we had access contain as many as three CPTs each. The ICD-9(s) and patient demographics are the independent variables in the analysis. The surgical procedures performed on the patient (CPTs) are the dependent variables. Each set of independent variables is linked to one, two, or three CPTs. The identification of the procedures to be performed on a patient is important for a number of managerial tasks in this domain, including medical auditing, operating room scheduling, and materials management. If a surgical scheduler knew in advance the most likely sets of procedures and had models for estimating times for sets of procedures, the scheduler could plan the operating room schedule more effectively. The problem for a data-mining method in this multiple classification domain is twofold: First, the method must be able to identify the proper number of CPTs associated with a specific pattern. That is, if a set of patient factors is normally associated with two CPTs, the method should construct a pattern linking the factors to two and only two CPTs. Second, the method should identify the specific CPTs. Scoring the performance of a data-mining tool requires consideration of both of these aspects of the problem.
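Both aspects of that scoring can be sketched by comparing the predicted set of CPT codes against the set actually performed. This is a minimal illustration of the set comparison involved, not the audit matrix methodology itself; the codes used are arbitrary examples.

```python
def score_prediction(predicted, actual):
    """Compare a predicted set of CPT codes against the codes actually performed.

    False positives: codes predicted but not performed.
    False negatives: codes performed but not predicted.
    A method that predicts the wrong number of CPTs necessarily incurs
    at least one of the two error types."""
    predicted, actual = set(predicted), set(actual)
    return {
        "false_positives": len(predicted - actual),
        "false_negatives": len(actual - predicted),
        "correct": len(predicted & actual),
    }

# The method predicted two CPTs; the operation actually involved a different pair:
result = score_prediction({"58210", "77760"}, {"58210", "38562"})
# one correct, one false positive, one false negative
```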
Comparison of Data-Mining Methods

THE EXAMPLE ABOVE HINTS AT THE CHARACTERISTICS of the multiple classification problem that make it interesting as well as difficult. Our goal is to find and to propose solutions to the following associated problems of multiple classification:

Problem representation: How should a decision maker structure a model both to recognize and to identify multiple classes? Should the dependent variables be treated individually, as a series of yes/no questions related to the presence or absence of each variable, even when they occur as a group in a single record? Alternatively, should a group of dependent variables be treated as a separate entity, distinct from the individual variables that comprise the group?

Performance measurement: In multiple classification, there is potential for both false positives (i.e., assigning an observation to an incorrect class) and false negatives (i.e., not assigning an observation to a correct class). A decision maker requires a strategy for scoring the performance of various data-mining methods, based on the relative number of both types of errors and their associated costs.

Those research issues could be investigated using mathematical arguments or numerically. While a mathematical comparison would be the most definitive one, we are not aware of a methodology that would permit it to be done. Numerical research

can be done with artificially generated data or with real data. Conclusions based on empirical research are useful if the characteristics of the samples on which they are based are sufficiently similar to those of problem instances others are likely to encounter in practice. We preferred using a real data set to the use of generated data because it is representative of an important class of managerial problems, and because the noise in the data set provides us with information on the sensitivity of the approaches to dirt in a data set. Artificial data would have allowed us to carefully control the population parameters from which the data are drawn, but would have required that we first define and estimate values for all such critical population parameters. Using our real data set, we empirically compare the performance of tree and rule induction (TRI), artificial neural networks (ANN), and linear discriminant analysis (LDA) in modeling the multiple classification patterns in our data set. We chose the three methods because each is a popular data-mining method sharing a number of common characteristics while also exhibiting some notable differences (see Table 1). Weiss and Indurkhya divide data-mining algorithms into three groups: math-based methods, distance-based methods, and logic-based methods [29]. LDA is the most common math-based method, as well as the most common classification technique in general, while TRI is the most common logic-based method.

Table 1. Comparison of Data-Mining Methods

                          Decision tree induction       Neural networks          Discriminant analysis
Type of method            Logic-based                   Math-based               Math-based
Learning approach         Supervised                    Supervised               Supervised
Linearity                 Linear                        Nonlinear                Linear
Representational scheme   Set of decision nodes and     Functional relationship  Functional relationship
                          branches; production system   between attributes       between attributes
                                                        and classes              and classes
Neural networks are an increasingly popular nonlinear math-based method. Tools employing these methods are commonly available in commercial computer-based applications. All three methods are supervised learning techniques. That is, they induce rules for assigning observations to predefined classes from a set of examples, as opposed to unsupervised techniques, which both define classes and determine classification rules [20, 25]. Supervised learning techniques are appropriate for our decision problems because the classes (CPT codes) are defined exogenously and cannot be modified by the decision maker. Each of the methods we compare engages in discrete classification through a process of selection and combination of case attributes, and each employs similar validation techniques, described below. Cluster analysis and knowledge discovery methods are examples of unsupervised learning algorithms. The methods also differ, particularly in the way they model the relationships among

attributes and classes. The classification structures of LDA and ANN are expressed mathematically as a functional relationship between weighted attributes and resulting classes. TRI represents relationships as a set of decision nodes and branches, which, in turn, can be represented as a production system, or set of rules. LDA and TRI are linear approaches; ANN is nonlinear. The effectiveness of the representation generally depends on the orientation and mathematical sophistication of the user. Liang argues that, because the choice of a learning method for an application is an important problem, research into the comparison of alternative (or perhaps even complementary) methods is likewise important [19]. That is especially true for data mining, where the costs and potential benefits involved strongly motivate the proper choice of tool and method as well as the proper analysis of the results.

Tree and Rule Induction

TRI is attractive because its explicit representation of classification as a series of binary splits makes the induced knowledge structure easy to understand and validate. TRI constructs a tree, but the tree can be translated into an equivalent set of rules. We used Quinlan's See5 package, the most recent version of his ID3 algorithm [22]. ID3 induces a decision tree from a table of individual cases, each of which describes identified attributes as well as the class to which the case belongs. At each node, the algorithm builds the tree by assessing the conditional probabilities linking attributes and outcomes, and divides the subset of cases under consideration into two further subsets so as to minimize entropy, a measure of the information content of the data. The user specifies parameters that control the stopping behavior of the method.
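The entropy-minimizing split can be sketched as follows. This is a generic illustration of the criterion, not See5's implementation; the attribute name and toy cases are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_entropy(cases, attribute):
    """Weighted average entropy of the subsets induced by splitting on an attribute."""
    branches = {}
    for attrs, label in cases:
        branches.setdefault(attrs[attribute], []).append(label)
    n = len(cases)
    return sum(len(b) / n * entropy(b) for b in branches.values())

# Toy cases: ({attribute: value}, class). The tree builder would choose the
# attribute whose split yields the lowest weighted entropy.
cases = [({"general_anesthesia": True}, "A"), ({"general_anesthesia": True}, "A"),
         ({"general_anesthesia": False}, "B"), ({"general_anesthesia": False}, "B")]
# A perfect split drives the weighted entropy to zero.
```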
If the training set contains no contradictory cases (that is, cases with identical attributes that are members of different classes), a fully grown tree will produce an error rate of zero on the training set. Weiss and Kulikowski [31] show that as a tree becomes more complex, measured by the number of decision nodes it contains, the danger of overfitting the data increases, and the predictive power of the tree declines commensurately. That is, the true predictive error rate, measured by the performance of the tree on test cases, becomes much higher than the apparent error rate reflected in the performance of the tree against the training cases alone. To minimize the true error rate, See5 first grows the tree completely, and then prunes it based on a prespecified certainty factor at each node. Performance evaluation of classification methodologies is discussed in the statistical literature (for example, see [15], ch. 11). The two most common approaches are dividing the data set into training and holdout subsets before estimating the model and jackknifing. The former avoids the bias of using the same information for both creating and judging the model. However, it requires large data sets, and there is no simple way to provide a definitive rule for determining either the size or composition of the two subsets. Worse, separating a holdout sample results in the creation of a model that is not the desired one, because the removal of the holdout sample reduces the information content of the training set and may exclude cases that are critical to the estimation of an accurate or robust model. The alternative to separating the data into two groups is

jackknifing, a one-at-a-time holdout procedure due to Lachenbruch [18]. Jackknifing temporarily ignores the first observation, estimates the model using observations two through n, classifies the first, held-out observation, and notes whether it was correctly or incorrectly classified. It then puts the first observation back into the data set, ignores the second, and repeats the process. Repeating the procedure n times, jackknifing creates n different models and tallies up overall modeling performance based on the behavior of each of those models on a single omitted observation. Omitting only a single observation minimizes the loss of information to which the modeling process is exposed, but it can require a lot of computer time and does not produce a single model as its result. How do you combine n potentially very different models, and how do you interpret their evaluation when each holdout sample was tested on a different model? The TRI software package See5 includes k-fold cross-validation, an approach less extreme than either a fixed holdout sample or jackknifing. K-fold cross-validation divides the data set into k equal-sized partitions, ignores them one at a time, estimates a model, and computes its error rate on the ignored partition.

Neural Networks

Artificial neural networks simulate human cognition by modeling the inherent parallelism of neural circuits found in the brain using mathematical models of how the circuits function. The models typically are composed of a layer of input nodes (independent variables), one or more layers of intermediate (or hidden) nodes, and a layer of output nodes (dependent variables). Nodes in a layer are each connected by one-way arcs to nodes in subsequent layers, and signals are sent over those arcs. Behavior propagates from values set in the input nodes, sent over arcs through the hidden layer(s), and results in the establishment of values in the output layer.
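A minimal sketch of that feedforward propagation follows, using a logistic node function; node thresholds are omitted for brevity, and the weights are arbitrary illustrative values, not an estimated model.

```python
import math

def logistic(x):
    """Standard logistic squashing function, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, weights):
    """Propagate input values layer by layer.

    `weights` is one matrix per layer; weights[k][j][i] connects node i of the
    previous layer to node j of layer k. Each node's value is the logistic
    function of the weighted sum of its incoming signals."""
    values = inputs
    for layer in weights:
        values = [logistic(sum(w * v for w, v in zip(row, values))) for row in layer]
    return values

# Two inputs, one hidden layer of two nodes, one output node (arbitrary weights):
out = forward([1.0, 0.0], weights=[[[0.5, -0.5], [1.0, 1.0]],   # input -> hidden
                                   [[2.0, -2.0]]])              # hidden -> output
```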
The value of a node is a nonlinear, usually logistic, function of the weighted sum of the values sent to it by nodes that are connected to it. A node forwards a signal to a subsequent node only if it exceeds a threshold value. An ANN model is specified by defining the number of layers it has, the number of nodes in each layer, the way in which the nodes are connected, and the nonlinear function used to compute node values. Estimation of the specified model involves determining the best set of weights for the arcs and threshold values for the nodes. An ANN is trained (that is, its parameters are estimated) using nonlinear optimization. In the backpropagation algorithm, the first-order gradient descent method implemented in the BrainMaker software we used, the network propagates inputs through the network, derives a set of output values, compares the computed output to the provided (corresponding) output, and calculates the difference between the two numbers (i.e., the error). If a difference exists, the algorithm proceeds backward through the hidden layer(s) to the input layer, adjusting the weights between connections based on their gradients to reduce the sum of squared output errors. The algorithm stops when the total error is acceptably small. Neural networks are frequently used in data mining because, in adjusting the number of layers, nodes, and connections, the user can make an ANN model almost any smooth

mathematical function. While inputs to an ANN might be integer or discrete, the weighted nonlinear transformations of the inputs as part of their being fed forward through the network result in continuous output level values. Continuous output levels result in a more tractable error measure for the backpropagation algorithm to optimize and also permit the interpretation of outputs as partial group membership. Partial group membership means an ANN is capable of representing inexact matching, if that is the way to find a best fit for some set of input data. It also can model classification tasks that are inherently fuzzy, that is, tasks that generally are simple for humans but traditionally difficult for computers. Because of their flexibility, ANNs may be difficult to specify. Adding too much structure to an ANN makes it prone to overfitting, but too little structure may prevent it from capturing the patterns in the data set. Those patterns are represented in the arc (connection) weights and the node thresholds, a form that is not transparent to humans. Computationally, if the training set is large, backpropagation and related algorithms may require a lot of time. ANNs may be good classifiers and predictors as compared with linear methods, but the mathematical representations of the various nodes, and the relative importance of the independent variables, tend to be somewhat less accessible to the end user than induced decision trees, rules, and even classification functions. Neural networks often are treated as a black box, with only the inputs and outputs visible to the decision maker. The classification chosen by the ANN is easily visible to the user, but the decision process that led to that classification is not.

Linear Discriminant Analysis

Linear discriminant analysis (LDA) is the most common classification method in use, and also one of the oldest [13], having been developed by Fisher in the 1930s.
Because of its popularity and long history, we provide only a brief overview of the method here. Like TRI and ANN, LDA partitions a data set into two or more classes, to which new observations or cases can then be assigned. Because it uses linear functions of the independent variables to define those partitions, LDA is similar to multiple regression. The primary distinction between LDA and multiple regression lies in the form of the dependent variable. Multiple regression uses a continuous dependent variable. The dependent variable in LDA is ordinal or nominal. For a data set of cases, each with m attributes and n categories, LDA constructs classification functions of the form

c_1 a_1 + c_2 a_2 + ... + c_m a_m + c_0,

where c_i is the coefficient for the case attribute a_i and c_0 is a constant, for each of the n categories. An observation is assigned to the class for which it has the highest classification function value.
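That assignment rule can be sketched directly. The two categories and their coefficients below are made-up illustrative values, not estimated from any data set.

```python
def classify(attributes, functions):
    """Assign an observation to the category whose classification function
    c_1*a_1 + ... + c_m*a_m + c_0 gives the highest value.

    `functions` maps each category name to (coefficients, constant)."""
    def score(coeffs, const):
        return sum(c * a for c, a in zip(coeffs, attributes)) + const
    return max(functions, key=lambda k: score(*functions[k]))

# Two hypothetical categories over two attributes (arbitrary coefficients):
functions = {"C1": ([2.0, -1.0], 0.5),
             "C2": ([0.5, 1.5], -0.2)}
label = classify([1.0, 0.0], functions)
```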

Table 2. Comparative Data-Mining Method Studies Across Domains. The table cross-tabulates study citations by method (neural networks, tree/rule induction, and regression) across the domains of asset writedowns, bankruptcy, bank failure, inventory accounting, lending/credit risk, corporate acquisitions, corporate earnings, management fraud, and mortgage choice; the bankruptcy studies, for example, include [2, 6, 8, 14, 19] and [5, 19, 21, 26].

Studies of Supervised Learning Approaches

PREVIOUS RESEARCH HAS INVESTIGATED AND COMPARED SUPERVISED, inductive-learning techniques in a number of domains (see Table 2), with mixed results. Some comparative studies suggest the superiority of neural networks over other techniques. For example, in bankruptcy prediction, Tam and Kiang [27] found that neural networks performed better than discriminant analysis, logit analysis, k-nearest neighbor, and tree induction (ID3). Fanning and Cogger [8] were somewhat more tentative in comparing neural-network models with both logistic regression and existing bankruptcy models. Although they found no particular technique to be superior across all comparisons, they argued that neural nets were competitive with, and often superior to, the logit and bankruptcy model results. In a subsequent study of fraudulent financial statements, Fanning and Cogger [9] reported that a neural network was better able than the traditional statistical methods to identify management fraud. By contrast, previous comparative studies had shown decision tree/rule induction to be superior to other methods. Messier and Hansen [21], for example, compared decision tree/rule induction with discriminant analysis, as well as with individual and group judgments. On the basis of the attributes selected and the percentage of correct predictions, they concluded that the induction technique outperformed the other approaches in the prediction of bankruptcies.
Weiss and Kapouleas [30] compared statistical pattern recognition (linear and quadratic discriminant analysis, nearest neighbor, and Bayesian classification), neural networks, and machine learning methods (rule/decision tree induction methods: ID3/C4.5 and Predictive Value Maximization). They concluded that the rule induction methods were superior to the other methods with respect to accuracy of classification, training time, and compatibility with human reasoning. Other multiple method studies have been less conclusive and suggest that performance is dependent on other factors such as the type of task and the nature of the data

set. Chung and Tam [6], for example, compared three inductive-learning models across five managerial tasks (in construction project assessment and bankruptcy prediction). They concluded that model performance generally was task-dependent, although neural networks tended to produce relatively consistent predictions across task domains. In assessing LIFO/FIFO classification methods, Liang et al. [20] reported that neural networks tended to perform best overall in holdout tests, and when the data contained dominant nominal variables. However, when nominal variables were not dominant, probit provided better performance. Sen and Gibbs [24] studied corporate takeover models, comparing six neural network models and logistic regression. They found little difference in predictive performance among them, indicating that they all performed poorly. Boritz et al. [2] tested the performance of neural networks with several regression techniques, as well as with well-known bankruptcy models. No approach was clearly superior, and the ability of an induced model to distinguish between bankrupt and nonbankrupt firms was dependent on the number of bankrupt firms in the training set.

Bases for Judging Performance

JUDGING THE PERFORMANCE OF ONE DATA-MINING METHOD over another requires consideration of several modeling objectives.

Predictive Accuracy

Most of the comparative studies we cited above measured the predictive accuracy and error rate of each method. Messier and Hansen [21], for example, compared the percentage of correct classifications produced by their induced rule system to the percentage drawn from discriminant analysis, as well as individual and group judgments. As suggested by the review above, it is difficult to make general claims about the relative predictive accuracy of the various methods.
Performance is highly dependent on the domain and setting, the size and nature of the data set, the presence of noise and outliers in the data, and the validation technique(s) used. Predictive accuracy tends to be an important and prevalent indication of a method's performance, but others also are important.

Comprehensibility

Henery [13] uses this term to indicate the need for a classification method to provide clearly understood and justifiable decision support to a human manager or operator. TRI systems, because they explicitly structure the reasoning underlying the classification process, tend to have an inherent advantage over both traditional statistical classification models and ANN. Tessmer et al. [28] argue that, while the traditional statistical methods provide efficient predictive accuracy, they do not provide an explicit description of the classification process. Weiss and Kulikowski [31] suggest that any explanation resident in mathematical inferencing techniques is buried in

computations that are inaccessible to the mathematically uninclined. The results of such techniques might be misunderstood and misused. Rules and decision trees, on the other hand, are more compatible with human reasoning and explanations.

Speed of Training and Classification

Speed can be an important consideration in some situations [31]. Henery [13] suggests that a number of real-time applications, for example, must sacrifice some accuracy in order to classify and process items in a timely fashion. Again, because of situational dependencies, it is difficult to make generalizations about the computational expense of each method. ANNs estimated using backpropagation may require an unacceptably large amount of time [31].

Modeling and Simulation of Human Decision Behavior

Using case descriptions and human judgments as input, data-mining methods also can be used for the automated modeling and acquisition of expert knowledge. Kim et al. [16] determined that the performance of a particular method in constructing an inductive model of a human decision strategy is dependent, in part, on conformance of the model with the strategy. Linear models tend to simulate linear (or compensatory) decision strategies more accurately, while nonlinear models are more appropriate for nonlinear (or noncompensatory) strategies. Kim et al. found ANN to be superior to decision tree induction (ID3), logistic regression, and discriminant analysis, even in simulations of linear decision processes. They note that the flexibility of neural networks in forming both linear and nonlinear decision models contributes to their superior performance relative to the other methods.

Selection of Attributes

The attributes selected for consideration and their relative influence on the outcome are an indication of the performance of a method. The concept of diagnostic validity of induction methods was proposed by Currim et al.
[7], and was used by Messier and Hansen [21] to compare the attributes selected by each of their induction methods.

Data Set and Tools

OUR DATA SET IS A CENSUS OF 59,864 CASES OF SURGERY PERFORMED between 1989 and 1995 at a large university teaching hospital (see [1] for a description of the computerized collection system). Each case is represented as a single record containing twenty-three attributes describing patient demographic and case-specific information. Of the twenty-three factors available, the following were chosen for analysis: diagnoses (one, two, or three ICD-9 codes), procedures (one, two, or three CPT codes), type of anesthesia (general, monitor, regional), patient demographic information (age, sex), the patient's overall condition (based on the six-value ASA ordinal coding

scheme), emergency/nonemergency status, inpatient/outpatient status, and the surgeon who performed the procedure (identified by number). The remaining fields are the time durations of surgical events. We chose 819 records dealing with ICD-9 code 180.9, malignant neoplasm of the cervix, unspecified, because there is a fairly large fanout from it to the associated CPTs across the records, presenting a challenge to any classification method. ICD-9 180.9 is associated with 139 different CPTs, although 107 of the CPTs appear in four or fewer records. Because the presence of outliers impedes the detection of general patterns, we followed the standard data-mining approach of removing them. Of the 819 records containing ICD-9 180.9, 160 records contained one of the 107 CPTs. Those records were judged to be outliers and were removed, leaving 659 records linked to a total of 32 CPTs remaining in the data set. Table 3 provides a detailed description of each of the 32 remaining CPTs. We used commercial software instead of programming the methods ourselves, to eliminate possible bias caused by our own computer skills. We used Statgraphics version 3.1 for LDA, BrainMaker version 3.1 for ANN, and See5 version 1.05 for TRI.

Methodology

WE IDENTIFIED TEN DISTINCT WAYS OF REPRESENTING THE MULTIPLE CLASSIFICATION problem. As shown in Table 4, not all methods are capable of estimating parameters for each of the representations. Our strategy was to evaluate each method from a decision support perspective. That is, how does a method fundamentally constrain the types of representation that can (and should) be employed by a person using the method?

Discriminant Analysis

Six LDA models were constructed, three basic representations with two variations on each. The three basic representations (multiple, replicated, and binary) differ in their treatment of the dependent variables; recall that each case can be a member of one, two, or three classes.
For each basic representation, two variations on the treatment of prior probabilities were included: (1) prior probabilities for each group are assumed to be equal, and (2) prior probabilities for each group are assumed to be proportional to the number of observations in each group. The basic representations and variations are described below.

Dependent Variables Represented as Multiple Values (LDAMult)

The dependent variable is a string containing all of a record's CPTs, space delimited. For example, a record containing only one CPT has that code alone as its dependent variable, while a record containing three CPTs (such as 58210 and 77760 together with a third code) has all three codes, space delimited, as its dependent variable. Because one dependent variable is used for all CPT codes present, a single linear discriminant analysis could be performed for each of the two variations:

equal prior probabilities (LDAMultE) and prior probabilities proportional to the sample (LDAMultP). The advantage of LDAMult is that each observation is represented exactly once. The disadvantage is its inability to represent class intersections. An observation that is a member of both categories a and b (i.e., dependent variable = "a b") is considered to be completely separate from observations that are members of only a or only b.

Table 3. Top 32 CPTs and Their Descriptions (CPT codes and frequencies omitted)

- Placement of central venous catheter (subclavian, jugular, or other vein), e.g., for central venous pressure, hyperalimentation, hemodialysis, or chemotherapy; percutaneous, over age 2
- Biopsy or excision of lymph nodes, superficial (separate procedure)
- Limited lymphadenectomy for staging (separate procedure), pelvic and para-aortic
- Limited lymphadenectomy for staging (separate procedure), retroperitoneal (aortic and/or splenic)
- Retroperitoneal transabdominal lymphadenectomy, extensive, including pelvic, aortic, and renal nodes (separate procedure)
- Enterectomy, resection of small intestine, single resection and anastomosis
- Proctosigmoidoscopy, rigid, diagnostic, with or without collection of specimens by brushing or washing (separate procedure)
- Cholecystectomy
- Exploratory laparotomy, exploratory celiotomy with or without biopsy (separate procedure)
- Exploration, retroperitoneal area with or without biopsy (separate procedure)
- Cystourethroscopy (separate procedure)
- Cystourethroscopy, with biopsy
- Cystourethroscopy, with insertion of indwelling ureteral stent (e.g., Gibbons or double-J type)
- Laparoscopy, surgical, with retroperitoneal lymph node sampling (biopsy), single or multiple
- Biopsy of vaginal mucosa, simple (separate procedure)
- Pelvic examination under anesthesia
- Biopsy, single or multiple, or local excision of lesion, with or without fulguration (separate procedure)
- Cauterization of cervix, laser ablation
- Conization of cervix, with or without fulguration, with or without dilation and curettage, with or without repair, cold knife or laser
- Total abdominal hysterectomy (corpus and cervix), with or without removal of tubes, with or without removal of ovaries
- Total abdominal hysterectomy, including partial vaginectomy, with para-aortic and pelvic lymph node sampling, with or without removal of tubes, with or without removal of ovaries

Table 3 (continued)

- Radical abdominal hysterectomy, with bilateral total pelvic lymphadenectomy and para-aortic lymph node sampling (biopsy), with or without removal of tubes, with or without removal of ovaries
- Pelvic exenteration for gynecologic malignancy, with total abdominal hysterectomy or cervicectomy, with or without removal of tubes, with or without removal of ovaries, with removal of bladder and ureteral transplantations, and/or abdominoperineal resection of rectum and colon and colostomy, or any combination thereof
- Vaginal hysterectomy
- Salpingo-oophorectomy, complete or partial, unilateral or bilateral (separate procedure)
- Laparotomy, for staging or restaging of ovarian malignancy (second look), with or without omentectomy, peritoneal washing, biopsy of abdominal and pelvic peritoneum, diaphragmatic assessment with pelvic and limited para-aortic lymphadenectomy
- Unlisted procedure, female genital system (nonobstetrical)
- Intracavitary radioelement application, simple
- Intracavitary radioelement application, intermediate
- Intracavitary radioelement application, complex
- Interstitial radioelement application, intermediate
- Interstitial radioelement application, complex

Dependent Variables Represented as Single Values, Multiple Values Replicated (LDARep)

A single record containing multiple values for the dependent variable is decomposed into multiple records, one for each value of the dependent variable. A record containing three CPTs (e.g., 58210, 77760, and a third code) is represented three times in the data set, once with each CPT as its dependent variable; all three records have the same independent variable values (see figure 3).
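As a concrete sketch of the two encodings described so far, the LDAMult string label and the LDARep replication might be implemented as follows. This is our illustration in Python (the study itself used commercial packages), and the sorted ordering of codes within the LDAMult string is our assumption; the feature values are invented.

```python
def ldamult_label(cpts):
    """LDAMult: collapse all of a record's CPT codes into a single
    space-delimited, string-valued dependent variable."""
    # Sorting is an assumption, so the same code set always yields
    # the same class label regardless of input order.
    return " ".join(str(c) for c in sorted(cpts))

def ldarep_records(features, cpts):
    """LDARep: decompose one multi-CPT record into one record per CPT,
    duplicating the independent-variable values."""
    return [(features, cpt) for cpt in cpts]

features = ("age=45", "sex=F", "anesthesia=general")  # illustrative values
cpts = [58210, 77760, 38562]

print(ldamult_label(cpts))                  # "38562 58210 77760": one record, one class
print(len(ldarep_records(features, cpts)))  # 3: one replicated record per CPT
```

Note that the string "38562 58210 77760" is a class unrelated to the class "58210", which is exactly the intersection problem described above.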
In this representation, because only the relative sizes of the classification function values are meaningful, a single-step estimation process provides a rank order for membership in each of the categories but does not provide any insight regarding the number of categories into which the observation is to be classified (unlike neural nets or logistic regression, for which 0.5 is a commonly accepted threshold). Therefore, a two-step process is required: first, use an LDA model to estimate the number of categories to which the observation belongs, and then use a separate LDA to determine what those categories are. LDARepE is that two-stage process with equal prior probabilities for both parts of the process, and LDARepP is the corresponding technique using proportional probabilities. The advantage of LDARep is that it recognizes an observation that is a member of

more than one class as being in each of those classes individually. The disadvantage is that the representation does not differentiate a single observation that is simultaneously in multiple classes, in which the replication of its independent variable values is a representational necessity, from multiple observations with identical independent variables that are, however, members of different classes. That is, a set intersection and a contradiction have the same representation.

Table 4. Model Representations Across Methods

- Discriminant analysis: multiple (equal, proportional); replicated (equal, proportional); binary (equal, proportional)
- Neural networks: binary, with no hidden layer; one hidden layer, 57 nodes; one hidden layer, 114 nodes
- Tree/rule induction: multiple

Figure 3. Dependent Variables Represented as Single Values, Multiple Values Replicated (LDARep)

Figure 4. Dependent Variables Represented as Single Binary Values (LDABin)

Dependent Variables Represented as Single Binary Values (LDABin)

The dependent variable is represented as a series of binary values, one for each possible value of the dependent variable (see figure 4). An observation is considered a member of a category if its classification function value for assigning membership is larger than its classification function value for not assigning membership. This representation requires a separate LDA model for each class, thirty-two in the case of our data set. LDABinE is this approach with equal probabilities, and LDABinP has proportional probabilities.
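A minimal sketch of the LDABin encoding, again our own Python illustration with an invented class list and record:

```python
def ldabin_labels(cpts, all_cpts):
    """LDABin: one 0/1 dependent variable per possible CPT; each
    indicator is then fitted by its own discriminant model
    (thirty-two separate models for this data set)."""
    present = set(cpts)
    return {c: int(c in present) for c in all_cpts}

# A record belonging to two of three possible classes:
labels = ldabin_labels([58210, 77760], [38562, 58210, 77760])
print(labels)  # {38562: 0, 58210: 1, 77760: 1}
```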

The advantages of this approach are that subset relationships are preserved and that each observation occurs only once in the data set, so that intersections are represented differently from contradictions. Its disadvantage is that an individual observation might be a member of no classes or of too many classes (for our data, more than three).

Neural Networks

The ANN representation of the dependent variable is the same as in LDABin, because of the way values are propagated through the network to the output nodes. The variations in the model are functions of the structure of the hidden layer(s). One hidden layer is all that needs to be considered, if structure between the input and output layers is desired, but the number of nodes in it is a matter of choice [31]. With a hidden layer and a sigmoid function for combining activations, the ANN performs logistic regression. Deferring to commercial software, we allowed BrainMaker to suggest the size of a hidden layer. With 25 input nodes and 32 output nodes, it recommended 57, their sum. We also considered a network with twice that many nodes in the hidden layer, a structure that might tend to overfit the data. The modeling alternatives are: (1) a neural network with no hidden layer (NN0), (2) a network with 57 nodes in the hidden layer (NN57), and (3) a network with 114 nodes in the hidden layer (NN114). Activation at an output node, interpreted as degree of membership, ranges from zero to one. An observation is considered a member of any group for which it generates an activation value above 0.5, and is considered not a member of any group for which it generates an activation level below 0.5. All three ANN models have the same advantages and disadvantages as LDABin. The ANN model representation is preferable to that of LDABin because LDABin requires, in our case, thirty-two separate binary models, whereas ANN simultaneously models all thirty-two binary alternatives in a single model.
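The sizing rule and the 0.5 membership threshold can be sketched as follows (the activation values are invented for illustration):

```python
def suggested_hidden_size(n_inputs, n_outputs):
    # BrainMaker's suggestion in this study: the sum of the input
    # and output node counts (25 + 32 = 57).
    return n_inputs + n_outputs

def memberships(activations, threshold=0.5):
    """Treat each output activation (0..1) as a degree of membership;
    any output node above the threshold is a predicted class."""
    return sorted(cls for cls, a in activations.items() if a > threshold)

assert suggested_hidden_size(25, 32) == 57
print(memberships({58210: 0.91, 77760: 0.48, 38562: 0.62}))  # [38562, 58210]
```

Unlike LDABin's thirty-two separate models, the thresholding here reads all class memberships off the output layer of one network.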
Decision Tree/Rule Induction

We used a single TRI representation, analogous to LDAMult. Multiple dependent variables in each record were represented collectively within a single string ("a b c"). A representation similar to that of LDARep is not possible because See5 trees do not rank-order the classification alternatives. See5 does allow for differential misclassification costs, but those costs cannot represent equal and proportional prior probabilities in a way equivalent to that in LDA. The advantages and disadvantages of LDAMult apply to our See5 representation. The model has the advantage of constraining the number of predicted categories to between 1 and 3, inclusive, and the disadvantage of not recognizing the intersections of classes.

Results and Analysis

COMPARISON OF THE METHODS REQUIRES THE CONSIDERATION OF a number of issues related to the measurement of performance in multiple classification problems. For

example, consider a single test case having two values for the dependent variable ("1 2"). If the method predicts the value "1 2", there is little doubt that the method has performed without error. If the method predicts the value "3", it is incorrect for three reasons. First, it failed to recognize the case as having multiple classes. Second, it failed to include either 1 or 2 in its predicted value for the dependent variable. Third, it included an incorrect value (i.e., "3") in its prediction. If the method predicts "1 3", it has identified the correct number of classes (2), while also correctly identifying one of the values but incorrectly identifying the other. If the method predicts "1 2 3", it has identified both of the correct classes, but it also has predicted the wrong number of classes, and in doing so has included a class that is incorrect. If it predicts "1", it has predicted the wrong number of classes. However, the class it has predicted is correct, and it refrained from predicting any incorrect classes.

The above list of error alternatives argues for what we call an audit matrix within which multiple classification results can be judged. We use the term audit to denote a situation assessment, performed by an auditor or decision maker, which attempts to reconcile the observed characteristics of a situation with a priori expectations of that situation (i.e., actual versus predicted characteristics; see figure 5). That is, an auditor, when initially encountering a situation, will expect to observe certain characteristics while also expecting not to observe others. Subsequently, if expected characteristics are observed and unexpected characteristics are not observed, the situation matches expectations and the classification is judged correct.
However, predictions can vary from the observed situation in two important ways: (1) the auditor might expect (or predict) a characteristic that is not present (i.e., a false positive), or (2) the auditor might observe a characteristic that had not been predicted (i.e., a false negative). An audit matrix has four cells, two of which indicate correct behavior of a method and two of which indicate incorrect behavior. The two correct cells are the number of classes predicted to be present and actually observed and the number predicted to be absent and actually absent. The incorrect cells are the number of classes predicted to be present but actually absent, and the number of classes predicted to be absent that actually were present. For example, figure 6 shows an audit matrix for an observation that was predicted to be a member of class 3, but that actually is a member of classes 1 and 2.

Reduction of an audit matrix to a single number could be done with a weighted linear function of the cell values, such as

classification score = W_m * M - (W_fp * FP + W_fn * FN),

where W_m, W_fp, and W_fn are weights assigned to the number of matches, false positives, and false negatives, respectively, and M is the number of matches, FP is the number of false positives, and FN is the number of false negatives. The weights would be application-specific and would relate to the relative costs of the two types of misclassification.

Table 5 illustrates our approach using the results from the ten models for a single observation that was a member of three CPT classes, including 38500 and 38562 (shown as value = 1 in the column labeled "actual"). The columns labeled 1 through 10 under the header "representation" contain the output of each model given the values of the

independent variables of the observation (1 = CPT is predicted; 0 = CPT is not predicted). For example, LDABinE predicts that seven CPTs would be associated with this particular observation, but is incorrect on all seven (7 false positives, 3 false negatives, and 0 matches). NN57 predicts 5 CPTs and is correct on two (3 false positives, 1 false negative, 2 matches).

Figure 5. An Audit Matrix for Structuring and Evaluating Multiple Classification Results

                          Actual: present     Actual: not present
Predicted: present        Match               False positive
Predicted: not present    False negative      Match

Figure 6. An Audit Matrix Example for Actual = "1 2" and Predicted = "3"

                          Actual: present     Actual: not present
Predicted: present        (none)              3
Predicted: not present    1, 2                (none)

Table 6 summarizes the results for all 659 cases in the data set. In the absence of a value function for relative misclassification costs, the performance of each of the representations can be measured by (1) the number of correct predictions relative to the number of misclassifications and (2) the relative number of false positives and false negatives. Table 7 compares the correct predictions (represented as a percentage: [Observed & Predicted] / Sum[misclassifications] * 100) across each of the representations, for each type of case (i.e., those cases containing 1, 2, and 3 CPTs) and for all cases. As shown, 5 of the 10 representations are able to classify accurately over 50 percent of the single-CPT cases. Although NN57 is the most accurate model overall (92.41 percent), all of the neural network models as a group outperform the other representations. However, the performance of all representations deteriorates dramatically when classifying the multiple dependent variable cases. In the 2-CPT cases, LDARepP has the highest accuracy, but is still exceptionally poor (6.25 percent). In the 3-CPT cases, LDABinP is highest with 7.62 percent. Notably, the neural network models were among the poorest performers.
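The audit matrix and the weighted classification score can be sketched as follows. This is our Python illustration; the unit weights are placeholders for the application-specific costs described in the text.

```python
def audit_cells(actual, predicted, n_classes=32):
    """Audit-matrix cells for one observation: matches, false
    positives, false negatives, and correct absences."""
    actual, predicted = set(actual), set(predicted)
    m = len(actual & predicted)    # predicted present, actually present
    fp = len(predicted - actual)   # predicted present, actually absent
    fn = len(actual - predicted)   # predicted absent, actually present
    tn = n_classes - m - fp - fn   # predicted absent, actually absent
    return m, fp, fn, tn

def classification_score(m, fp, fn, w_m=1.0, w_fp=1.0, w_fn=1.0):
    # Weighted linear reduction of the audit matrix to a single number.
    return w_m * m - (w_fp * fp + w_fn * fn)

# Figure 6's example: predicted class 3, actual classes 1 and 2.
m, fp, fn, tn = audit_cells({1, 2}, {3})
print(m, fp, fn)                        # 0 1 2
print(classification_score(m, fp, fn))  # -3.0
```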
The relative performance of the representations is also reflected in the number and nature of the classification errors. Those include the misclassification rate (the number of misclassifications divided by the number of observations in each group) and the proportion of false

negatives and false positives, which are important in considering the relative costs of misclassifying observations. Figures 7 through 10 graphically show the relative number of false negatives and false positives for all cases, and for cases with 3 CPTs, 2 CPTs, and 1 CPT, respectively. Figure 11 shows results for all multiple-CPT cases combined.

Table 5. Classification and Misclassification Results for an Observation Containing Three Dependent Variables*

(columns: representations 1 through 10; rows: CPT, actual, total predicted, total correct; cell values not preserved in this transcription)

* Each column corresponds to one of the ten representations: 1 = LDABinE; 2 = LDABinP; 3 = LDAMultE; 4 = LDAMultP; 5 = LDARepE; 6 = LDARepP; 7 = See5; 8 = NN0; 9 = NN57; 10 = NN114.
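The error measures just described, the misclassification rate and the split between false positives and false negatives, can be sketched as follows (the counts are invented for illustration):

```python
def misclassification_rate(n_misclassified, n_observations):
    # Misclassifications divided by the number of observations in the group.
    return n_misclassified / n_observations

def fp_fn_proportions(fp, fn):
    """Share of false positives vs. false negatives among all errors,
    relevant when the two kinds of error carry different costs."""
    total = fp + fn
    return fp / total, fn / total

print(misclassification_rate(25, 100))  # 0.25
print(fp_fn_proportions(3, 1))          # (0.75, 0.25)
```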

Table 6. Audit Matrix of Results from Each Method and Representation

(rows: LDARepE, LDARepP, LDABinE, LDABinP, LDAMultE, LDAMultP, NN0, NN57, NN114, See5, grouped by all CPTs, 3 CPTs, 2 CPTs, and 1 CPT; columns: misclassifications (observed & not predicted; not observed & predicted) and matches (observed & predicted; not observed & not predicted); cell values not preserved in this transcription)


More information

Lecture 6. Artificial Neural Networks

Lecture 6. Artificial Neural Networks Lecture 6 Artificial Neural Networks 1 1 Artificial Neural Networks In this note we provide an overview of the key concepts that have led to the emergence of Artificial Neural Networks as a major paradigm

More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

More information

A Property & Casualty Insurance Predictive Modeling Process in SAS

A Property & Casualty Insurance Predictive Modeling Process in SAS Paper AA-02-2015 A Property & Casualty Insurance Predictive Modeling Process in SAS 1.0 ABSTRACT Mei Najim, Sedgwick Claim Management Services, Chicago, Illinois Predictive analytics has been developing

More information

HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION

HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION Chihli Hung 1, Jing Hong Chen 2, Stefan Wermter 3, 1,2 Department of Management Information Systems, Chung Yuan Christian University, Taiwan

More information

Lecture 10: Regression Trees

Lecture 10: Regression Trees Lecture 10: Regression Trees 36-350: Data Mining October 11, 2006 Reading: Textbook, sections 5.2 and 10.5. The next three lectures are going to be about a particular kind of nonlinear predictive model,

More information

A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier

A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier G.T. Prasanna Kumari Associate Professor, Dept of Computer Science and Engineering, Gokula Krishna College of Engg, Sullurpet-524121,

More information

Role of Neural network in data mining

Role of Neural network in data mining Role of Neural network in data mining Chitranjanjit kaur Associate Prof Guru Nanak College, Sukhchainana Phagwara,(GNDU) Punjab, India Pooja kapoor Associate Prof Swami Sarvanand Group Of Institutes Dinanagar(PTU)

More information

DECISION TREE ANALYSIS: PREDICTION OF SERIOUS TRAFFIC OFFENDING

DECISION TREE ANALYSIS: PREDICTION OF SERIOUS TRAFFIC OFFENDING DECISION TREE ANALYSIS: PREDICTION OF SERIOUS TRAFFIC OFFENDING ABSTRACT The objective was to predict whether an offender would commit a traffic offence involving death, using decision tree analysis. Four

More information

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets http://info.salford-systems.com/jsm-2015-ctw August 2015 Salford Systems Course Outline Demonstration of two classification

More information

Electronic Payment Fraud Detection Techniques

Electronic Payment Fraud Detection Techniques World of Computer Science and Information Technology Journal (WCSIT) ISSN: 2221-0741 Vol. 2, No. 4, 137-141, 2012 Electronic Payment Fraud Detection Techniques Adnan M. Al-Khatib CIS Dept. Faculty of Information

More information

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM Thanh-Nghi Do College of Information Technology, Cantho University 1 Ly Tu Trong Street, Ninh Kieu District Cantho City, Vietnam

More information

Classification and Prediction

Classification and Prediction Classification and Prediction Slides for Data Mining: Concepts and Techniques Chapter 7 Jiawei Han and Micheline Kamber Intelligent Database Systems Research Lab School of Computing Science Simon Fraser

More information

On the effect of data set size on bias and variance in classification learning

On the effect of data set size on bias and variance in classification learning On the effect of data set size on bias and variance in classification learning Abstract Damien Brain Geoffrey I Webb School of Computing and Mathematics Deakin University Geelong Vic 3217 With the advent

More information

What is Data Mining, and How is it Useful for Power Plant Optimization? (and How is it Different from DOE, CFD, Statistical Modeling)

What is Data Mining, and How is it Useful for Power Plant Optimization? (and How is it Different from DOE, CFD, Statistical Modeling) data analysis data mining quality control web-based analytics What is Data Mining, and How is it Useful for Power Plant Optimization? (and How is it Different from DOE, CFD, Statistical Modeling) StatSoft

More information

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL The Fifth International Conference on e-learning (elearning-2014), 22-23 September 2014, Belgrade, Serbia BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL SNJEŽANA MILINKOVIĆ University

More information

Experiments in Web Page Classification for Semantic Web

Experiments in Web Page Classification for Semantic Web Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address

More information

Analecta Vol. 8, No. 2 ISSN 2064-7964

Analecta Vol. 8, No. 2 ISSN 2064-7964 EXPERIMENTAL APPLICATIONS OF ARTIFICIAL NEURAL NETWORKS IN ENGINEERING PROCESSING SYSTEM S. Dadvandipour Institute of Information Engineering, University of Miskolc, Egyetemváros, 3515, Miskolc, Hungary,

More information

Operations Research and Knowledge Modeling in Data Mining

Operations Research and Knowledge Modeling in Data Mining Operations Research and Knowledge Modeling in Data Mining Masato KODA Graduate School of Systems and Information Engineering University of Tsukuba, Tsukuba Science City, Japan 305-8573 koda@sk.tsukuba.ac.jp

More information

Meta-learning. Synonyms. Definition. Characteristics

Meta-learning. Synonyms. Definition. Characteristics Meta-learning Włodzisław Duch, Department of Informatics, Nicolaus Copernicus University, Poland, School of Computer Engineering, Nanyang Technological University, Singapore wduch@is.umk.pl (or search

More information

The Data Mining Process

The Data Mining Process Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data

More information

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland Data Mining and Knowledge Discovery in Databases (KDD) State of the Art Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland 1 Conference overview 1. Overview of KDD and data mining 2. Data

More information

Neural network models: Foundations and applications to an audit decision problem

Neural network models: Foundations and applications to an audit decision problem Annals of Operations Research 75(1997)291 301 291 Neural network models: Foundations and applications to an audit decision problem Rebecca C. Wu Department of Accounting, College of Management, National

More information

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore. CI6227: Data Mining Lesson 11b: Ensemble Learning Sinno Jialin PAN Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore Acknowledgements: slides are adapted from the lecture notes

More information

A New Approach for Evaluation of Data Mining Techniques

A New Approach for Evaluation of Data Mining Techniques 181 A New Approach for Evaluation of Data Mining s Moawia Elfaki Yahia 1, Murtada El-mukashfi El-taher 2 1 College of Computer Science and IT King Faisal University Saudi Arabia, Alhasa 31982 2 Faculty

More information

IBM SPSS Direct Marketing 23

IBM SPSS Direct Marketing 23 IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release

More information

ECLT5810 E-Commerce Data Mining Technique SAS Enterprise Miner -- Regression Model I. Regression Node

ECLT5810 E-Commerce Data Mining Technique SAS Enterprise Miner -- Regression Model I. Regression Node Enterprise Miner - Regression 1 ECLT5810 E-Commerce Data Mining Technique SAS Enterprise Miner -- Regression Model I. Regression Node 1. Some background: Linear attempts to predict the value of a continuous

More information

6.2.8 Neural networks for data mining

6.2.8 Neural networks for data mining 6.2.8 Neural networks for data mining Walter Kosters 1 In many application areas neural networks are known to be valuable tools. This also holds for data mining. In this chapter we discuss the use of neural

More information

An Overview of Knowledge Discovery Database and Data mining Techniques

An Overview of Knowledge Discovery Database and Data mining Techniques An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,

More information

SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING

SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING AAS 07-228 SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING INTRODUCTION James G. Miller * Two historical uncorrelated track (UCT) processing approaches have been employed using general perturbations

More information

COMBINING THE METHODS OF FORECASTING AND DECISION-MAKING TO OPTIMISE THE FINANCIAL PERFORMANCE OF SMALL ENTERPRISES

COMBINING THE METHODS OF FORECASTING AND DECISION-MAKING TO OPTIMISE THE FINANCIAL PERFORMANCE OF SMALL ENTERPRISES COMBINING THE METHODS OF FORECASTING AND DECISION-MAKING TO OPTIMISE THE FINANCIAL PERFORMANCE OF SMALL ENTERPRISES JULIA IGOREVNA LARIONOVA 1 ANNA NIKOLAEVNA TIKHOMIROVA 2 1, 2 The National Nuclear Research

More information

Numerical Algorithms Group

Numerical Algorithms Group Title: Summary: Using the Component Approach to Craft Customized Data Mining Solutions One definition of data mining is the non-trivial extraction of implicit, previously unknown and potentially useful

More information

Healthcare Measurement Analysis Using Data mining Techniques

Healthcare Measurement Analysis Using Data mining Techniques www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242 Volume 03 Issue 07 July, 2014 Page No. 7058-7064 Healthcare Measurement Analysis Using Data mining Techniques 1 Dr.A.Shaik

More information

IBM SPSS Direct Marketing 22

IBM SPSS Direct Marketing 22 IBM SPSS Direct Marketing 22 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 22, release

More information

NEURAL NETWORKS IN DATA MINING

NEURAL NETWORKS IN DATA MINING NEURAL NETWORKS IN DATA MINING 1 DR. YASHPAL SINGH, 2 ALOK SINGH CHAUHAN 1 Reader, Bundelkhand Institute of Engineering & Technology, Jhansi, India 2 Lecturer, United Institute of Management, Allahabad,

More information

IBM SPSS Neural Networks 22

IBM SPSS Neural Networks 22 IBM SPSS Neural Networks 22 Note Before using this information and the product it supports, read the information in Notices on page 21. Product Information This edition applies to version 22, release 0,

More information

Healthcare Data Mining: Prediction Inpatient Length of Stay

Healthcare Data Mining: Prediction Inpatient Length of Stay 3rd International IEEE Conference Intelligent Systems, September 2006 Healthcare Data Mining: Prediction Inpatient Length of Peng Liu, Lei Lei, Junjie Yin, Wei Zhang, Wu Naijun, Elia El-Darzi 1 Abstract

More information

Prediction of Stock Performance Using Analytical Techniques

Prediction of Stock Performance Using Analytical Techniques 136 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 5, NO. 2, MAY 2013 Prediction of Stock Performance Using Analytical Techniques Carol Hargreaves Institute of Systems Science National University

More information

IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH

IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH Kalinka Mihaylova Kaloyanova St. Kliment Ohridski University of Sofia, Faculty of Mathematics and Informatics Sofia 1164, Bulgaria

More information

A Content based Spam Filtering Using Optical Back Propagation Technique

A Content based Spam Filtering Using Optical Back Propagation Technique A Content based Spam Filtering Using Optical Back Propagation Technique Sarab M. Hameed 1, Noor Alhuda J. Mohammed 2 Department of Computer Science, College of Science, University of Baghdad - Iraq ABSTRACT

More information

Decision Trees for Mining Data Streams Based on the Gaussian Approximation

Decision Trees for Mining Data Streams Based on the Gaussian Approximation International Journal of Computer Sciences and Engineering Open Access Review Paper Volume-4, Issue-3 E-ISSN: 2347-2693 Decision Trees for Mining Data Streams Based on the Gaussian Approximation S.Babu

More information

Course Syllabus. Purposes of Course:

Course Syllabus. Purposes of Course: Course Syllabus Eco 5385.701 Predictive Analytics for Economists Summer 2014 TTh 6:00 8:50 pm and Sat. 12:00 2:50 pm First Day of Class: Tuesday, June 3 Last Day of Class: Tuesday, July 1 251 Maguire Building

More information

Assessing Data Mining: The State of the Practice

Assessing Data Mining: The State of the Practice Assessing Data Mining: The State of the Practice 2003 Herbert A. Edelstein Two Crows Corporation 10500 Falls Road Potomac, Maryland 20854 www.twocrows.com (301) 983-3555 Objectives Separate myth from reality

More information

Using Data Mining for Mobile Communication Clustering and Characterization

Using Data Mining for Mobile Communication Clustering and Characterization Using Data Mining for Mobile Communication Clustering and Characterization A. Bascacov *, C. Cernazanu ** and M. Marcu ** * Lasting Software, Timisoara, Romania ** Politehnica University of Timisoara/Computer

More information

ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION

ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION ISSN 9 X INFORMATION TECHNOLOGY AND CONTROL, 00, Vol., No.A ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION Danuta Zakrzewska Institute of Computer Science, Technical

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct

More information

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear

More information

Statistics in Retail Finance. Chapter 7: Fraud Detection in Retail Credit

Statistics in Retail Finance. Chapter 7: Fraud Detection in Retail Credit Statistics in Retail Finance Chapter 7: Fraud Detection in Retail Credit 1 Overview > Detection of fraud remains an important issue in retail credit. Methods similar to scorecard development may be employed,

More information

REVIEW OF ENSEMBLE CLASSIFICATION

REVIEW OF ENSEMBLE CLASSIFICATION Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IJCSMC, Vol. 2, Issue.

More information

Customer Classification And Prediction Based On Data Mining Technique

Customer Classification And Prediction Based On Data Mining Technique Customer Classification And Prediction Based On Data Mining Technique Ms. Neethu Baby 1, Mrs. Priyanka L.T 2 1 M.E CSE, Sri Shakthi Institute of Engineering and Technology, Coimbatore 2 Assistant Professor

More information

Predictive Modeling in Workers Compensation 2008 CAS Ratemaking Seminar

Predictive Modeling in Workers Compensation 2008 CAS Ratemaking Seminar Predictive Modeling in Workers Compensation 2008 CAS Ratemaking Seminar Prepared by Louise Francis, FCAS, MAAA Francis Analytics and Actuarial Data Mining, Inc. www.data-mines.com Louise.francis@data-mines.cm

More information

Data Mining. Nonlinear Classification

Data Mining. Nonlinear Classification Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15

More information

Learning is a very general term denoting the way in which agents:

Learning is a very general term denoting the way in which agents: What is learning? Learning is a very general term denoting the way in which agents: Acquire and organize knowledge (by building, modifying and organizing internal representations of some external reality);

More information