Chapter-4 Forms & Steps in Data Mining Operations

Introduction

In today's marketplace, business managers must take timely advantage of high-return opportunities. Doing so requires that they be able to exploit the mountains of data their organizations generate and collect during daily operations. Yet the difficulty of discerning the value in that information - of separating the wheat from the chaff - prevents many companies from fully capitalizing on the wealth of data at their disposal. For example, a bank account manager might want to identify a group of married, two-income, affluent customers and send them information about the bank's growth mutual funds before a competing discount broker can lure them away. The information surely resides in the bank's computer system and has probably been there in some form for years. The trick, of course, is to find an efficient way to extract and apply it. Data mining is the process of extracting valid, previously unknown, comprehensible and actionable information from large databases and using it to make crucial business decisions; it currently performs this task for a growing range of businesses. After presenting an overview of current data mining techniques, this chapter explores two particularly noteworthy applications of those techniques: market basket analysis and customer segmentation.

4.1 FORMS OF DATA MINING

Data mining takes two forms. Verification-driven data mining extracts information in the process of validating a hypothesis postulated by a user; it involves techniques such as statistical and multidimensional analysis. Discovery-driven data mining uses tools such as symbolic and neural clustering, association discovery, and supervised
induction to automatically extract information. The extracted information from both approaches takes one of several forms: regression or classification models, relations between database records, and deviations from norms, among others. To be effective, a data mining application must do three things. First, it must have access to organization-wide views of data, instead of department-specific ones. Frequently the organization's data is supplemented with open-source or purchased data. The resulting database is called the data warehouse. During data integration, the application often cleans the data - by removing duplicates, deriving missing values (when possible), and establishing new, derived attributes, for example. Second, the data mining application must mine the information in the warehouse. Finally, it must organize and present the mined information in a way that enables decision making. Systems that can satisfy one or more of these requirements range from commercial decision-support systems to customized decision-support systems and executive information systems. The overall objective of each decision-making operation determines the type of information to be mined and the ways of organizing the mined information. For example, by establishing the objective of identifying good prospective customers for mutual funds, the bank account manager mentioned earlier implicitly indicates that she wants to segment the database of bank customers into groups of related customers - such as urban, married, two-income, mid-thirties, low-risk, high-net-worth individuals - and establish the vulnerability of each group to various types of promotional campaigns.

4.2 BASIC STEPS IN DATA MINING

Once a data warehouse has been developed, the data mining process falls into four basic steps: data selection, data transformation, data mining and result interpretation.
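The four steps above can be sketched as a simple pipeline. This is only an illustrative outline, not an implementation from the chapter; the function bodies, attribute names and threshold are hypothetical placeholders.

```python
# Illustrative sketch of the four data-mining steps as a pipeline.
# All function bodies are hypothetical placeholders, not a real system.

def select(warehouse):
    """Data selection: keep only the records relevant to the goal."""
    return [row for row in warehouse if row.get("segment") == "customer"]

def transform(rows):
    """Data transformation: derive new attributes, recode values."""
    for row in rows:
        row["income_ratio"] = row["income"] / max(row["spending"], 1)
    return rows

def mine(rows):
    """Data mining: here, a trivial 'model' that thresholds an attribute."""
    return [row for row in rows if row["income_ratio"] > 2]

def interpret(results):
    """Result interpretation: summarize findings for the decision maker."""
    return f"{len(results)} high-income-ratio customers found"

warehouse = [
    {"segment": "customer", "income": 90000, "spending": 30000},
    {"segment": "customer", "income": 40000, "spending": 35000},
    {"segment": "supplier", "income": 0, "spending": 0},
]
report = interpret(mine(transform(select(warehouse))))
print(report)  # -> 1 high-income-ratio customers found
```

In practice each stage would be far more elaborate, but the chaining of the four stages mirrors the process described in the text.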
4.2.1 DATA SELECTION

A data warehouse contains a variety of data, not all of which is needed to achieve each data-mining goal. The first step in the data-mining process is to select the target data. For example, marketing databases contain data describing customer purchases, demographics and lifestyle preferences. To identify which items and quantities to purchase for a particular store, as well as how to organize the items on the store's shelves, a marketing executive might need only to combine customer purchase data with demographic data. The selected data types may be organized along multiple tables; during data selection, the user might need to perform table joins. Furthermore, even after selecting the desired database tables, mining the contents of an entire table is not always necessary for identifying useful information. Under certain conditions and for certain types of data-mining operations (such as when creating a classification or regression model), it is usually a less expensive operation to sample the appropriate table, which might have been created by joining other tables, and then mine only the sample.

4.2.2 DATA TRANSFORMATION

After selecting the desired database tables and identifying the data to be mined, the user typically needs to perform certain transformations on the data. Three considerations dictate which transformations to use: the task (mailing-list creation, for example), the data-mining operation (such as predictive modeling), and the data-mining technique (such as neural networks) involved. Transformation methods include organizing data in desired ways (organizing individual consumer data by household) and converting one type of data to another (changing nominal values into numeric ones so that they can be processed by a neural network).
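The table join and sampling just described can be sketched with an in-memory SQLite database. The table names, columns, and sample size are invented for illustration; a real warehouse would involve far larger tables.

```python
import random
import sqlite3

# Hypothetical tables: customer purchases and demographics, joined on id.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE purchases (cust_id INTEGER, item TEXT, qty INTEGER);
    CREATE TABLE demographics (cust_id INTEGER, age INTEGER, city TEXT);
""")
con.executemany("INSERT INTO purchases VALUES (?, ?, ?)",
                [(1, "cereal", 2), (2, "coffee", 1), (3, "cereal", 5)])
con.executemany("INSERT INTO demographics VALUES (?, ?, ?)",
                [(1, 34, "Chicago"), (2, 51, "Detroit"), (3, 29, "Chicago")])

# Data selection: join the two tables into one target table.
joined = con.execute("""
    SELECT p.cust_id, p.item, p.qty, d.age, d.city
    FROM purchases p JOIN demographics d ON p.cust_id = d.cust_id
""").fetchall()

# Rather than mining the whole join, mine only a random sample of it.
random.seed(0)
sample = random.sample(joined, k=2)  # hypothetical sample size
print(len(joined), len(sample))      # -> 3 2
```

The point of the sample is the one made in the text: for operations such as building a classification or regression model, mining a sample of the joined table is often much cheaper than mining the whole table.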
Another transformation type, the definition of new attributes (derived attributes), involves applying mathematical or logical operations on one or more database attributes - for example, by defining the ratio of two attributes.

4.2.3 DATA MINING

The user subsequently mines the transformed data using one or more techniques to extract the desired type of information. For example, to develop an accurate, symbolic classification model that predicts whether magazine subscribers will renew their subscriptions, a circulation manager might need to first use clustering to segment the subscriber database, then apply rule induction to automatically create a classification model for each desired cluster.

4.2.4 RESULT INTERPRETATION

The user must finally analyze the mined information according to his decision-support task and goals. Such analysis identifies the best of the information. For example, if a classification model has been developed, during result interpretation the data-mining application will test the model's robustness, using established error-estimation methods such as cross-validation. During this step, the user must also determine how best to present the selected mining-operation results to the decision maker, who will apply them in taking specific actions. (In certain domains, the user of the data-mining application - usually a business analyst - is not the decision maker. The latter may take business decisions by capitalizing on the data-mining results through a simple query and reporting tool.) For example, the user might decide that the best way to present the classification model is logically in the form of if-then rules. Three observations emerge from this four-step process. Mining is only one step in the overall process. The quality of the mined information is a function of both the effectiveness of the data-mining
technique used and the quality, and often size, of the data being mined. If users select the wrong data, choose inappropriate attributes, or transform the selected data inappropriately, the results will likely suffer. The process is not linear but involves a variety of feedback loops. After selecting a particular data-mining technique, a user might determine that the selected data must be preprocessed in particular ways, or that the applied technique did not produce results of the expected quality. The user then must repeat earlier steps, which might mean restarting the entire process from the beginning. Visualization plays an important role in the various steps. In particular, during the selection and transformation steps, a user could use statistical visualization - such as scatter plots or histograms - to display the results of exploratory data analysis. Such exploratory analyses often provide a preliminary understanding of the data, which helps the user select certain data subsets. During the mining step, the user employs domain-specific visualizations. Finally, visualizations - either specialized landscapes or business graphics - can present the result of a mining operation.

4.3 VERIFICATION-DRIVEN DATA MINING OPERATIONS

Seven operations are associated with data mining: three with verification-driven data mining and four with discovery-driven data mining. Verification-driven data-mining operations include query and reporting, multidimensional analysis, and statistical analysis.

4.3.1 QUERY AND REPORTING

This operation constitutes the most basic form of decision support and data mining. Its goal is to validate a hypothesis expressed by the user, such as "sales of four-wheel-drive vehicles increase during the winter".
Validating a hypothesis through a query and reporting operation entails creating a query, or set of queries, that best expresses the stated hypothesis, posing the query to the database, and analyzing the returned data to establish whether it supports or refutes the hypothesis. Each data interpretation or analysis step might lead to additional queries, either new ones or refinements of the initial one. Reports subsequently compiled for distribution throughout an organization contain selected analysis results, presented in graphical, tabular and textual form, and include a subset of the queries; because they include the queries, the analysis can be automatically repeated at predefined times, such as once a month.

4.3.2 MULTIDIMENSIONAL ANALYSIS

While traditional query and reporting suffices for several types of verification-driven data mining, effective data mining in certain domains requires the creation of very complex queries. These often contain an embedded temporal dimension and may also express change between two stated events. For example, the regional manager of a department store chain might say, "Show me weekly sales during the first quarter of 1994 and 1995, for Midwestern stores, broken down by department". Multidimensional databases, often implemented as multidimensional arrays, organize data along predefined dimensions (time or department, for example), have facilities for taking advantage of sparsely populated portions of the multidimensional structure, and provide specialized languages that facilitate querying along dimensions while expediting query-processing performance. These databases also allow hierarchical organization of the data along each dimension, with summaries at the higher levels of the hierarchy and the actual data at the lower levels. Quarterly sales might occupy one level of summarization and monthly sales a second level, with the actual daily sales occupying the lowest level of the hierarchy.
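The hierarchical roll-up just described can be sketched in a few lines of Python. The sales figures, departments and periods below are invented for illustration; a real multidimensional database would implement the same idea with specialized array structures and query languages.

```python
from collections import defaultdict

# Hypothetical fact data: (month, department) -> sales already summed
# to the monthly level of the time hierarchy.
monthly_sales = {
    ("1995-01", "sporting goods"): 120.0,
    ("1995-02", "sporting goods"): 95.0,
    ("1995-03", "sporting goods"): 110.0,
    ("1995-01", "appliances"): 80.0,
    ("1995-02", "appliances"): 70.0,
    ("1995-03", "appliances"): 90.0,
}

# Roll up the time dimension: monthly level -> quarterly summary level.
quarterly = defaultdict(float)
for (month, dept), amount in monthly_sales.items():
    quarter = "1995-Q1"  # all sample months fall in the first quarter
    quarterly[(quarter, dept)] += amount

print(quarterly[("1995-Q1", "sporting goods")])  # -> 325.0
print(quarterly[("1995-Q1", "appliances")])      # -> 240.0
```

Queries along a dimension (here, department) then read from whichever level of summarization matches the question being asked.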
4.3.3 STATISTICAL ANALYSIS

Simple statistical analysis operations (such as first-order statistics) usually execute during both query and reporting, as well as during multidimensional analysis. Verifying more complex hypotheses, however, requires statistical operations (such as principal-component analysis and regression modeling), coupled with data visualization tools. Statistical packages (SAS, SPSS, S+) incorporate components that can be used for discovery-driven modeling (such as CHAID in SPSS and S+). To be effective, statistical analysis must rest on a methodology, such as exploratory data analysis. A methodology might need to be business- or domain-dependent, so statistics tools such as SAS and SPSS are open-ended, providing function libraries that can be organized into larger analysis software systems.

4.4 Evaluation Measures

Since multi-label classification has been investigated mostly in text categorisation, there is very little work on developing evaluation measures for its classifiers. There are no standard evaluation techniques applicable to multi-label classification problems. Moreover, choosing the right measure is often problematic and depends heavily on the features of the problem at hand, such as those used in [3]. In this section, we introduce three evaluation measures suitable for the majority of binary, multi-class and multi-label classification problems.

4.4.1 Top-label

This evaluation measure takes into consideration only the top-ranked class label and ignores any other labels associated with an instance. For a traditional classification task, where there is only one class label to assign to the test object, and given an instance and its associated class label <d, y>, a classifier H predicts a list of ranked class labels Yj = <yj^1, yj^2, yj^3, ..., yj^k>. If the predicted first class label matches the true class label y of the instance, i.e. yj^1 = y, then the classification is correct.
The top-label method estimates how many times the top-ranked class label is the correct class label. So, for a set of single-class instances I = <(x1, y1), (x2, y2), ..., (xm, ym)>, the top-label score is (1/m) * Σ(j=1..m) I(yj^1 = yj), where m represents the number of instances.

4.5 Entropy-based Associative Classifier

We denote as class association rules (CARs) [18] those association rules of the form X → c, where the antecedent (X) is composed of feature variables and the consequent (c) is just a class. CARs may be generated by a slightly modified association rule mining algorithm: each itemset must contain a class, and rule generation also follows a template in which the consequent is just a class. CARs are essentially decision rules, and as in the case of decision trees, CARs are ranked in decreasing order of information gain. Finally, during the testing phase, the associative classifier simply checks whether each CAR matches the test instance; the class associated with the first match is chosen. Note that, seen in the light of CARs, a decision tree is simply a greedy search for CARs, using a level-wise search algorithm that only expands the current best rule with other features. On the other hand, an eager associative classifier mines all possible CARs with a given minsup. It is also interesting to note that sorting the final rule-set on information gain, and using the best CAR for classification, is also a greedy strategy. While the greedy approach has its limitations, eager associative classifiers are not limited by the prefix problem of decision rules, that is, once the best feature is chosen at each node, all nodes under that subtree must contain it.

Let D be the set of all n training instances.
Let T be the set of all m test instances.
1. Let Ce be the set of all rules {X → c} mined from D
2. Sort Ce according to information gain
3. for each ti ∈ T do
4.   Pick the first rule {X → c} ∈ Ce such that X ⊆ ti
5.   Predict class c
This shows the basic steps of the eager associative classifier. In the initial step, the algorithm mines all frequent CARs and sorts them in descending order of information gain. Then, for each test instance ti, the first CAR matching ti is used to predict the class. Figure 4 shows an associative classifier built from our example set of training instances, using the above algorithm. Three CARs match the test instance of our example (last row of Table 1):

1. {windy=false and temperature=cool → play=yes}
2. {outlook=sunny and humidity=high → play=no}
3. {outlook=sunny and temperature=cool → play=yes}

Rule {windy=false and temperature=cool → play=yes} would be selected, since it is the best-ranked CAR. By applying this CAR, the test instance will be correctly classified. Intuitively, associative classifiers perform better than decision trees because associative classifiers allow several CARs to cover the same partition of the training data. In our example, the test case is recognized by only one rule in the decision tree, while the same test case is recognized by three CARs in the associative classifier. Selecting the proper CAR to apply is an issue in associative classification. Next we present a theoretical discussion of the performance of decision trees and eager associative classifiers.

Theorem 1. The rules derived from a decision tree are a subset of the CARs mined using an eager associative classifier based on information gain.

Proof 1. Let maxe be the maximum entropy of all decision tree rules. Select a set Ce from all CARs such that their entropy is at most maxe. It is clear that the decision tree rules are a subset of Ce.

Theorem 1 states that, for a given minsup, the CARs contain (at least) all the information of the corresponding decision tree. Since each decision tree rule may be seen as a CAR, and since all possible CARs were enumerated, the decision tree can be built by choosing the proper CARs.
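The eager algorithm above can be sketched in Python. The toy weather-style dataset, the minsup value, and the definition of a rule's information gain as the gain of the binary split "matches the antecedent vs. does not" are illustrative assumptions, not the chapter's exact formulation.

```python
from itertools import combinations
from math import log2

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def info_gain(train, antecedent):
    """Information gain of the binary split: matches antecedent vs. not."""
    cover = [c for feats, c in train if antecedent <= feats]
    rest = [c for feats, c in train if not antecedent <= feats]
    n = len(train)
    after = sum(len(p) / n * entropy(p) for p in (cover, rest) if p)
    return entropy([c for _, c in train]) - after

def mine_cars(train, minsup=2, max_len=2):
    """Mine CARs X -> c (antecedents up to max_len features covering at
    least minsup instances), ranked in descending order of information gain."""
    items = sorted({f for feats, _ in train for f in feats})
    cars = []
    for size in range(1, max_len + 1):
        for combo in combinations(items, size):
            ant = frozenset(combo)
            covered = [c for feats, c in train if ant <= feats]
            if len(covered) >= minsup:
                majority = max(sorted(set(covered)), key=covered.count)
                cars.append((ant, majority, info_gain(train, ant)))
    cars.sort(key=lambda car: -car[2])
    return cars

def eager_predict(cars, instance):
    """Predict with the first (best-ranked) CAR whose antecedent matches."""
    for ant, cls, _ in cars:
        if ant <= instance:
            return cls
    return None

train = [  # hypothetical training instances: (feature set, class)
    ({"outlook=sunny", "humidity=high"}, "no"),
    ({"outlook=sunny", "humidity=normal"}, "yes"),
    ({"outlook=rain", "windy=true"}, "no"),
    ({"outlook=rain", "windy=false"}, "yes"),
    ({"outlook=overcast", "windy=false"}, "yes"),
    ({"outlook=overcast", "humidity=high"}, "yes"),
]
cars = mine_cars(train)
prediction = eager_predict(cars, {"outlook=overcast", "windy=false"})
print(prediction)  # -> yes
```

Note the greedy character discussed in the text: all CARs are mined eagerly up front, but only the single best-ranked matching CAR decides each prediction.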
Theorem 2. CARs perform no worse than decision tree rules, according to the information gain principle.
Proof 2. Given an instance to be classified and, without loss of generality, a decision tree with just pure leaves, the decision tree predicts class c for that instance. We analyze two scenarios: first, just one CAR matches the instance; and second, more than one CAR matches. When just one CAR matches, it is the same as the decision tree rule, since the set of CARs subsumes the set of decision rules. In this case, the associative classifier and the decision tree make the same prediction. When more than one CAR matches an instance, the prediction may be either the same class (say c) as the matching decision rule or another class. If the associative classifier predicts c, then the two approaches are equivalent. In case a class other than c is predicted, by definition, the best matching CAR provides a better information gain than the decision rule, and thus, according to the information gain principle, the CAR will make a better prediction.

Theorem 2 states that the additional CARs of the associative classifier that are not in the decision tree cannot degrade the classification accuracy. This is because an additional CAR is only used if it is better than all decision rules (according to the information gain principle). However, eager associative classifiers generate a large number of CARs, most of which are useless during classification. For instance, from the set of 13 CARs shown in Figure 4, only 3 match the test instance (the remaining 10 CARs are useless). Next, we present a lazy classifier and compare it to the eager version described in this section.

4.6 Lazy Associative Classifier

Unlike the eager associative classifier, which extracts a set of ranked CARs from the training data, the lazy associative classifier induces CARs specific to each test instance. The lazy approach projects the training data, D, only on those features in the test instance, A. From this projected training data, DA, the CARs are induced and ranked, and the best CAR is used.
From the set of all training instances, D, only the instances sharing at least one feature with the test instance A are used to form DA.
Then, a rule-set C^l_A is generated from DA. Since DA contains only features in A, all CARs generated from DA must match A. The lazy associative classifier is presented in Figure 5.

Let D be the set of all n training instances.
Let T be the set of all m test instances.
1. for each ti ∈ T do
2.   Let Dti be the projection of D on only the features from ti
3.   Let C^l_ti be the set of all rules {X → c} mined from Dti
4.   Sort C^l_ti according to information gain
5.   Pick the first rule {X → c} ∈ C^l_ti, and predict class c

Figure 5. Lazy Associative Classifier

Now we demonstrate that the lazy associative classifier produces better results than its eager counterpart. Given a test instance A and a set of CARs C, we denote by CA those CARs {X → c} in C where X ⊆ A.

4.6.1 Any-label

This evaluation technique measures how many times any of the predicted labels of an instance matches the actual class label in all cases of that instance in the test data. The classification is considered correct if any of the predicted class labels of an instance matches the true class label y.

4.6.2 Label-weight

This technique enables each predicted label for an instance to play a role in classifying a test case, based on its ranking, and therefore it could be considered a multi-label evaluation measure. An instance may belong to several class labels, each one associated with it by a number of occurrences in the training data. Each class label can be assigned a weight according to how many times that label has been associated with the instance. Let rule rj be associated with a list of ranked labels.
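Figure 5 can be sketched as follows. The toy dataset is invented, and for brevity this sketch ranks the projected CARs by confidence rather than information gain; the projection step of the lazy approach is the point being illustrated.

```python
from itertools import combinations

def lazy_classify(train, test_instance, minsup=1, max_len=2):
    """Lazy associative classification: project the training data onto the
    test instance's features, mine CARs from the projection, rank, predict."""
    # Step 2 (Figure 5): keep only features that occur in the test instance.
    projected = [(feats & test_instance, cls) for feats, cls in train
                 if feats & test_instance]
    # Step 3: mine CARs X -> c from the projection; every X matches the test.
    cars = []
    items = sorted({f for feats, _ in projected for f in feats})
    for size in range(1, max_len + 1):
        for combo in combinations(items, size):
            ant = frozenset(combo)
            covered = [cls for feats, cls in projected if ant <= feats]
            if len(covered) >= minsup:
                for cls in sorted(set(covered)):
                    confidence = covered.count(cls) / len(covered)
                    cars.append((ant, cls, confidence))
    # Step 4: rank (by confidence here, for brevity; the text uses info gain).
    cars.sort(key=lambda car: -car[2])
    # Step 5: the best CAR's class is the prediction.
    return cars[0][1] if cars else None

train = [  # hypothetical training instances: (feature set, class)
    ({"outlook=sunny", "humidity=high"}, "no"),
    ({"outlook=sunny", "humidity=normal"}, "yes"),
    ({"outlook=rain", "humidity=high"}, "yes"),
]
result = lazy_classify(train, {"outlook=sunny", "humidity=high"})
print(result)  # -> no
```

Because mining happens inside the per-instance loop, every mined CAR is guaranteed to match the test instance, which is exactly the property the text derives from the projection DA.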
We have conducted an extensive performance study to evaluate the accuracy and efficiency of CPAR and compare it with that of C4.5 [8], RIPPER [3], CBA [7] and CMAR [6]. As in [7] and [6], 26 datasets from the UCI Machine Learning Repository are used. All the experiments are performed on a 1.7GHz Pentium-4 PC with 1GB main memory. All the approaches are implemented by their authors. The parameters of CPAR are set as follows: in the rule generation algorithm, the first parameter is set to 0.05, min gain to 0.7, and the third to 2/3. The best 5 rules are used in prediction. Table 1 shows the accuracy of the five approaches on the 26 datasets from the UCI ML Repository. 10-fold cross-validation is used for every dataset. Table 2 compares the running (training) time of RIPPER, CMAR (which is claimed to be more efficient than CBA) and CPAR on the 26 datasets. Notice that Table 2 uses both arithmetic and geometric averages. This is because the running times on different datasets differ a lot, and the arithmetic average is dominated by the most time-consuming datasets. Using the geometric average, equal weight is put on every dataset. Thus we consider the geometric average a more reasonable measure. Table 3 shows the average number of rules used in RIPPER, CMAR and CPAR.
Dataset     C4.5    RIPPER  CBA     CMAR    CPAR
anneal      94.8    95.8    97.9    97.3    98.4
austral     84.7    87.3    84.9    86.1    86.2
auto        80.1    72.8    78.3    78.1    82.0
breast      95.0    95.1    96.3    96.4    96.0
cleve       78.2    82.2    82.8    82.2    81.5
crx         84.9    84.9    84.7    84.9    85.7
diabetes    74.2    74.7    74.5    75.8    75.1
german      72.3    69.8    73.4    74.9    73.4
glass       68.7    69.1    73.9    70.1    74.4
heart       80.8    80.7    81.9    82.2    82.6
hepatic     80.6    76.7    81.8    80.5    79.4
horse       82.6    84.8    82.1    82.6    84.2
hypo        99.2    98.9    98.9    98.4    98.1
iono        90.0    91.2    92.3    91.5    92.6
iris        95.3    94.0    94.7    94.0    94.7
labor       79.3    84.0    86.3    89.7    84.7
led7        73.5    69.7    71.9    72.5    73.6
lymph       73.5    79.0    77.8    83.1    82.3
pima        75.5    73.1    72.9    75.1    73.8
sick        98.5    97.7    97.0    97.5    96.8
sonar       70.2    78.4    77.5    79.4    79.3
tic-tac     99.4    98.0    99.6    99.2    98.6
vehicle     72.6    62.7    68.7    68.8    69.5
waveform    78.1    76.0    80.0    83.2    80.9
wine        92.7    91.6    95.0    95.0    95.5
zoo         92.2    88.1    96.8    97.1    95.1
Average     83.34   82.93   84.69   85.22   85.17

Table 1: Accuracy: C4.5, RIPPER, CBA, CMAR and CPAR

                    RIPPER  CMAR    CPAR
Arithmetic average  0.218   30.24   0.555
Geometric average   0.036   2.877   0.105

Table 2: Running time (in sec.): RIPPER, CMAR and CPAR
                    RIPPER  CMAR    CPAR
Arithmetic average  8.20    305     244
Geometric average   5.74    185     106

Table 3: Number of rules: RIPPER, CMAR and CPAR
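The arithmetic-versus-geometric-average point behind Tables 2 and 3 can be seen with a small computation. The per-dataset running times below are invented purely to illustrate why a single slow dataset dominates the arithmetic mean while the geometric mean weights every dataset equally.

```python
from math import prod

# Hypothetical per-dataset running times (seconds); one dataset is very slow.
times = [0.02, 0.05, 0.03, 0.04, 12.0]

arithmetic = sum(times) / len(times)          # pulled up by the 12 s outlier
geometric = prod(times) ** (1 / len(times))   # equal weight per dataset

print(round(arithmetic, 3))  # -> 2.428
print(round(geometric, 3))   # -> 0.108
```

The arithmetic mean says the method takes seconds per dataset, even though it finished in tens of milliseconds on four of the five; the geometric mean reflects the typical case, which is why the text prefers it.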