Chapter 4. Forms & Steps in Data Mining Operations



Introduction

In today's marketplace, business managers must take timely advantage of high-return opportunities. Doing so requires that they be able to exploit the mountains of data their organizations generate and collect during daily operations. Yet the difficulty of discerning the value in that information, of separating the wheat from the chaff, prevents many companies from fully capitalizing on the wealth of data at their disposal. For example, a bank account manager might want to identify a group of married, two-income, affluent customers and send them information about the bank's growth mutual funds before a competing discount broker can lure them away. The information surely resides in the bank's computer system and has probably been there in some form for years. The trick, of course, is to find an efficient way to extract and apply it. Data mining is the process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions; it currently performs this task for a growing range of businesses. After presenting an overview of current data mining techniques, this chapter explores two particularly noteworthy applications of those techniques: market basket analysis and customer segmentation.

4.1 FORMS OF DATA MINING

Data mining takes two forms. Verification-driven data mining extracts information in the process of validating a hypothesis postulated by a user; it involves techniques such as statistical and multidimensional analysis. Discovery-driven data mining uses tools such as symbolic and neural clustering, association discovery, and supervised

induction to automatically extract information. The extracted information from both approaches takes one of several forms: regression or classification models, relations between database records, and deviations from norms, among others.

To be effective, a data mining application must do three things. First, it must have access to organization-wide views of data, instead of department-specific ones. Frequently the organization's data is supplemented with open-source or purchased data; the resulting database is called the data warehouse. During data integration, the application often cleans the data, for example by removing duplicates, deriving missing values (when possible), and establishing new, derived attributes. Second, the data mining application must mine the information in the warehouse. Finally, it must organize and present the mined information in a way that enables decision making. Systems that can satisfy one or more of these requirements range from commercial decision-support systems to customized decision-support systems and executive information systems.

The overall objective of each decision-making operation determines the type of information to be mined and the ways of organizing the mined information. For example, by establishing the objective of identifying good prospective customers for mutual funds, the bank account manager mentioned earlier implicitly indicates that she wants to segment the database of bank customers into groups of related customers, such as urban, married, two-income, mid-thirties, low-risk, high-net-worth individuals, and to establish the vulnerability of each group to various types of promotional campaigns.

4.2 BASIC STEPS IN DATA MINING

Once a data warehouse has been developed, the data mining process falls into four basic steps: data selection, data transformation, data mining, and result interpretation.
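The cleaning operations mentioned during data integration, removing duplicates, deriving missing values, and establishing derived attributes, can be sketched with pandas (assumed installed; the customer records below are invented):

```python
import pandas as pd

# Hypothetical customer records pulled from two departmental sources.
customers = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "income": [52000.0, 52000.0, None, 81000.0],
    "spouse_income": [48000.0, 48000.0, 39000.0, None],
})

# Remove duplicate records introduced by merging departmental views.
customers = customers.drop_duplicates()

# Derive a missing value where possible (here, a simple median fill).
customers["income"] = customers["income"].fillna(customers["income"].median())

# Establish a new, derived attribute: total household income.
customers["household_income"] = customers["income"] + customers["spouse_income"]

print(customers)
```

The same three operations scale to any warehouse-style integration step; only the imputation and derivation rules change per domain.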

4.2.1 DATA SELECTION

A data warehouse contains a variety of data, not all of which is needed to achieve each data-mining goal. The first step in the data-mining process is to select the target data. For example, marketing databases contain data describing customer purchases, demographics, and lifestyle preferences. To identify which items and quantities to purchase for a particular store, as well as how to organize the items on the store's shelves, a marketing executive might need only to combine customer purchase data with demographic data. The selected data types may be organized along multiple tables, so during data selection the user might need to perform table joins. Furthermore, even after selecting the desired database tables, mining the contents of an entire table is not always necessary for identifying useful information. Under certain conditions and for certain types of data-mining operations (such as when creating a classification or regression model), it is usually less expensive to sample the appropriate table, which might have been created by joining other tables, and then mine only the sample.

4.2.2 DATA TRANSFORMATION

After selecting the desired database tables and identifying the data to be mined, the user typically needs to perform certain transformations on the data. Three considerations dictate which transformations to use: the task (mailing-list creation, for example), the data-mining operation (such as predictive modeling), and the data-mining technique (such as neural networks) involved. Transformation methods include organizing data in desired ways (organizing individual consumer data by household) and converting one type of data to another (changing nominal values into numeric ones so that they can be processed by a neural network).
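Both kinds of transformation named above, reorganizing individual data by household and converting a nominal attribute to numeric form, can be sketched with pandas (assumed installed; the consumer records are invented):

```python
import pandas as pd

# Hypothetical individual consumer records.
people = pd.DataFrame({
    "household_id": [10, 10, 11],
    "spend": [120.0, 80.0, 95.0],
    "region": ["urban", "urban", "rural"],
})

# Organize individual consumer data by household.
households = people.groupby("household_id", as_index=False).agg(
    spend=("spend", "sum"), region=("region", "first"))

# Convert a nominal attribute into numeric indicator columns,
# e.g. so that a neural network can process it.
encoded = pd.get_dummies(households, columns=["region"])
print(encoded)
```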

Another transformation type, the definition of new attributes (derived attributes), involves applying mathematical or logical operations on one or more database attributes, for example by defining the ratio of two attributes.

4.2.3 DATA MINING

The user subsequently mines the transformed data using one or more techniques to extract the desired type of information. For example, to develop an accurate, symbolic classification model that predicts whether magazine subscribers will renew their subscriptions, a circulation manager might need to first use clustering to segment the subscriber database, then apply rule induction to automatically create a classification model for each desired cluster.

4.2.4 RESULT INTERPRETATION

The user must finally analyze the mined information according to his decision-support task and goals. Such analysis identifies the most useful of the mined information. For example, if a classification model has been developed, during result interpretation the data-mining application will test the model's robustness, using established error-estimation methods such as cross validation. During this step, the user must also determine how best to present the selected mining-operation results to the decision maker, who will apply them in taking specific actions. (In certain domains, the user of the data-mining application, usually a business analyst, is not the decision maker. The latter may take business decisions by capitalizing on the data-mining results through a simple query and reporting tool.) For example, the user might decide that the best way to present the classification model is logically, in the form of if-then rules.

Three observations emerge from this four-step process. Mining is only one step in the overall process. The quality of the mined information is a function of both the effectiveness of the data-mining

technique used and the quality, and often size, of the data being mined. If users select the wrong data, choose inappropriate attributes, or transform the selected data inappropriately, the results will likely suffer.

The process is not linear but involves a variety of feedback loops. After selecting a particular data-mining technique, a user might determine that the selected data must be preprocessed in particular ways or that the applied technique did not produce results of the expected quality. The user then must repeat earlier steps, which might mean restarting the entire process from the beginning.

Visualization plays an important role in the various steps. In particular, during the selection and transformation steps, a user could use statistical visualizations, such as scatter plots or histograms, to display the results of exploratory data analysis. Such exploratory analyses often provide a preliminary understanding of the data, which helps the user select certain data subsets. During the mining step, the user employs domain-specific visualizations. Finally, visualizations, whether special landscapes or business graphics, can present the results of a mining operation.

4.3 VERIFICATION-DRIVEN DATA MINING OPERATIONS

Seven operations are associated with data mining: three with verification-driven data mining and four with discovery-driven data mining. The verification-driven data-mining operations include query and reporting, multidimensional analysis, and statistical analysis.

4.3.1 QUERY AND REPORTING

This operation constitutes the most basic form of decision support and data mining. Its goal is to validate a hypothesis expressed by the user, such as "sales of four-wheel-drive vehicles increase during the winter".
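A hypothesis of this kind can be checked directly with a query. A minimal sketch using Python's built-in sqlite3 module, with invented monthly sales figures:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (model TEXT, month INTEGER, units INTEGER)")
# Hypothetical monthly sales of four-wheel-drive vehicles.
rows = [("4wd", m, u) for m, u in
        [(1, 90), (4, 40), (7, 30), (10, 55), (12, 85)]]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

# The queries expressing the hypothesis: compare winter months to the rest.
winter = conn.execute(
    "SELECT AVG(units) FROM sales WHERE model='4wd' AND month IN (12, 1, 2)"
).fetchone()[0]
other = conn.execute(
    "SELECT AVG(units) FROM sales WHERE model='4wd' AND month NOT IN (12, 1, 2)"
).fetchone()[0]
print("hypothesis supported:", winter > other)
```

Analyzing the returned averages establishes whether the data supports or refutes the stated hypothesis, exactly as described above.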

Validating a hypothesis through a query and reporting operation entails creating a query, or set of queries, that best expresses the stated hypothesis, posing the query to the database, and analyzing the returned data to establish whether it supports or refutes the hypothesis. Each data interpretation or analysis step might lead to additional queries, either new ones or refinements of the initial one. Reports subsequently compiled for distribution throughout an organization contain selected analysis results, presented in graphical, tabular, and textual form, and include a subset of the queries. Because the reports include the queries, the analysis can be automatically repeated at predefined times, such as once a month.

4.3.2 MULTIDIMENSIONAL ANALYSIS

While traditional query and reporting suffices for several types of verification-driven data mining, effective data mining in certain domains requires the creation of very complex queries. These often contain an embedded temporal dimension and may also express change between two stated events. For example, the regional manager of a department store chain might say, "Show me weekly sales during the first quarter of 1994 and 1995, for Midwestern stores, broken down by department."

Multidimensional databases, often implemented as multidimensional arrays, organize data along predefined dimensions (time or department, for example), have facilities for taking advantage of sparsely populated portions of the multidimensional structure, and provide specialized languages that facilitate querying along dimensions while enhancing query-processing performance. These databases also allow hierarchical organization of the data along each dimension, with summaries at the higher levels of the hierarchy and the actual data at the lower levels. Quarterly sales might take one level of summarization and monthly sales a second level, with the actual daily sales taking the lowest level of the hierarchy.
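A cube-style view like the manager's request can be sketched with a pandas pivot table (pandas assumed installed; the sales facts below are invented), with the margins acting as the higher summary level of the hierarchy:

```python
import pandas as pd

# Hypothetical weekly sales facts with time and department dimensions.
sales = pd.DataFrame({
    "year": [1994, 1994, 1995, 1995],
    "week": [1, 1, 1, 1],
    "department": ["shoes", "toys", "shoes", "toys"],
    "sales": [1200.0, 800.0, 1350.0, 900.0],
})

# Sales broken down by department and year; margins=True adds the
# "All" summary row and column, one level up the hierarchy.
cube = pd.pivot_table(sales, values="sales", index="department",
                      columns="year", aggfunc="sum", margins=True)
print(cube)
```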

4.3.3 STATISTICAL ANALYSIS

Simple statistical analysis operations (such as first-order statistics) usually execute during both query and reporting and during multidimensional analysis. Verifying more complex hypotheses, however, requires statistical operations (such as principal-component analysis and regression modeling) coupled with data visualization tools. Statistical packages (SAS, SPSS, S+) incorporate components that can be used for discovery-driven modeling (such as CHAID in SPSS and S+). To be effective, statistical analysis must rest on a methodology, such as exploratory data analysis. A methodology might need to be business- or domain-dependent, so statistics tools such as SAS and SPSS are open-ended, providing function libraries that can be organized into larger analysis software systems.

4.4 Evaluation Measures

Since multi-label classification has been investigated mostly in text categorisation, there is very little work on developing evaluation measures for its classifiers. There are no standard evaluation techniques applicable to multi-label classification problems. Moreover, the right measure is often problematic and depends heavily on the features of the problem at hand, such as those used in [3]. In this section, we introduce three evaluation measures suitable for the majority of binary, multi-class, and multi-label classification problems.

4.4.1 Top-label

This evaluation measure takes into consideration only the top-ranked class label and ignores any other labels associated with an instance. For a traditional classification task, where there is only one class label to assign to the test object, and given an instance and its associated class label <d, y>, a classifier H predicts a list of ranked class labels Yj = <yj^1, yj^2, yj^3, ..., yj^k>. If the predicted first class label matches the true class label y of the instance, i.e. yj^1 = y, then the classification is correct.
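This criterion can be written as a small scoring function in pure Python (a sketch; the ranked label lists and true labels below are invented):

```python
def top_label_accuracy(ranked_predictions, true_labels):
    """Fraction of instances whose top-ranked predicted label is correct."""
    hits = sum(1 for ranked, y in zip(ranked_predictions, true_labels)
               if ranked[0] == y)
    return hits / len(true_labels)

# Hypothetical ranked label lists for three test instances.
preds = [["spam", "ham"], ["ham", "spam"], ["spam", "ham"]]
truth = ["spam", "spam", "spam"]
print(top_label_accuracy(preds, truth))
```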
The top-label method estimates how many times the top-ranked class label is the correct class label. So, for a set of m single-class instances (x1, y1), (x2, y2), ..., (xm, ym), the top-label score is (1/m) Σ_{j=1}^{m} I(ŷj^1 = yj), where ŷj^1 is the top-ranked label predicted for instance j and I(·) equals 1 when its argument holds and 0 otherwise.

4.5 Entropy-based Associative Classifier

We denote as class association rules (CARs) [18] those association rules of the form X → c, where the antecedent (X) is composed of feature variables and the consequent (c) is just a class. CARs may be generated by a slightly modified association rule mining algorithm: each itemset must contain a class, and rule generation follows a template in which the consequent is just a class. CARs are essentially decision rules and, as in the case of decision trees, CARs are ranked in decreasing order of information gain. Finally, during the testing phase, the associative classifier simply checks whether each CAR matches the test instance; the class associated with the first match is chosen.

Note that, seen in the light of CARs, a decision tree is simply a greedy search for CARs, using a level-wise search algorithm that only expands the current best rule with other features. On the other hand, an eager associative classifier mines all possible CARs with a given minsup. It is also interesting to note that sorting the final rule set on information gain, and using the best CAR for classification, is also a greedy strategy. While the greedy approach has its limitations, eager associative classifiers are not limited by the prefix problem of decision rules, that is, once the best feature is chosen at a node, all rules under that subtree must contain it.

Let D be the set of all n training instances. Let T be the set of all m test instances.
1. Let Ce be the set of all rules {X → c} mined from D
2. Sort Ce according to information gain
3. for each ti ∈ T do
4.   Pick the first rule {X → c} ∈ Ce such that X ⊆ ti
5.   Predict class c

This listing shows the basic steps of the eager associative classifier. In the initial step, the algorithm mines all frequent CARs and sorts them in descending order of information gain. Then, for each test instance ti, the first CAR matching ti is used to predict the class. An associative classifier built from our example set of training instances, using the above algorithm, illustrates the procedure. Three CARs match the test instance of our example (last row of Table 1):

1. {windy=false and temperature=cool → play=yes}
2. {outlook=sunny and humidity=high → play=no}
3. {outlook=sunny and temperature=cool → play=yes}

The rule {windy=false and temperature=cool → play=yes} would be selected, since it is the best-ranked CAR. By applying this CAR, the test instance is correctly classified. Intuitively, associative classifiers perform better than decision trees because they allow several CARs to cover the same partition of the training data. In our example, the test case is recognized by only one rule in the decision tree, while the same test case is recognized by three CARs in the associative classifier. Selecting the proper CAR to apply is an issue in associative classification. Next we present a theoretical discussion of the performance of decision trees and eager associative classifiers.

Theorem 1. The rules derived from a decision tree are a subset of the CARs mined using an eager associative classifier based on information gain.

Proof 1. Let maxe be the maximum entropy of all decision tree rules. Select a set Ce from all CARs such that their entropy is at most maxe. It is clear that the decision tree rules are a subset of Ce.

Theorem 1 states that, for a given minsup, the CARs contain (at least) all the information of the corresponding decision tree. Since each decision tree rule may be seen as a CAR, and since all possible CARs were enumerated, the decision tree can be built by choosing the proper CARs.
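The eager procedure described above can be sketched in pure Python (a simplified illustration with invented weather-style data; antecedents are limited to one or two features, and the majority class of the matching instances stands in for the mined consequent):

```python
from collections import Counter
from itertools import combinations
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(train, antecedent):
    # Gain from partitioning the training set by whether the antecedent matches.
    labels = [y for _, y in train]
    match = [y for x, y in train if antecedent <= x]
    rest = [y for x, y in train if not antecedent <= x]
    if not match or not rest:
        return 0.0
    n = len(labels)
    split = (len(match) / n) * entropy(match) + (len(rest) / n) * entropy(rest)
    return entropy(labels) - split

def mine_cars(train, minsup=2, max_len=2):
    # Enumerate frequent antecedents; the consequent is the majority class.
    cars, seen = [], set()
    for x, _ in train:
        for k in range(1, max_len + 1):
            for items in combinations(sorted(x), k):
                if items in seen:
                    continue
                seen.add(items)
                matched = [y for xs, y in train if set(items) <= xs]
                if len(matched) >= minsup:
                    cars.append((frozenset(items),
                                 Counter(matched).most_common(1)[0][0]))
    # Rank CARs in decreasing order of information gain.
    cars.sort(key=lambda car: info_gain(train, car[0]), reverse=True)
    return cars

def eager_predict(cars, instance):
    # The class of the first (best-ranked) matching CAR is chosen.
    for antecedent, cls in cars:
        if antecedent <= instance:
            return cls
    return None

# Invented weather-style training data: (feature set, class).
train = [
    ({"outlook=sunny", "windy=false"}, "yes"),
    ({"outlook=sunny", "windy=true"}, "no"),
    ({"outlook=rain", "windy=false"}, "yes"),
    ({"outlook=rain", "windy=true"}, "no"),
]
cars = mine_cars(train)
print(eager_predict(cars, {"outlook=sunny", "windy=false"}))
```

Note that all CARs are mined up front from the full training set, which is what makes this the eager variant.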
Theorem 2. CARs perform no worse than decision tree rules, according to the information gain principle.

Proof 2. Given an instance to be classified and, without loss of generality, a decision tree with just pure leaves, the decision tree predicts class c for that instance. We analyze two scenarios: first, just one CAR matches the instance; second, more than one CAR matches. When just one CAR matches, it is the same as the decision tree rule, since the set of CARs subsumes the set of decision rules. In this case, the associative classifier and the decision tree make the same prediction. When more than one CAR matches an instance, the prediction may be either the same class (say c) as the matching decision rule or another class. If the associative classifier predicts c, then the two approaches are equivalent. In case a class other than c is predicted, by definition the best matching CAR provides a better information gain than the decision rule, and thus, according to the information gain principle, the CAR will make a better prediction.

Theorem 2 states that the additional CARs of the associative classifier that are not in the decision tree cannot degrade the classification accuracy. This is because an additional CAR is only used if it is better than all decision rules (according to the information gain principle). However, eager associative classifiers generate a large number of CARs, most of which are useless during classification. For instance, from the set of 13 CARs shown in Figure 4, only 3 match the test instance (the remaining 10 CARs are useless). Next, we present a lazy classifier and compare it to the eager version described in this section.

4.6 Lazy Associative Classifier

Unlike the eager associative classifier, which extracts a set of ranked CARs from the training data, the lazy associative classifier induces CARs specific to each test instance. The lazy approach projects the training data, D, only on those features in the test instance, A. From this projected training data, DA, the CARs are induced and ranked, and the best CAR is used.
From the set of all training instances, D, only the instances sharing at least one feature with the test instance A are used to form DA.

Then, a rule set CA is generated from DA. Since DA contains only features in A, all CARs generated from DA must match A. The lazy associative classifier is presented in Figure 5.

Let D be the set of all n training instances. Let T be the set of all m test instances.
1. for each ti ∈ T do
2.   let Dti be the projection of D on features only from ti
3.   let Cti be the set of all rules {X → c} mined from Dti
4.   sort Cti according to information gain
5.   pick the first rule {X → c} ∈ Cti, and predict class c

Figure 5. Lazy Associative Classifier

Now we demonstrate that the lazy associative classifier produces better results than its eager counterpart. Given a test instance A and a set of CARs C, we denote by CA those CARs {X → c} in C where X ⊆ A.

4.6.1 Any-label

This evaluation technique measures how many times any of the predicted labels of an instance matches the actual class label across all cases of that instance in the test data. If any of the predicted class labels of an instance d matches the true class label y, the classification is counted as correct.

4.6.2 Label-weight

This technique enables each predicted label for an instance to play a role in classifying a test case, based on its ranking, and it can therefore be considered a multi-label evaluation measure. An instance may belong to several class labels, each one associated with it by a number of occurrences in the training data. Each class label can be assigned a weight according to how many times that label has been associated with the instance. Let rule rj be associated with a list of ranked labels.
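The lazy classification procedure of Figure 5 can be sketched as follows (a simplified sketch with invented, set-valued training instances; antecedents are limited to single features, ranked by information gain on the projected data):

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def lazy_predict(train, instance):
    # Project D onto the test instance's features, keeping only training
    # instances that share at least one feature with it (this forms DA).
    projected = [(x & instance, y) for x, y in train if x & instance]
    labels = [y for _, y in projected]
    base = entropy(labels)
    best, best_gain = None, -1.0
    # Every single-feature antecedent mined from the projection matches
    # the test instance by construction; rank them by information gain.
    for feat in instance:
        match = [y for x, y in projected if feat in x]
        rest = [y for x, y in projected if feat not in x]
        if not match:
            continue
        n = len(labels)
        split = (len(match) / n) * entropy(match) + (len(rest) / n) * entropy(rest)
        gain = base - split
        if gain > best_gain:
            best_gain = gain
            best = Counter(match).most_common(1)[0][0]
    return best

# Invented training data: (feature set, class).
train = [
    ({"outlook=sunny", "humidity=high"}, "no"),
    ({"outlook=sunny", "humidity=normal"}, "yes"),
    ({"outlook=rain", "humidity=high"}, "yes"),
]
print(lazy_predict(train, {"outlook=sunny", "humidity=normal"}))
```

Because mining happens per test instance on the projected data, no rule that fails to match the instance is ever generated, which is exactly the advantage claimed over the eager version.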

We have conducted an extensive performance study to evaluate the accuracy and efficiency of CPAR and compare it with that of C4.5 [8], RIPPER [3], CBA [7] and CMAR [6]. As in [7] and [6], 26 datasets from the UCI Machine Learning Repository are used. All the experiments are performed on a 1.7 GHz Pentium-4 PC with 1 GB of main memory. All the approaches are implemented by their authors. The parameters of CPAR are set as follows: in the rule generation algorithm, the weight threshold is set to 0.05, min_gain to 0.7, and the decay factor to 2/3. The best 5 rules are used in prediction.

Table 1 shows the accuracy of the five approaches on the 26 datasets from the UCI ML Repository. 10-fold cross validation is used for every dataset. Table 2 compares the running (training) time of RIPPER, CMAR (which is claimed to be more efficient than CBA), and CPAR on the 26 datasets. Notice that Table 2 reports both the arithmetic and the geometric average. This is because the running times on different datasets differ greatly, and the arithmetic average is dominated by the most time-consuming datasets. Using the geometric average, equal weight is put on every dataset; thus we consider the geometric average the more reasonable measure. Table 3 shows the average number of rules used by RIPPER, CMAR and CPAR.
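The point about the two averages can be illustrated with Python's statistics module (the timing figures below are invented; one slow dataset dominates the arithmetic average but not the geometric one):

```python
from statistics import mean, geometric_mean

# Hypothetical per-dataset running times in seconds.
times = [0.01, 0.02, 0.05, 0.03, 12.0]

print("arithmetic:", mean(times))            # pulled up by the 12 s outlier
print("geometric:", geometric_mean(times))   # weights every dataset equally
```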

Dataset     C4.5    RIPPER   CBA     CMAR    CPAR
anneal      94.8    95.8     97.9    97.3    98.4
austral     84.7    87.3     84.9    86.1    86.2
auto        80.1    72.8     78.3    78.1    82.0
breast      95.0    95.1     96.3    96.4    96.0
cleve       78.2    82.2     82.8    82.2    81.5
crx         84.9    84.9     84.7    84.9    85.7
diabetes    74.2    74.7     74.5    75.8    75.1
german      72.3    69.8     73.4    74.9    73.4
glass       68.7    69.1     73.9    70.1    74.4
heart       80.8    80.7     81.9    82.2    82.6
hepatic     80.6    76.7     81.8    80.5    79.4
horse       82.6    84.8     82.1    82.6    84.2
hypo        99.2    98.9     98.9    98.4    98.1
iono        90.0    91.2     92.3    91.5    92.6
iris        95.3    94.0     94.7    94.0    94.7
labor       79.3    84.0     86.3    89.7    84.7
led7        73.5    69.7     71.9    72.5    73.6
lymph       73.5    79.0     77.8    83.1    82.3
pima        75.5    73.1     72.9    75.1    73.8
sick        98.5    97.7     97.0    97.5    96.8
sonar       70.2    78.4     77.5    79.4    79.3
tic-tac     99.4    98.0     99.6    99.2    98.6
vehicle     72.6    62.7     68.7    68.8    69.5
waveform    78.1    76.0     80.0    83.2    80.9
wine        92.7    91.6     95.0    95.0    95.5
zoo         92.2    88.1     96.8    97.1    95.1
Average     83.34   82.93    84.69   85.22   85.17

Table 1: Accuracy of C4.5, RIPPER, CBA, CMAR and CPAR

                     RIPPER   CMAR    CPAR
Arithmetic average   0.218    30.24   0.555
Geometric average    0.036    2.877   0.105

Table 2: Running time (in seconds) of RIPPER, CMAR and CPAR

                     RIPPER   CMAR   CPAR
Arithmetic average   8.20     305    244
Geometric average    5.74     185    106

Table 3: Number of rules used by RIPPER, CMAR and CPAR