A Data Mining Tutorial




A Data Mining Tutorial Presented at the Second IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN 98) 14 December 1998 Graham Williams, Markus Hegland and Stephen Roberts Copyright c 1998

ACSys Data Mining CRC for Advanced Computational Systems ANU, CSIRO, (Digital), Fujitsu, Sun, SGI Five programs: one is Data Mining Aim to work with collaborators to solve real problems and feed research problems to the scientists Brings together expertise in Machine Learning, Statistics, Numerical Algorithms, Databases, Virtual Environments 1

About Us Graham Williams, Senior Research Scientist with CSIRO Machine Learning Stephen Roberts, Fellow with Computer Sciences Lab, ANU Numerical Methods Markus Hegland, Fellow with Computer Sciences Lab, ANU Numerical Methods 2

Outline Data Mining Overview History Motivation Techniques for Data Mining Link Analysis: Association Rules Predictive Modeling: Classification Predictive Modeling: Regression Data Base Segmentation: Clustering 3

So What is Data Mining? The non-trivial extraction of novel, implicit, and actionable knowledge from large datasets. Extremely large datasets Discovery of the non-obvious Useful knowledge that can improve processes Cannot be done manually Technology to enable data exploration, data analysis, and data visualisation of very large databases at a high level of abstraction, without a specific hypothesis in mind. 4

And Where Has it Come From? Data Mining draws on: Machine Learning, Databases, Visualisation, High Performance Computers, Parallel Algorithms, Applied Statistics, Pattern Recognition 5

Knowledge Discovery in Databases A six or more step process: data warehousing, data selection, data preprocessing, data transformation, data mining, interpretation/evaluation Data Mining is sometimes referred to as KDD; DM and KDD tend to be used as synonyms 6

The KDD Treadmill 7

Techniques Used in Data Mining Link Analysis association rules, sequential patterns, time sequences Predictive Modelling tree induction, neural nets, regression Database Segmentation clustering, k-means, Deviation Detection visualisation, statistics 8

Typical Applications of Data Mining Source: IDC 1998 9

Typical Applications of Data Mining Sales/Marketing Provide better customer service Improve cross-selling opportunities (beer and nappies) Increase direct mail response rates Customer Retention Identify patterns of defection Predict likely defections Risk Assessment and Fraud Identify inappropriate or unusual behaviour 10

ACSys Data Mining Mt Stromlo Observatory NRMA Insurance Limited Australian Taxation Office Health Insurance Commission 11

Some Research Interestingness through Evolutionary Computation Virtual Environments Data Mining Standards Temporal Data Mining Spatial Data Mining Feature Selection 12

Outline Data Mining Overview History Motivation Techniques for Data Mining Link Analysis: Association Rules Predictive Modeling: Classification Predictive Modeling: Regression Data Base Segmentation: Clustering 13

Why Data Mining Now? Changes in the Business Environment Customers becoming more demanding Markets are saturated Drivers Focus on the customer, competition, and data assets Enablers Increased data hoarding Cheaper and faster hardware 14

The Growth in KDD Research Community KDD Workshops 1989, 1991, 1993, 1994 KDD Conference annually since 1995 KDD Journal since 1997 ACM SIGKDD http://www.acm.org/sigkdd Commercially Research: IBM, Amex, NAB, AT&T, HIC, NRMA Services: ACSys, IBM, MIP, NCR, Magnify Tools: TMC, IBM, ISL, SGI, SAS, Magnify 15

Outline Data Mining Overview History Motivation Techniques for Data Mining Link Analysis: Association Rules Predictive Modeling: Classification Predictive Modeling: Regression Data Base Segmentation: Clustering 16

The Scientist's Motivation The Real World Offers many challenging problems Enormous databases now exist and are readily available Statistics has been building models and doing analysis for years? Statistics is limited computationally Relevance of statistics if we do not sample? There are not enough statisticians to go around! Machine Learning to build models? Limited computationally, useful on toy problems, but... 17

Motivation: The Sizes Databases today are huge: More than 1,000,000 entities/records/rows From 10 to 10,000 fields/attributes/variables Gigabytes and terabytes Databases are growing at an unprecedented rate The corporate world is a cut-throat world Decisions must be made rapidly Decisions must be made with maximum knowledge 18

Motivation for doing Data Mining Investment in Data Collection/Data Warehouse Add value to the data holding Competitive advantage More effective decision making OLTP => Data Warehouse => Decision Support Work to add value to the data holding Support high level and long term decision making Fundamental move in use of Databases 19

Another Angle: The Personal Data Miner The Microsoft Challenge Information overload Internet navigation Intelligent Internet catalogues 20

Outline Data Mining Overview History Motivation Techniques for Data Mining Link Analysis: Association Rules Predictive Modeling: Classification Predictive Modeling: Regression Data Base Segmentation: Clustering 21

Data Mining Operations Link Analysis links between individuals rather than characterising the whole Predictive Modelling (supervised learning) use observations to learn to predict Database Segmentation (unsupervised learning) partition data into similar groups 22

Outline Data Mining Overview History Motivation Techniques for Data Mining Link Analysis: Association Rules Predictive Modeling: Classification Predictive Modeling: Regression Data Base Segmentation: Clustering 23

Link Analysis: Association Rules A technique developed specifically for data mining Given A dataset of customer transactions A transaction is a collection of items Find Correlations between items as rules Examples Supermarket baskets Attached mailing in direct marketing 24

Determining Interesting Association Rules Rules have confidence and support IF x and y THEN z with confidence c if x and y are in the basket, then so is z in c% of cases IF x and y THEN z with support s the rule holds in s% of all transactions 25

Example
Transaction  Items
12345        A B C
12346        A C
12347        A D
12348        B E F
Input Parameters: confidence = 50%; support = 50%
if A then C: c = 66.6% s = 50%
if C then A: c = 100% s = 50% 26
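The figures on this slide can be checked with a short sketch (Python used here for illustration; the helper names `support` and `confidence` are not from the tutorial):

```python
# Recompute the slide's confidence and support figures.
transactions = {
    12345: {"A", "B", "C"},
    12346: {"A", "C"},
    12347: {"A", "D"},
    12348: {"B", "E", "F"},
}

def support(itemset):
    """Fraction of all transactions containing every item in itemset."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(lhs, rhs):
    """Of the transactions containing lhs, the fraction that also contain rhs."""
    return support(lhs | rhs) / support(lhs)

print(f"if A then C: c = {confidence({'A'}, {'C'}):.1%}  s = {support({'A', 'C'}):.0%}")
print(f"if C then A: c = {confidence({'C'}, {'A'}):.1%}  s = {support({'A', 'C'}):.0%}")
```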

Typical Application Hundreds of thousands of different items Millions of transactions Many gigabytes of data It is a large task, but linear algorithms exist 27

Itemsets are Basis of Algorithm
Transaction  Items
12345        A B C
12346        A C
12347        A D
12348        B E F
Itemset  Support
A        75%
B        50%
C        50%
A, C     50%
Rule A => C: s = s(A, C) = 50% c = s(A, C)/s(A) = 66.6% 28

Algorithm Outline Find all large itemsets sets of items with at least minimum support Apriori and AprioriTid and newer algorithms Generate rules from large itemsets For ABCD and AB in the large itemsets the rule AB => CD holds if the ratio s(ABCD)/s(AB) is large enough This ratio is the confidence of the rule 29
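The rule-generation step described above can be sketched as follows (an illustration, not the ACSys implementation; the `supports` table and function names are assumptions):

```python
from itertools import combinations

# supports maps large itemsets (frozensets) to their support fraction, as
# found by the first phase; gen_rules emits every rule lhs => rhs whose
# confidence s(lhs + rhs) / s(lhs) reaches minconf.
def gen_rules(supports, minconf):
    for itemset, s in supports.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                conf = s / supports[lhs]
                if conf >= minconf:
                    yield lhs, itemset - lhs, conf

supports = {frozenset("A"): 0.75, frozenset("C"): 0.50, frozenset("AC"): 0.50}
for lhs, rhs, conf in gen_rules(supports, 0.5):
    print(set(lhs), "=>", set(rhs), f"conf = {conf:.1%}")
```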

HIC Example Associations on episode database for pathology services 6.8 million records X 120 attributes (3.5GB) 15 months preprocessing then 2 weeks data mining Goal: find associations between tests cmin = 50% and smin = 1%, 0.5%, 0.25% (1% of 6.8 million = 68,000) Unexpected/unnecessary combination of services Refuse cover saves $550,000 per year 30

Pseudo Algorithm
(1) F_1 = { frequent 1-itemsets }
(2) for (k = 2; F_{k-1} != {}; k++) do begin
(3)   C_k = apriori_gen(F_{k-1})
(4)   for all transactions t in T
(5)     subset(C_k, t)
(6)   F_k = { c in C_k | c.count >= minsup }
(7) end
(8) Answer = union of all F_k 31
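The pseudocode above can be turned into a compact runnable sketch (illustrative only; `apriori_gen` is folded inline, and the candidate-counting `subset` step is done with direct set containment rather than a hash tree):

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return every itemset appearing in at least minsup transactions."""
    items = sorted({i for t in transactions for i in t})
    frequent = {}                              # itemset -> support count
    # (1) F_1: frequent 1-itemsets
    fk = [frozenset([i]) for i in items
          if sum(i in t for t in transactions) >= minsup]
    k = 2
    while fk:                                  # (2) loop while F_{k-1} non-empty
        for s in fk:
            frequent[s] = sum(s <= t for t in transactions)
        # (3) apriori_gen: join F_{k-1} with itself, prune candidates
        #     that have any infrequent (k-1)-subset
        prev = set(fk)
        candidates = {a | b for a in fk for b in fk if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev
                             for s in combinations(c, k - 1))}
        # (4)-(6) count candidates in the transactions, keep those >= minsup
        fk = [c for c in candidates
              if sum(c <= t for t in transactions) >= minsup]
        k += 1
    return frequent                            # (8) union of all F_k

baskets = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
print(apriori(baskets, 2))
```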

Parallel Algorithm: Count Distribution Algorithm Each processor works (and stores) complete set of Candidates and produces local support counts for local transactions Global support counts obtained via a global reduction operation Good scalability when working with small numbers of candidates (large support), but unable to deal with large number of candidates (small support). [Agrawal & Shafer 96] 32

Parallel Algorithm: Data Distribution Algorithm Each processor computes support counts for only |C_k|/p candidates. Need to move transaction data between processors via all-to-all communication Able to deal with large numbers of candidates, but speedups not as good as Count Distribution Algorithm for large transaction data size [Agrawal & Shafer 96] 33

Improved Parallel Algorithm: Intelligent Data Distribution Uses more efficient inter-processor communication scheme: point to point Switches to Count Distribution when the total number of candidate itemsets falls below a given threshold The candidate itemsets are distributed among the processors so that each processor gets itemsets that begin only with a subset of all possible items [Han, Karypis & Kumar 97] 34

Improved Parallel Algorithm: Hybrid Algorithm Combines Count Distribution Algorithm and the Intelligent Data Distribution Algorithm Data divided evenly between processors Processors divided into groups In each group Intelligent Data Distribution Algorithm is run Each group supplies local support counts, ala the Count Distribution Algorithm [Han, Karypis & Kumar 97] 35

Outline Data Mining Overview History Motivation Techniques for Data Mining Link Analysis: Association Rules Predictive Modeling: Classification Predictive Modeling: Regression Data Base Segmentation: Clustering 36

Predictive Modelling: Classification Goal of classification is to build structures from examples of past decisions that can be used to make decisions for unseen cases. Often referred to as supervised learning. Decision Tree and Rule induction are popular techniques Neural Networks also used 37

Classification: C5.0 Quinlan: ID3 => C4.5 => C5.0 Most widely used Machine Learning and Data Mining tool Started as Decision Tree Induction, now Rule Induction, also Available from http://www.rulequest.com/ Many publicly available alternatives CART developed by Breiman et al. (Stanford) Salford Systems http://www.salford-systems.com 38

Decision Tree Induction Decision tree induction is an example of a recursive partitioning algorithm Basic motivation: A dataset contains a certain amount of information A random dataset has high entropy Work towards reducing the amount of entropy in the data Alternatively, increase the amount of information exhibited by the data 39

Algorithm 40

Algorithm Construct set of candidate partitions S Select best S* in S Describe each cell C_i in S* Test termination condition on each C_i true: form a leaf node false: recurse with C_i as new training set 41

Discriminating Descriptions Typical algorithm considers a single attribute at a time: categorical attributes define a disjoint cell for each possible value: sex = male or can be grouped: transport in {car, bike} continuous attributes define many possible binary partitions: split A < 24 and A >= 24, or split A < 28 and A >= 28 42

Information Measure Estimate the gain in information from a particular partitioning of the dataset A decision tree produces a message which is the decision The information content is -sum_{j=1}^{m} p_j log(p_j) p_j is the probability of making a particular decision there are m possible decisions Same as entropy: sum_{j=1}^{m} p_j log(1/p_j) 43

Information Measure info(T) = -sum_{j=1}^{m} p_j log(p_j) is the amount of information needed to identify the class of an object in T Maximised when all p_j are equal Minimised (0) when all but one p_j are 0 (the remaining p_j is 1) Now partition the data into n cells Expected information requirement is then the weighted sum: info_x(T) = sum_{i=1}^{n} (|T_i| / |T|) info(T_i) 44

Information Measure The information that is gained by partitioning T is then: gain(A) = info(T) - info_x(T) This gain criterion can then be used to select the partition which maximises information gain Variations of the Information Gain have been developed to avoid various biases: Gini Index of Diversity 45
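The gain criterion can be sketched numerically (an illustrative example, not from the slides; base-2 logs are used so information is measured in bits):

```python
from collections import Counter
from math import log2

def info(labels):
    """info(T) = -sum p_j log2(p_j), the entropy of the class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(labels, cells):
    """gain = info(T) - info_x(T), where info_x weights each cell by size."""
    n = len(labels)
    info_x = sum(len(cell) / n * info(cell) for cell in cells)
    return info(labels) - info_x

labels = ["Y", "Y", "N", "N"]
print(gain(labels, [["Y", "Y"], ["N", "N"]]))   # pure split recovers the full 1.0 bit
print(gain(labels, [["Y", "N"], ["Y", "N"]]))   # uninformative split gains 0.0
```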

End Result [Decision tree: test attribute A; A = b leads to a test on E (E < 63: Y, E >= 63: N); A = c: Y] 46

Types of Parallelism Inter-node Parallelism: multiple nodes processed at the same time Inter-Attribute-Evaluation Parallelism: where candidate attributes in a node are distributed among processors Intra-Attribute-Evaluation Parallelism: where the calculation for a single attribute is distributed between processors 47

Example: ScalParC Data Structures Attribute Lists: separate lists of all attributes, distributed across processors Node Table: Stores node information for each record id Count Matrices: stored for each attribute, for all nodes at a given level [Joshi, Karypis & Kumar 97] 48

Outline of ScalParC Algorithm (1) Sort Continuous Attributes (2) do while (there are nodes requiring splitting at current level) (3) Compute count matrices (4) Compute best index for nodes requiring split (5) Partition splitting attributes and update node table (6) Partition non-splitting attributes (7) end do 49

Pruning We may be able to build a decision tree which perfectly reflects the data But the tree may not be generally applicable called overfitting Pruning is a technique for simplifying and hence generalising a decision tree 50

Error-Based Pruning Replace sub-trees with leaves Decision class is the majority Pruning based on predicted error rates prune subtrees which result in lower predicted error rate 51

Pruning How to estimate error? Use a separate test set: Error rate on training set (resubstitution error) not useful because pruning will always increase error Two common techniques are cost-complexity pruning and reduced-error pruning Cost Complexity Pruning: Predicted error rate modelled as weighted sum of complexity and error on training set; the test cases are used to determine the weighting Reduced Error Pruning: Use test set to assess error rate directly 52

Issues Unknown attribute values Run out of attributes to split on Brittleness of method small perturbations of data lead to significant changes in decision trees Trees become too large and are no longer particularly understandable (thousands of nodes) Data Mining: Accuracy, alone, is not so important 53

Classification Rules A tree can be converted to a rule set by traversing each path [Tree: A = b leads to a test on E; A = c: Y; E < 63: Y, E >= 63: N] A = c => Y A = b and E < 63 => Y A = b and E >= 63 => N Rule Pruning: Perhaps E >= 63 => N 54
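The path-traversal idea can be sketched with a hypothetical nested-tuple tree representation (not the tutorial's data structure), using the example tree above:

```python
# Internal nodes are (attribute, {test: subtree}); leaves are class labels.
tree = ("A", {"= b": ("E", {"< 63": "Y", ">= 63": "N"}),
              "= c": "Y"})

def to_rules(node, conditions=()):
    """Yield one (conditions, class) rule per root-to-leaf path."""
    if isinstance(node, str):                  # leaf: emit accumulated path
        yield conditions, node
        return
    attribute, branches = node
    for test, child in branches.items():
        yield from to_rules(child, conditions + (f"{attribute} {test}",))

for conds, cls in to_rules(tree):
    print(" and ".join(conds), "=>", cls)
```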

Pros and Cons of Decision Tree Induction Pros Greedy Search = Fast Execution High dimensionality not a problem Selects important variables Creates symbolic descriptions Cons Search space is huge Interaction terms not considered Only axis-parallel tests (A = v) 55

Recent Research Bagging Sample with replacement from training set Build multiple decision trees from different samples Use a voting method to classify new objects Boosting Build multiple trees from all training data Maintain a weight for each instance in the training set that reflects its importance Use a voting method to classify new objects 56
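The bagging procedure above can be sketched as follows (a schematic illustration; the trivial majority-class base learner stands in for a decision tree inducer, and all names are assumptions):

```python
import random
from collections import Counter

def bagging(train, fit, n_models=25, seed=0):
    """Fit one model per bootstrap sample; classify new cases by vote."""
    rng = random.Random(seed)
    models = [fit([rng.choice(train) for _ in train])  # sample with replacement
              for _ in range(n_models)]

    def predict(x):
        votes = Counter(model(x) for model in models)  # one vote per model
        return votes.most_common(1)[0][0]

    return predict

def majority_class(sample):
    """Trivial base learner: always predict the sample's majority label."""
    label = Counter(y for _, y in sample).most_common(1)[0][0]
    return lambda x: label

train = [(1, "Y"), (2, "Y"), (3, "Y"), (4, "Y"), (5, "N")]
predict = bagging(train, majority_class)
print(predict(2))
```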