Classification and Prediction

Similar documents
COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction

Data Mining for Knowledge Management. Classification

Reference Books. Data Mining. Supervised vs. Unsupervised Learning. Classification: Definition. Classification k-nearest neighbors

Professor Anita Wasilewska. Classification Lecture Notes

Data Mining: Foundation, Techniques and Applications

Classification and Prediction

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal

Data Mining Part 5. Prediction

Data Mining Classification: Decision Trees

Decision Trees from large Databases: SLIQ

Decision Trees. JERZY STEFANOWSKI Institute of Computing Science Poznań University of Technology. Doctoral School, Catania-Troina, April, 2008

Chapter 4 Data Mining A Short Introduction. 2006/7, Karl Aberer, EPFL-IC, Laboratoire de systèmes d'informations répartis Data Mining - 1

Comparative Analysis of Serial Decision Tree Classification Algorithms

Social Media Mining. Data Mining Essentials

Data mining techniques: decision trees

Data Mining: Concepts and Techniques

Performance Analysis of Decision Trees

Information Management course

A Data Mining Tutorial

Data Mining Algorithms Part 1. Dejan Sarka

Lecture 10: Regression Trees

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM

Decision-Tree Learning

Introduction to Data Mining

Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang

Optimization of C4.5 Decision Tree Algorithm for Data Mining Application

Data Mining with R. Decision Trees and Random Forests. Hugh Murrell

International Journal of World Research, Vol: I Issue XIII, December 2008, Print ISSN: X DATA MINING TECHNIQUES AND STOCK MARKET

Rule based Classification of BSE Stock Data with Data Mining

DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES

Data Mining Methods: Applications for Institutional Research

Clustering through Decision Tree Construction in Geology

(b) How data mining is different from knowledge discovery in databases (KDD)? Explain.

Efficient Integration of Data Mining Techniques in Database Management Systems

Data Mining Fundamentals

ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA

Dr. U. Devi Prasad Associate Professor Hyderabad Business School GITAM University, Hyderabad

Generalization and Decision Tree Induction: Efficient Classification in Data Mining

How To Solve The Kd Cup 2010 Challenge

Foundations of Artificial Intelligence. Introduction to Data Mining

CLOUDS: A Decision Tree Classifier for Large Datasets

BIRCH: An Efficient Data Clustering Method For Very Large Databases

A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE

Data Mining Techniques Chapter 6: Decision Trees

Volume 4, Issue 1, January 2016 International Journal of Advance Research in Computer Science and Management Studies

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

Introduction to Learning & Decision Trees

Decision Trees. Andrew W. Moore Professor School of Computer Science Carnegie Mellon University.

EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER. Copyr i g ht 2013, SAS Ins titut e Inc. All rights res er ve d.

Introduction to Data Mining and Machine Learning Techniques. Iza Moise, Evangelos Pournaras, Dirk Helbing

Comparison of Data Mining Techniques used for Financial Data Analysis

Data Mining based on Rough Set and Decision Tree Optimization

Data Mining Part 5. Prediction

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree

Data Mining. Session 7 Main Theme Classification and Prediction. Dr. Jean-Claude Franchitti

Web Document Clustering

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

A Survey of Classification Techniques in the Area of Big Data.

An Overview of Data Mining: Predictive Modeling for IR in the 21 st Century

Data Mining on Streams

PREDICTING STUDENTS PERFORMANCE USING ID3 AND C4.5 CLASSIFICATION ALGORITHMS

Classification On The Clouds Using MapReduce

Data Mining Practical Machine Learning Tools and Techniques

Artificial Neural Network Approach for Classification of Heart Disease Dataset

Machine Learning using MapReduce

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.

A Comparative Analysis of Classification Techniques on Categorical Data in Data Mining

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL

Building an Iris Plant Data Classifier Using Neural Network Associative Classification

Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation. Lecture Notes for Chapter 4. Introduction to Data Mining

Introduction to Spatial Data Mining

COURSE RECOMMENDER SYSTEM IN E-LEARNING

ANALYSIS OF FEATURE SELECTION WITH CLASSFICATION: BREAST CANCER DATASETS

Data Preprocessing. Week 2

The KDD Process: Applying Data Mining

Data Mining Project Report. Document Clustering. Meryem Uzun-Per

Clustering Via Decision Tree Construction

EFFICIENT CLASSIFICATION OF BIG DATA USING VFDT (VERY FAST DECISION TREE)

COMBINED METHODOLOGY of the CLASSIFICATION RULES for MEDICAL DATA-SETS

Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data

Chapter 20: Data Analysis

Classification algorithm in Data mining: An Overview

Data Mining Applications in Fund Raising

Static Data Mining Algorithm with Progressive Approach for Mining Knowledge

Classification Techniques (1)

Chapter 12 Discovering New Knowledge Data Mining

Data Mining. 1 Introduction 2 Data Mining methods. Alfred Holl Data Mining 1

PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce. Authors: B. Panda, J. S. Herbach, S. Basu, R. J. Bayardo.

Customer Classification And Prediction Based On Data Mining Technique

Knowledge Discovery and Data Mining

!"!!"#$$%&'()*+$(,%!"#$%$&'()*""%(+,'-*&./#-$&'(-&(0*".$#-$1"(2&."3$'45"

Rhodes University COMPUTER SCIENCE HONOURS PROJECT Literature Review. Data Mining with Oracle 10g using Clustering and Classification Algorithms

8. Machine Learning Applied Artificial Intelligence

Transcription:

Classification and Prediction Slides for Data Mining: Concepts and Techniques Chapter 7 Jiawei Han and Micheline Kamber Intelligent Database Systems Research Lab School of Computing Science Simon Fraser University, Canada http://www.cs.sfu.ca Han: KDD --- Classification 1

Classification A Two-Step Process Model construction: describing a set of predetermined classes Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute The set of tuples used for model construction: training set The model is represented as classification rules, decision trees, or mathematical formulae Model usage: for classifying future or unknown objects Estimate accuracy of the model The known label of test sample is compared with the classified result from the model Accuracy rate is the percentage of test set samples that are correctly classified by the model Test set is independent of training set, otherwise over-fitting will occur Han: KDD --- Classification 2

Classification Process (1): Model Construction Training Data Classification Algorithms NAME RANK YEARS TENURED Mike Assistant Prof 3 no Mary Assistant Prof 7 yes Bill Professor 2 yes Jim Associate Prof 7 yes Dave Assistant Prof 6 no Anne Associate Prof 3 no Classifier (Model) IF rank = professor OR years > 6 THEN tenured = yes Han: KDD --- Classification 3

Classification Process (2): Use the Model in Prediction Classifier Testing Data Unseen Data NAME RANK YEARS TENURED Tom Assistant Prof 2 no Merlisa Associate Prof 7 no George Professor 5 yes Joseph Assistant Prof 7 yes (Jeff, Professor, 4) Tenured? Han: KDD --- Classification 4

Supervised vs. Unsupervised Learning Supervised learning (e.g. classification) Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations New data is classified based on the training set Unsupervised learning (clustering) The class labels of training data is unknown Given a set of measurements, observations, etc. with the aim of establishing the existence of classes or clusters in the data Han: KDD --- Classification 5

Evaluating Classification Methods Predictive accuracy Speed and scalability time to construct the model time to use the model Robustness handling noise and missing values Scalability efficiency in disk-resident databases Interpretability: understanding and insight provded by the model Goodness of rules decision tree size compactness of classification rules Han: KDD --- Classification 6

Classification by Decision Tree Induction Decision tree A flow-chart-like tree structure Internal node denotes a test on an attribute Branch represents an outcome of the test Leaf nodes represent class labels or class distribution Decision tree generation consists of two phases Tree construction At start, all the training examples are at the root Partition examples recursively based on selected attributes Tree pruning Identify and remove branches that reflect noise or outliers Use of decision tree: Classifying an unknown sample Test the attribute values of the sample against the decision tree Han: KDD --- Classification 7

Training Dataset This follows an example from Quinlan s ID3 age income student credit_rating buys_computer <=30 high no fair no <=30 high no excellent no 30 40 high no fair yes >40 medium no fair yes >40 low yes fair yes >40 low yes excellent no 31 40 low yes excellent yes <=30 medium no fair no <=30 low yes fair yes >40 medium yes fair yes <=30 medium yes excellent yes 31 40 medium no excellent yes 31 40 high yes fair yes >40 medium no excellent no Han: KDD --- Classification 8

Output: A Decision Tree for buys_computer age? <=30 overcast 30..40 >40 student? yes credit rating? no yes fair excellent no yes no yes Han: KDD --- Classification 9

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm) Tree is constructed in a top-down recursive divide-and-conquer manner At start, all the training examples are at the root Attributes are categorical (if continuous-valued, they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain) Conditions for stopping partitioning All samples for a given node belong to the same class There are no remaining attributes for further partitioning majority voting is employed for classifying the leaf There are no samples left Han: KDD --- Classification 10

Attribute Selection Measure Information gain (ID3/C4.5) All attributes are assumed to be categorical Can be modified for continuous-valued attributes Gini index (IBM IntelligentMiner) All attributes are assumed continuous-valued Assume there exist several possible split values for each attribute May need other tools, such as clustering, to get the possible split values Can be modified for categorical attributes Han: KDD --- Classification 11

Gini Index (IBM IntelligentMiner) If a data set T contains examples from n classes, gini index, gini(t) is defined as n gini( T) = 1 p 2 j j= 1 where p j is the relative frequency of class j in T. If a data set T is split into two subsets T 1 and T 2 with sizes N 1 and N 2 respectively, the gini index of the split data contains examples from n classes, the gini index gini(t) is defined as ( ) N1 ( ) N 2 gini split T = gini T1 + gini( T 2) N N The attribute provides the smallest gini split (T) is chosen to split the node (need to enumerate all possible splitting points for each attribute). Han: KDD --- Classification 12

Avoid Overfitting in Classification The generated tree may overfit the training data Too many branches, some may reflect anomalies due to noise or outliers Result is in poor accuracy for unseen samples Two approaches to avoid overfitting Prepruning: Halt tree construction early do not split a node if this would result in the goodness measure falling below a threshold Difficult to choose an appropriate threshold Postpruning: Remove branches from a fully grown tree get a sequence of progressively pruned trees Use a set of data different from the training data to decide which is the best pruned tree Han: KDD --- Classification 13

Classification in Large Databases Classification a classical problem extensively studied by statisticians and machine learning researchers Scalability: Classifying data sets with millions of examples and hundreds of attributes with reasonable speed Why decision tree induction in data mining? relatively faster learning speed (than other classification methods) convertible to simple and easy to understand classification rules can use SQL queries for accessing databases comparable classification accuracy with other methods Han: KDD --- Classification 14

Scalable Decision Tree Induction Methods in Data Mining Studies SLIQ (EDBT 96 Mehta et al.) builds an index for each attribute and only class list and the current attribute list reside in memory SPRINT (VLDB 96 J. Shafer et al.) constructs an attribute list data structure PUBLIC (VLDB 98 Rastogi & Shim) integrates tree splitting and tree pruning: stop growing the tree earlier RainForest (VLDB 98 Gehrke, Ramakrishnan & Ganti) separates the scalability aspects from the criteria that determine the quality of the tree builds an AVC-list (attribute, value, class label) Han: KDD --- Classification 15