Data Mining: Foundation, Techniques and Applications
Lesson 1b: A Quick Overview of Data Mining
Li Cuiping (李翠平), School of Information, Renmin University of China
Anthony Tung (鄧锦浩), School of Computing, National University of Singapore
Why a quick overview?
Have a look at the general techniques before looking at the foundations:
- Machine Learning & Statistics
- Indexing
- Pre-Computation
It is easier to explain and see how these foundations support the various techniques.
Outline
Association Rules: the Apriori Algorithm
Clustering: Partitioning, Density-Based
Classification: Decision Trees
A General Framework for DM
Association Rule: Basic Concepts
Given: (1) a database of transactions, (2) each transaction is a list of items (purchased by a customer in a visit)
Find: all rules that correlate the presence of one set of items with that of another set of items
E.g., 98% of people who purchase tires and auto accessories also get automotive services done
Applications:
- Maintenance Agreement (what should the store do to boost Maintenance Agreement sales?)
- Home Electronics (what other products should the store stock up on?)
- Attached mailing in direct marketing
Rule Measures: Support and Confidence
(Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both.)
Find all rules X & Y => Z with minimum support and confidence:
- support, s: probability that a transaction contains {X, Y, Z}
- confidence, c: conditional probability that a transaction containing {X, Y} also contains Z

Transaction ID | Items Bought
2000 | A, B, C
1000 | A, C
4000 | A, D
5000 | B, E, F

With minimum support 50% and minimum confidence 50%, we have:
A => C (support 50%, confidence 66.6%)
C => A (support 50%, confidence 100%)
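The numbers on this slide can be reproduced with a short sketch (the transactions are copied from the table above; the function names are illustrative, not from any library):

```python
def support(db, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in db) / len(db)

def confidence(db, lhs, rhs):
    """Conditional probability that a transaction with lhs also has rhs."""
    return support(db, set(lhs) | set(rhs)) / support(db, lhs)

# The four transactions from the slide
db = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]

print(support(db, {"A", "C"}))       # 0.5  -> rule A => C has 50% support
print(confidence(db, {"A"}, {"C"}))  # 0.666... (66.6%)
print(confidence(db, {"C"}, {"A"}))  # 1.0 (100%)
```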
Mining Frequent Itemsets: the Key Step
Find the frequent itemsets: the sets of items that have minimum support.
A subset of a frequent itemset must also be a frequent itemset, i.e., if {a, b} is a frequent itemset, both {a} and {b} must be frequent itemsets.
Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets).
Use the frequent itemsets to generate association rules.
The Apriori Algorithm
Join Step: Ck is generated by joining Lk-1 with itself
Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
Pseudo-code:
  Ck: candidate itemsets of size k
  Lk: frequent itemsets of size k
  L1 = {frequent items};
  for (k = 1; Lk != empty; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
      increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
  end
  return the union of all Lk;
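As a concrete sketch, the pseudo-code above maps to the following Python (a straightforward in-memory version for illustration, not an optimized implementation; `apriori` returns every frequent itemset with its support count):

```python
from itertools import combinations

def apriori(db, min_support):
    """Level-wise search: returns {frozenset(itemset): support count}."""
    db = [frozenset(t) for t in db]
    # L1 candidates: all single items
    items = sorted({i for t in db for i in t})
    current = [frozenset([i]) for i in items]
    all_frequent = {}
    while current:
        # Count step: one pass over the database per level
        counts = {c: sum(c <= t for t in db) for c in current}
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        all_frequent.update(frequent)
        if not frequent:
            break
        # Join step: union pairs of frequent k-itemsets into (k+1)-candidates
        k = len(next(iter(frequent))) + 1
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent
        current = [c for c in candidates
                   if all(frozenset(s) in frequent
                          for s in combinations(c, k - 1))]
    return all_frequent

# The example database used later in these slides (min_sup = 2)
db = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]
freq = apriori(db, 2)
print(freq[frozenset({"b", "c", "e"})])  # 2
```

On this database the result matches the worked example later in the lesson: nine frequent itemsets, with {b,c,e} the only frequent 3-itemset.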
The Apriori Algorithm
Bottom-up, breadth-first search.
Only reads are performed on the database.
Candidates are stored in memory to simulate the lattice search.
Iteratively follow the two steps:
- generate candidates
- count and get the actual frequent itemsets
(Figure: lattice of all itemsets over {a, b, c, e}, searched bottom-up starting from {}.)
Candidate Generation and Pruning
Suppose all frequent (k-1)-itemsets are in Lk-1.
Step 1: Self-joining Lk-1
  insert into Ck
  select p.i1, p.i2, ..., p.ik-1, q.ik-1
  from Lk-1 p, Lk-1 q
  where p.i1 = q.i1, ..., p.ik-2 = q.ik-2, p.ik-1 < q.ik-1
Step 2: Pruning
  forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
      if (s is not in Lk-1) then delete c from Ck
(Figure: the steps illustrated on the lattice over {a, b, c, e}.)
Candidate Generation and Pruning (another example)
(Figure: the frequent 3-itemsets {a,b,c}, {a,b,e}, {a,c,e}, {b,c,e} joined to produce the candidate 4-itemset {a,b,c,e}.)
Counting Supports
Store candidate itemsets in a trie structure for support counting.

TID | Items
10 | a, b, c, e, f
20 | b, c, e
30 | b, f, g
40 | b, e
...

Candidate itemsets: {a,b,c}, {a,b,e}, {a,e,f}, {b,e,f}
(Figure: animation of transaction 10 (a, b, c, e, f) being matched against the trie, incrementing the counters of the candidates it contains.)
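The trie-based counting on this slide can be approximated with a nested dict; this is a simplified illustration of the idea, not the exact structure from the slides (items in each candidate and transaction are kept sorted so every candidate corresponds to exactly one path):

```python
def build_trie(candidates):
    """Store each candidate as a sorted path; '#' holds its counter."""
    root = {}
    for cand in candidates:
        node = root
        for item in sorted(cand):
            node = node.setdefault(item, {})
        node["#"] = 0
    return root

def count_transaction(node, items, start=0):
    """Walk the trie along the sorted transaction, incrementing counters."""
    if "#" in node:
        node["#"] += 1
    for i in range(start, len(items)):
        child = node.get(items[i])
        if child is not None:
            count_transaction(child, items, i + 1)

candidates = [{"a", "b", "c"}, {"a", "b", "e"}, {"a", "e", "f"}, {"b", "e", "f"}]
trie = build_trie(candidates)
for t in [{"a", "b", "c", "e", "f"}, {"b", "c", "e"}, {"b", "f", "g"}, {"b", "e"}]:
    count_transaction(trie, sorted(t))
print(trie["a"]["b"]["c"]["#"])  # 1: {a,b,c} occurs only in transaction 10
```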
Rule Generation
Since the supports of all frequent itemsets are known, it is possible to derive all rules that satisfy the MINCONF threshold by making use of the supports already computed.
E.g., if supp({a,b,c,d}) = 20 and supp({a,b}) = 50, then the confidence of the rule {a,b} => {c,d} is 20/50 = 40%.
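This derivation needs no further database scans; a sketch, given a dict mapping each frequent itemset to its support count (function and variable names are illustrative):

```python
from itertools import combinations

def generate_rules(freq, min_conf):
    """Emit (lhs, rhs, confidence) for every rule meeting min_conf."""
    rules = []
    for itemset, supp in freq.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(sorted(itemset), r)):
                conf = supp / freq[lhs]   # supp(X u Y) / supp(X)
                if conf >= min_conf:
                    rules.append((set(lhs), set(itemset - lhs), conf))
    return rules

# Supports (as counts) taken from the worked Apriori example in this lesson
freq = {frozenset({"a"}): 2, frozenset({"c"}): 3, frozenset({"a", "c"}): 2}
for lhs, rhs, conf in generate_rules(freq, 0.5):
    print(lhs, "=>", rhs, round(conf, 2))
```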
Apriori Algorithm
A level-wise, candidate-generation-and-test approach (Agrawal & Srikant 1994)

Database D (min_sup = 2):
TID | Items
10 | a, c, d
20 | b, c, e
30 | a, b, c, e
40 | b, e

Scan D, count 1-candidates: a:2, b:3, c:3, d:1, e:3
Frequent 1-itemsets: a:2, b:3, c:3, e:3
2-candidates: ab, ac, ae, bc, be, ce
Scan D, count: ab:1, ac:2, ae:1, bc:2, be:3, ce:2
Frequent 2-itemsets: ac:2, bc:2, be:3, ce:2
3-candidates: bce
Scan D, count: bce:2
Frequent 3-itemsets: bce:2
Outline
Association Rules: the Apriori Algorithm
Clustering: Partitioning, Density-Based
Classification: Decision Trees
A General Framework for DM
What is Cluster Analysis?
Cluster: a collection of data objects
- Similar to one another within the same cluster
- Dissimilar to the objects in other clusters
Cluster analysis: grouping a set of data objects into clusters
Clustering is unsupervised classification: no predefined classes
Typical applications:
- As a stand-alone tool to get insight into data distribution
- As a preprocessing step for other algorithms
What Is Good Clustering?
A good clustering method will produce high-quality clusters with:
- high intra-class similarity
- low inter-class similarity
The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
The K-Means Clustering Method
Given k, the k-means algorithm is implemented in four steps:
1. Partition objects into k nonempty subsets.
2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., the mean point, of the cluster).
3. Assign each object to the cluster with the nearest seed point.
4. Go back to Step 2; stop when no more reassignments occur.
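The four steps can be sketched directly in Python (2-D points and squared Euclidean distance; a minimal illustration with random initial seeds, not a production implementation):

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def kmeans(points, k, max_iters=100, seed=0):
    # Step 1: pick k initial seed points (here: a random sample)
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignment = []
    for _ in range(max_iters):
        # Step 3: assign each object to the nearest seed point
        new_assignment = [min(range(k), key=lambda j: dist2(p, centroids[j]))
                          for p in points]
        if new_assignment == assignment:  # Step 4: stop when nothing moves
            break
        assignment = new_assignment
        # Step 2: recompute each centroid as the mean of its cluster
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = (sum(x for x, _ in members) / len(members),
                                sum(y for _, y in members) / len(members))
    return centroids, assignment

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
_, labels = kmeans(points, 2)
print(labels[0] == labels[1] and labels[3] == labels[4])  # True: two tight groups
```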
The K-Means Clustering Method: Example
(Figure: four snapshots of k-means iterations on a 2-D point set, showing points being reassigned and the centroids moving until convergence.)
Density-Based Clustering Methods
Clustering based on density (a local cluster criterion), such as density-connected points.
Major features:
- Discover clusters of arbitrary shape
- Handle noise
- One scan
- Need density parameters as termination condition
Several interesting studies:
- DBSCAN: Ester et al. (KDD '96)
- OPTICS: Ankerst et al. (SIGMOD '99)
- DENCLUE: Hinneburg & Keim (KDD '98)
- CLIQUE: Agrawal et al. (SIGMOD '98)
Density-Based Clustering: Background
Two parameters:
- Eps: maximum radius of the neighbourhood
- MinPts: minimum number of points in an Eps-neighbourhood of that point
NEps(p) = {q in D | dist(p, q) <= Eps}
Directly density-reachable: a point p is directly density-reachable from a point q wrt. Eps, MinPts if
1) p belongs to NEps(q), and
2) core point condition: |NEps(q)| >= MinPts
(Figure: p inside the Eps-neighbourhood of core point q; MinPts = 5, Eps = 1 cm.)
Density-Based Clustering: Background (II)
Density-reachable: a point p is density-reachable from a point q wrt. Eps, MinPts if there is a chain of points p1, ..., pn with p1 = q and pn = p such that pi+1 is directly density-reachable from pi.
(Figure: a chain of points from q through p1 to p.)
Density-connected: a point p is density-connected to a point q wrt. Eps, MinPts if there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts.
(Figure: points p and q, both density-reachable from o.)
DBSCAN: Density Based Spatial Clustering of Applications with Noise
Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points.
Discovers clusters of arbitrary shape in spatial databases with noise.
(Figure: core, border, and outlier points; Eps = 1 unit, MinPts = 5.)
DBSCAN: The Algorithm
Arbitrarily select a point p.
Retrieve all points density-reachable from p wrt. Eps and MinPts.
If p is a core point, a cluster is formed.
If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.
Continue the process until all of the points have been processed.
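A compact sketch of this procedure (brute-force neighbourhood queries for clarity; real implementations answer the Eps-range searches with a spatial index such as an R-tree):

```python
import math

def dbscan(points, eps, min_pts):
    """Return a cluster label per point; -1 marks noise/outliers."""
    def neighbours(i):
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]
    labels = [None] * len(points)
    cid = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:   # not a core point: mark as noise (it may
            labels[i] = -1         # be re-claimed later as a border point)
            continue
        cid += 1                   # i is a core point: grow a new cluster
        labels[i] = cid
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:    # previously-noise point is density-reachable:
                labels[j] = cid    # claim it as a border point
            if labels[j] is not None:
                continue
            labels[j] = cid
            nj = neighbours(j)
            if len(nj) >= min_pts: # j is also core: expand the cluster through it
                queue.extend(nj)
    return labels

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5),
       (10, 10), (10, 11), (11, 10), (11, 11)]
print(dbscan(pts, eps=1.5, min_pts=3))  # [0, 0, 0, 0, -1, 1, 1, 1, 1]
```

Two square-shaped groups come out as clusters 0 and 1, while the isolated point (5, 5) is labelled noise.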
Outline
Association Rules: the Apriori Algorithm
Clustering: Partitioning, Density-Based
Classification: Decision Trees
A General Framework for DM
What is Classification?
Classification:
- predicts categorical class labels
- classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it to classify new data
Typical applications:
- credit approval
- target marketing
- medical diagnosis
- treatment effectiveness analysis
Classification: A Two-Step Process
Model construction: describing a set of predetermined classes
- Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
- The set of tuples used for model construction: the training set
- The model is represented as classification rules, decision trees, or mathematical formulae
Model usage: classifying future or unknown objects
- Estimate the accuracy of the model: the known label of each test sample is compared with the classified result from the model
- Accuracy rate: the percentage of test set samples that are correctly classified by the model
- The test set is independent of the training set, otherwise over-fitting will occur
Classification Process (1): Model Construction
The training data is fed to a classification algorithm, which produces a classifier (model).

NAME | RANK | YEARS | TENURED
Mike | Assistant Prof | 3 | no
Mary | Assistant Prof | 7 | yes
Bill | Professor | 2 | yes
Jim | Associate Prof | 7 | yes
Dave | Assistant Prof | 6 | no
Anne | Associate Prof | 3 | no

Resulting model: IF rank = professor OR years > 6 THEN tenured = yes
Classification Process (2): Use the Model in Prediction

Testing data:
NAME | RANK | YEARS | TENURED
Tom | Assistant Prof | 2 | no
Merlisa | Associate Prof | 7 | no
George | Professor | 5 | yes
Joseph | Assistant Prof | 7 | yes

Unseen data: (Jeff, Professor, 4). Tenured?
Supervised vs. Unsupervised Learning
Supervised learning (classification):
- Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
- New data is classified based on the training set
Unsupervised learning (clustering):
- The class labels of the training data are unknown
- Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Decision Tree for PlayTennis
Outlook?
- Sunny -> Humidity? (High -> No, Normal -> Yes)
- Overcast -> Yes
- Rain -> Wind? (Strong -> No, Weak -> Yes)
Decision Tree for PlayTennis
(Figure: the Outlook/Humidity part of the tree: Outlook at the root; the Sunny branch leads to a Humidity test, with High -> No and Normal -> Yes.)
- Each internal node tests an attribute
- Each branch corresponds to an attribute value
- Each leaf node assigns a classification
Training Dataset
Outlook | Temp | Humid | Wind | PlayTennis
Sunny | Hot | High | Weak | No
Sunny | Hot | High | Strong | No
Overcast | Hot | High | Weak | Yes
Rain | Mild | High | Weak | Yes
Rain | Cool | Normal | Weak | Yes
Rain | Cool | Normal | Strong | No
Overcast | Cool | Normal | Strong | Yes
Sunny | Mild | High | Weak | No
Sunny | Cool | Normal | Weak | Yes
Rain | Mild | Normal | Weak | Yes
Sunny | Mild | Normal | Strong | Yes
Overcast | Mild | High | Strong | Yes
Overcast | Hot | Normal | Weak | Yes
Rain | Mild | High | Strong | No
Decision Tree for PlayTennis
Classifying a new example with the tree:
Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Weak -> PlayTennis = No
(Outlook = Sunny leads to the Humidity test; Humidity = High gives No.)
Outlook?
- Sunny -> Humidity? (High -> No, Normal -> Yes)
- Overcast -> Yes
- Rain -> Wind? (Strong -> No, Weak -> Yes)
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm):
- The tree is constructed in a top-down, recursive, divide-and-conquer manner
- At the start, all the training examples are at the root
- Attributes are categorical (if continuous-valued, they are discretized in advance)
- Examples are partitioned recursively based on selected attributes
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
Conditions for stopping partitioning:
- All samples for a given node belong to the same class
- There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf)
- There are no samples left
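This induction loop is essentially ID3; a minimal sketch with information gain as the selection measure (no pruning or continuous attributes), exercised on the PlayTennis data from the next slides:

```python
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    g = entropy(labels)
    for v in set(r[attr] for r in rows):
        sub = [l for r, l in zip(rows, labels) if r[attr] == v]
        g -= len(sub) / len(labels) * entropy(sub)
    return g

def id3(rows, labels, attrs):
    if len(set(labels)) == 1:          # all samples in one class -> leaf
        return labels[0]
    if not attrs:                      # no attributes left -> majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    tree = {best: {}}
    for v in set(r[best] for r in rows):
        pairs = [(r, l) for r, l in zip(rows, labels) if r[best] == v]
        sub_rows = [r for r, _ in pairs]
        sub_labels = [l for _, l in pairs]
        tree[best][v] = id3(sub_rows, sub_labels,
                            [a for a in attrs if a != best])
    return tree

cols = ["Outlook", "Temp", "Humid", "Wind"]
raw = ["Sunny Hot High Weak No", "Sunny Hot High Strong No",
       "Overcast Hot High Weak Yes", "Rain Mild High Weak Yes",
       "Rain Cool Normal Weak Yes", "Rain Cool Normal Strong No",
       "Overcast Cool Normal Strong Yes", "Sunny Mild High Weak No",
       "Sunny Cool Normal Weak Yes", "Rain Mild Normal Weak Yes",
       "Sunny Mild Normal Strong Yes", "Overcast Mild High Strong Yes",
       "Overcast Hot Normal Weak Yes", "Rain Mild High Strong No"]
rows = [dict(zip(cols, line.split()[:4])) for line in raw]
labels = [line.split()[4] for line in raw]
tree = id3(rows, labels, cols)
print(next(iter(tree)))  # Outlook: the highest-gain attribute at the root
```

The induced tree matches the PlayTennis tree shown earlier: Outlook at the root, with Humidity under Sunny and Wind under Rain.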
Which Attribute is Best?
Starting from S = [29+, 35-], two candidate splits:
A1 = ? : True -> [21+, 5-], False -> [8+, 30-]
A2 = ? : True -> [18+, 33-], False -> [11+, 2-]
Entropy
S is a sample of training examples
p+ is the proportion of positive examples
p- is the proportion of negative examples
Entropy measures the impurity of S:
Entropy(S) = -p+ log2(p+) - p- log2(p-)
Entropy
Entropy(S) = expected number of bits needed to encode the class (+ or -) of a randomly drawn member of S (under the optimal, shortest-length code)
Why? Information theory: an optimal-length code assigns -log2(p) bits to messages having probability p. So the expected number of bits to encode (+ or -) of a random member of S is:
Entropy(S) = -p+ log2(p+) - p- log2(p-)
Information Gain
Gain(S, A): expected reduction in entropy due to sorting S on attribute A
Gain(S, A) = Entropy(S) - sum over v in Values(A) of (|Sv| / |S|) * Entropy(Sv)
Entropy([29+, 35-]) = -(29/64) log2(29/64) - (35/64) log2(35/64) = 0.99
A1 = ? : True -> [21+, 5-], False -> [8+, 30-]
A2 = ? : True -> [18+, 33-], False -> [11+, 2-]
(Both splits start from S = [29+, 35-].)
Information Gain
For A1: Entropy([21+, 5-]) = 0.71, Entropy([8+, 30-]) = 0.74
Gain(S, A1) = Entropy(S) - (26/64)*Entropy([21+, 5-]) - (38/64)*Entropy([8+, 30-]) = 0.27
For A2: Entropy([18+, 33-]) = 0.94, Entropy([11+, 2-]) = 0.62
Gain(S, A2) = Entropy(S) - (51/64)*Entropy([18+, 33-]) - (13/64)*Entropy([11+, 2-]) = 0.12
A1 = ? : True -> [21+, 5-], False -> [8+, 30-]
A2 = ? : True -> [18+, 33-], False -> [11+, 2-]
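The arithmetic on this slide and the previous one can be verified directly (the helper names are illustrative):

```python
from math import log2

def entropy(pos, neg):
    """Entropy of a sample with pos positive and neg negative examples."""
    total = pos + neg
    return -sum(c / total * log2(c / total) for c in (pos, neg) if c)

def gain(parent, splits):
    """Information gain of splitting parent=(pos, neg) into child counts."""
    n = sum(p + q for p, q in splits)
    return entropy(*parent) - sum((p + q) / n * entropy(p, q)
                                  for p, q in splits)

print(round(entropy(29, 35), 2))                      # 0.99
print(round(gain((29, 35), [(21, 5), (8, 30)]), 2))   # Gain(S, A1) = 0.27
print(round(gain((29, 35), [(18, 33), (11, 2)]), 2))  # Gain(S, A2) = 0.12
```

Since Gain(S, A1) > Gain(S, A2), the greedy induction step prefers A1 here.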
Selecting the Next Attribute
S = [9+, 5-], E = 0.940
Split on Humidity: High -> [3+, 4-] (E = 0.985), Normal -> [6+, 1-] (E = 0.592)
Gain(S, Humidity) = 0.940 - (7/14)*0.985 - (7/14)*0.592 = 0.151
Split on Wind: Weak -> [6+, 2-] (E = 0.811), Strong -> [3+, 3-] (E = 1.0)
Gain(S, Wind) = 0.940 - (8/14)*0.811 - (6/14)*1.0 = 0.048
Selecting the Next Attribute
S = [9+, 5-], E = 0.940
Split on Outlook: Sunny -> [2+, 3-] (E = 0.971), Overcast -> [4+, 0-] (E = 0.0), Rain -> [3+, 2-] (E = 0.971)
Gain(S, Outlook) = 0.940 - (5/14)*0.971 - (4/14)*0.0 - (5/14)*0.971 = 0.247
ID3 Algorithm
Splitting [D1, D2, ..., D14] = [9+, 5-] on Outlook:
- Sunny -> Ssunny = [D1, D2, D8, D9, D11] = [2+, 3-] (further split needed)
- Overcast -> [D3, D7, D12, D13] = [4+, 0-] -> Yes
- Rain -> [D4, D5, D6, D10, D14] = [3+, 2-] (further split needed)
Gain(Ssunny, Humidity) = 0.970 - (3/5)*0.0 - (2/5)*0.0 = 0.970
Gain(Ssunny, Temp) = 0.970 - (2/5)*0.0 - (2/5)*1.0 - (1/5)*0.0 = 0.570
Gain(Ssunny, Wind) = 0.970 - (2/5)*1.0 - (3/5)*0.918 = 0.019
ID3 Algorithm
Final tree:
Outlook?
- Sunny -> Humidity? (High -> No [D1, D2], Normal -> Yes [D8, D9, D11])
- Overcast -> Yes [D3, D7, D12, D13]
- Rain -> Wind? (Strong -> No [D6, D14], Weak -> Yes [D4, D5, D10])
Avoid Overfitting in Classification
The generated tree may overfit the training data:
- Too many branches, some of which may reflect anomalies due to noise or outliers
- Results in poor accuracy for unseen samples
Two approaches to avoid overfitting:
- Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold (it is difficult to choose an appropriate threshold)
- Postpruning: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees; use a set of data different from the training data to decide which is the best pruned tree
Outline
Association Rules: the Apriori Algorithm
Clustering: Partitioning, Density-Based
Classification: Decision Trees
A General Framework for DM
A Generalized View of DM
1. The task the algorithm is used to address (e.g. classification, clustering, etc.)
2. The structure of the model or pattern we are fitting to the data (e.g. a linear regression model)
3. The score function used to judge the quality of the fitted models or patterns (e.g. accuracy, BIC, etc.)
4. The search or optimization method used to search over parameters and structures (e.g. steepest descent, MCMC, etc.)
5. The data management technique used for storing, indexing, and retrieving data (critical when the data are too large to reside in memory)
Can we fit what we learn into the framework?

Apriori:
- task: rule pattern discovery
- structure of the model or pattern: association rules
- search space: lattice of all possible combinations of items, size = 2^m
- score function: support, confidence
- search/optimization method: breadth-first with pruning
- data management technique: TBD

K-means:
- task: clustering
- structure of the model or pattern: clusters
- search space: choice of any k points as centers, size = infinity
- score function: square error
- search/optimization method: gradient descent
- data management technique: TBD

ID3:
- task: classification
- structure of the model or pattern: decision tree
- search space: all possible decision trees, size = potentially infinite
- score function: accuracy, information gain
- search/optimization method: greedy
- data management technique: TBD