DATA MINING 1DL360. Fall 2010. An introductory class in data mining.


1 DATA MINING 1DL360, Fall 2010. An introductory class in data mining. Erik Zeitler, UDBL, Dept. of IT, Uppsala University, Sweden.

2 Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation (Tan, Steinbach, Kumar ch. 4). Erik Zeitler, Department of Information Technology, Uppsala University, Uppsala, Sweden.

3 Definition of classification
- Given a collection of records (training set); each record contains a set of attributes, one of which is the class.
- Find a model for the class attribute as a function of the values of the other attributes.
- Goal: previously unseen records should be assigned a class as accurately as possible.
- A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

4 Illustrating Classification Task
Training set (used to learn the model):

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test set (the learned model is applied to assign the unknown class labels):

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

5 Examples of classification tasks
- Predict tumor cells as benign or malignant
- Classify credit card transactions as legitimate or fraudulent
- Classify secondary structures of protein as alpha-helix, beta-sheet, or random coil
- Categorize news stories as finance, weather, entertainment, sports, etc.

6 Classification techniques
- Nearest Neighbour methods
- Decision Tree methods
- Rule-based methods
- Neural Networks
- Naïve Bayes and Bayesian Belief Networks
- Support Vector Machines

7 Instance-based classifiers (ch 5.2)
(Figure: a set of stored cases with attributes Atr1 ... AtrN and class labels A, B, C; an unseen case has the same attributes but an unknown class.)
- Store the training records.
- Use the training records to predict the class label of unseen cases.

8 Instance-based classifiers
Examples:
- Rote-learner: memorizes the entire training data and performs classification only if the attributes of a record match one of the training examples exactly.
- Nearest neighbor: uses the k closest points (nearest neighbors) for performing classification.

9 knn classifier intuition
If you don't know what you are, look at the nearest ones around you: you are probably of the same kind.

10 Classifying you using knn
- Each one of you belongs to a group: F, STS, IT, Int Masters, Exchange, Other.
- Classify yourself using 1-NN and 3-NN: look at your k nearest neighbors!
- How do we select our distance measure?
- How do we decide which of 1-NN and 3-NN is best?

11 A basic knn classifier implementation
Input:
- A test point x
- The set of known points P
- Number of neighbors k
Output:
- Class belonging c
Implementation:
1. Find the set N of the k points in P that are nearest to x
2. Count the number of occurrences of each class in N
3. c = the class to which the most points in N belong
Complexity is O(size(P)) for each tuple to be classified; reduce it to O(log(size(P))) by using database indexes.
Rule of thumb: k ≈ sqrt(q), q being the number of known points. Commercial algorithms use a default of 10.
Open questions: which distance function? How to break ties?
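
The steps above map directly onto a few lines of Python; this is a minimal sketch of the idea (my own illustration, not the course's code; the names euclidean and knn_classify are made up):

import math
from collections import Counter

def euclidean(p, q):
    # Euclidean distance between two equally long numeric tuples
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_classify(x, P, k, distance=euclidean):
    # 1. Find the k points in P nearest to the test point x.
    #    P is a list of (attributes, class_label) pairs.
    neighbors = sorted(P, key=lambda rec: distance(x, rec[0]))[:k]
    # 2. Count class occurrences among the neighbors.
    votes = Counter(label for _, label in neighbors)
    # 3. Return the majority class (ties broken by count order).
    return votes.most_common(1)[0][0]

# Example: two classes in the plane
P = [((0, 0), "A"), ((1, 0), "A"), ((5, 5), "B"), ((6, 5), "B")]
print(knn_classify((0.5, 0.2), P, k=3))  # -> "A"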

12 More knn classifier intuition
- If it walks and sounds like a duck, then it must be a duck.
- If it walks and sounds like a cow, then it must be a cow.

13 Walking and talking
- Assume that a duck has a step length of 5-15 cm and quacks at ... Hz.
- Assume that a cow has a step length of ... cm and moos at ... Hz.

14 Cows and ducks in a plot
Normalize each attribute, either:
- subtract the mean, divide by the stdev, or
- subtract the min, divide by (max - min).
(Figure: scatter plots of the cow/duck data; the two attributes have stdev 30 and stdev 300.)
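
The two normalizations mentioned above, as a short sketch (my own illustration with made-up step lengths):

import statistics

def zscore(xs):
    # Subtract the mean, divide by the standard deviation.
    mu, sd = statistics.mean(xs), statistics.pstdev(xs)
    return [(x - mu) / sd for x in xs]

def minmax(xs):
    # Subtract the min, divide by (max - min); result lies in [0, 1].
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

steps = [5, 10, 15, 45, 55, 60]  # cm: ducks, then cows (made-up values)
print(minmax(steps))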

15 Enter the chicken

16 Nearest-Neighbor Classifiers
Requires three things:
- The set of stored records
- A distance metric to compute the distance between records
- The value of k, the number of nearest neighbors to retrieve
To classify an unknown record:
- Compute the distance to the training records
- Identify the k nearest neighbors
- Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)

17 Definition of Nearest Neighbor
(Figure: panels (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor around a test point x.)
The k-nearest neighbors of a record x are the data points that have the k smallest distances to x.

18 Nearest Neighbor Classification
- Compute the distance between two points, e.g. the Minkowski distance (Euclidean for r = 2):
  \( d(p, q) = \left( \sum_i |p_i - q_i|^r \right)^{1/r} \)
- Determine the class from the nearest neighbor list: take the majority vote of class labels among the k nearest neighbors.
- Optionally weigh each vote according to distance, with weight factor w = 1/d².
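
A sketch of the Minkowski distance and the distance-weighted vote (my own illustration; the small epsilon is added only to avoid division by zero for exact matches):

from collections import defaultdict

def minkowski(p, q, r=2):
    # (sum_i |p_i - q_i|^r)^(1/r); r = 2 gives the Euclidean distance.
    return sum(abs(pi - qi) ** r for pi, qi in zip(p, q)) ** (1.0 / r)

def weighted_vote(neighbors):
    # neighbors: list of (distance, class_label); each vote weighs 1/d^2.
    scores = defaultdict(float)
    for d, label in neighbors:
        scores[label] += 1.0 / (d * d + 1e-12)
    return max(scores, key=scores.get)

print(minkowski((0, 0), (3, 4)))  # -> 5.0
print(weighted_vote([(1.0, "duck"), (2.0, "cow"), (2.5, "cow")]))  # -> "duck"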

19 Nearest Neighbor Classification
Choosing the value of k:
- If k is too small, the classifier is sensitive to noise points.
- If k is too large, the neighborhood may include points from other classes.

20 Nearest Neighbor Classification
Scaling issues: attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes. Example:
- height of a person may vary from 1.5 m to 1.8 m
- weight of a person may vary from 90 lb to 300 lb
- income of a person may vary from $10K to $1M

21 Nearest Neighbor Classification
Problem with the Euclidean measure:
- High-dimensional data: curse of dimensionality.
- Can produce counter-intuitive results, e.g. for high-dimensional 0/1 vectors, where very different vectors can end up at the same distance.
- Solution: normalize the vectors to unit length.

22 Nearest neighbor classification
- k-NN classifiers are lazy learners: they do not build models explicitly, unlike eager learners such as decision tree induction and rule-based systems.
- Classifying unknown records is relatively expensive: typically O(q), q being the database size; O(log(q)) if an index is utilized, but the index takes O(q log(q)) time to build (once and for all) and O(q) additional space.

23 Decision tree induction (ch 4.3)
Many algorithms:
- Hunt's algorithm (one of the earliest)
- CART
- ID3, C4.5
- SLIQ, SPRINT

24 Decision tree classification task
Same task as in slide 4: learn a model from the training set (Tid 1-10) and apply it to the test set (Tid 11-15); here the model is a decision tree.

25 Example of a decision tree
Training data (Refund and Marital Status are categorical, Taxable Income is continuous, Cheat is the class):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model (a decision tree; Refund, MarSt, and TaxInc are the splitting attributes):
- Refund = Yes -> NO
- Refund = No:
  - MarSt = Married -> NO
  - MarSt = Single or Divorced:
    - TaxInc < 80K -> NO
    - TaxInc >= 80K -> YES

26 Another example of decision tree
The same training data also fits this tree:
- MarSt = Married -> NO
- MarSt = Single or Divorced:
  - Refund = Yes -> NO
  - Refund = No:
    - TaxInc < 80K -> NO
    - TaxInc >= 80K -> YES
There could be more than one tree that fits the same data!

27 Decision tree classification task
As in slide 24: learn the decision tree from the training set (Tid 1-10) and apply it to classify the test records (Tid 11-15).

28 General structure of Hunt's algorithm
Let D_t be the set of training records that reach a node t. General procedure:
- If D_t contains records that all belong to the same class y_t, then t is a leaf node labeled as y_t.
- If D_t is an empty set, then t is a leaf node labeled by the default class y_d.
- If D_t contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset.
(The slide shows the Tid 1-10 training data from slide 25 next to a node t with its record set D_t.)
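
A compact Python sketch of this three-case recursion (my own simplification, not Hunt's original formulation; split_on is a hypothetical helper that partitions records on one attribute):

def hunt(records, split_fn, default_class):
    # records: list of (attribute_dict, label) pairs
    if not records:                        # case 2: empty set -> default leaf
        return default_class
    labels = [label for _, label in records]
    if len(set(labels)) == 1:              # case 1: pure node -> leaf
        return labels[0]
    attr, partitions = split_fn(records)   # case 3: split and recurse
    majority = max(set(labels), key=labels.count)
    if len(partitions) < 2:                # cannot split further -> majority leaf
        return majority
    return (attr, {value: hunt(subset, split_fn, majority)
                   for value, subset in partitions.items()})

def split_on(attr):
    # Returns a split function that partitions records by one attribute's value.
    def split_fn(records):
        parts = {}
        for rec in records:
            parts.setdefault(rec[0][attr], []).append(rec)
        return attr, parts
    return split_fn

data = [({"Refund": "Yes"}, "No"), ({"Refund": "No"}, "Yes"),
        ({"Refund": "No"}, "Yes")]
print(hunt(data, split_on("Refund"), default_class="No"))
# -> ('Refund', {'Yes': 'No', 'No': 'Yes'})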

29 Hunt's algorithm
The tree is grown in steps on the Tid 1-10 training data:
- Step 1: a single leaf: Don't Cheat.
- Step 2: split on Refund; Refund = Yes -> Don't Cheat, Refund = No -> Don't Cheat.
- Step 3: under Refund = No, split on Marital Status; Married -> Don't Cheat, Single or Divorced still mixed.
- Step 4: under Single or Divorced, split on Taxable Income; < 80K -> Don't Cheat, >= 80K -> Cheat.

30 Tree induction
Greedy strategy: split the records based on an attribute test that (locally) optimizes a certain criterion.
Issues:
- Determine how to split the records: how to specify the attribute test condition? How to determine the best split?
- Determine when to stop splitting.

31 Tree induction
Greedy strategy: split the records based on an attribute test that (locally) optimizes a certain criterion.
Issues:
- Determine how to split the records: how to specify the attribute test condition? How to determine the best split?
- Determine when to stop splitting.

32 How to specify test condition?
Depends on attribute types:
- Nominal (categories)
- Ordinal (ordered categories)
- Continuous (interval split)
Depends on the number of ways to split:
- 2-way split
- Multi-way split

33 Splitting based on nominal attributes
- Multi-way split: use as many partitions as distinct values, e.g. CarType into {Family}, {Sports}, {Luxury}.
- Binary split: divides values into two subsets; need to find the optimal partitioning, e.g. CarType into {Sports, Luxury} vs {Family}, or {Family, Luxury} vs {Sports}.

34 Splitting based on ordinal attributes
- Multi-way split: use as many partitions as distinct values, e.g. Size into {Small}, {Medium}, {Large}.
- Binary split: divides values into two subsets; need to find the optimal partitioning, e.g. Size into {Small, Medium} vs {Large}, or {Medium, Large} vs {Small}.
- What about the split {Small, Large} vs {Medium}? (It violates the ordering.)

35 Splitting based on continuous attributes
Different ways of handling:
- Discretization to form an ordinal categorical attribute:
  - Static: discretize once at the beginning.
  - Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering.
- Binary decision: (A < v) or (A >= v); consider all possible splits and find the best cut; can be more computationally intensive.

36 Splitting based on continuous attributes (Figure: examples of binary and multi-way splits on a continuous attribute.)

37 Questions
- Why divide by d² and not by d in vote weighting? In fact, there are different proposals out there: d^(1/2), d, d², ... A higher degree further suppresses votes from distant points. Why not compare some different distance weightings in the assignment!
- Tree-based classifiers: find the best split attribute; find the best attribute split boundary; meaningful ("interesting") splits.

38 Tree induction
Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
Issues:
- Determine how to split the records: how to specify the attribute test condition? How to determine the best split?
- Determine when to stop splitting.

39 How to find the best split
(Figure: a parent node with class counts C0: N00, C1: N01 and impurity M0 is split either by attribute test A? into nodes N1, N2 or by B? into nodes N3, N4; each child has its own class counts and impurity M1-M4, which combine into M12 and M34.)
Gain = M0 - M12 vs M0 - M34: choose the test with the larger gain.

40 How to determine the best split
Before splitting: 10 records of class 0 and 10 records of class 1.
Which test condition is the best? (Figure: several candidate attribute tests with the class counts they produce in the child nodes.)

41 How to determine the best split
- Greedy approach: child nodes with a homogeneous class distribution (maximum discrimination) are preferred.
- Need a measure of node impurity of a data set:
  - Non-homogeneous: high degree of impurity.
  - Homogeneous: low degree of impurity.

42 Measures of node impurity
- Gini index: \( GINI(t) = 1 - \sum_j p(j|t)^2 \)
- Entropy: \( Entropy(t) = -\sum_j p(j|t) \log p(j|t) \)
- Misclassification error: \( Error(t) = 1 - \max_i P(i|t) \)

43 Measure of impurity: GINI
Gini index for a given node t:
\( GINI(t) = 1 - \sum_j p(j|t)^2 \)
where p(j|t) is the relative frequency of class j at node t.
- Maximum (1 - 1/n_c) when records are equally distributed among all classes: least interesting information.
- Minimum (0.0) when all records belong to one class: most interesting information.
Examples (two classes, six records):
- C1: 0, C2: 6 -> Gini = 0.000
- C1: 1, C2: 5 -> Gini = 0.278
- C1: 2, C2: 4 -> Gini = 0.444
- C1: 3, C2: 3 -> Gini = 0.500

44 Examples for computing GINI
\( GINI(t) = 1 - \sum_j p(j|t)^2 \)
- C1: 0, C2: 6. P(C1) = 0/6 = 0, P(C2) = 6/6 = 1. Gini = 1 - P(C1)² - P(C2)² = 1 - 0 - 1 = 0.000
- C1: 1, C2: 5. P(C1) = 1/6, P(C2) = 5/6. Gini = 1 - (1/6)² - (5/6)² = 0.278
- C1: 2, C2: 4. P(C1) = 2/6, P(C2) = 4/6. Gini = 1 - (2/6)² - (4/6)² = 0.444
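
These computations are mechanical enough that a few lines of Python (my own sketch) reproduce them from the class counts:

def gini(counts):
    # counts: list of class counts at a node, e.g. [1, 5]
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([0, 6]))            # 0.0
print(round(gini([1, 5]), 3))  # 0.278
print(round(gini([2, 4]), 3))  # 0.444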

45 Splitting based on GINI
When a node p is split into k partitions (children), the quality of the split is computed as
\( GINI_{split} = \sum_{i=1}^{k} \frac{n_i}{n} \, GINI(i) \)
where n_i = number of records at child i and n = number of records at node p.

46 Binary attributes: computing GINI index
- Splits into two partitions.
- Effect of weighing: large and pure partitions are preferred.
Parent: C1 = 6, C2 = 6, Gini = 0.500. Split on B? into N1 (C1 = 5, C2 = 2) and N2 (C1 = 1, C2 = 4):
- Gini(N1) = 1 - (5/7)² - (2/7)² = 0.408
- Gini(N2) = 1 - (1/5)² - (4/5)² = 0.320
- Gini(children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371
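
The weighted combination from slide 45 follows the same pattern; a sketch reusing the gini helper above (my own illustration):

def gini_split(partitions):
    # partitions: list of per-child class-count lists, e.g. [[5, 2], [1, 4]]
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)

print(round(gini_split([[5, 2], [1, 4]]), 3))  # 0.371, as on slide 46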

47 Categorical attributes: computing Gini index
- For each distinct value, gather counts for each class in the dataset.
- Use the count matrix to make decisions.
Multi-way split: one partition per CarType value: Family, Sports, Luxury (the per-class counts were lost).
Two-way split (find the best partition of values):

Split                           C1     C2     Gini
{Sports, Luxury} vs {Family}    3 | 1  2 | 4  0.400
{Sports} vs {Family, Luxury}    2 | 2  1 | 5  0.419

48 Continuous attributes: computing Gini index
- Use binary decisions based on one value.
- Several choices for the splitting value v: the distinct values in the data are meaningful split points.
- Each splitting value has a count matrix associated with it: class counts in each of the partitions, A < v and A >= v.
- Simple method to choose the best v: for each v, scan the database to gather the count matrix and compute its Gini index. Computationally inefficient! Repetition of work.
(The slide again shows the Tid 1-10 training data with Refund, Marital Status, Taxable Income, and Cheat.)

49 Continuous attributes: computing Gini index...
For efficient computation, for each attribute:
- Sort the attribute on its values.
- Linearly scan these values, each time updating the count matrix and computing the Gini index.
- Choose the split position that has the least Gini index.
(Figure: the ten records sorted by Taxable Income, with class labels No No No Yes Yes Yes No No No No; each candidate split position between consecutive values has its Yes/No count matrix and Gini value.)
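
A sketch of this sort-and-scan idea (my own illustration, reusing gini_split from the earlier sketch; for clarity it recounts the matrix at every cut instead of updating it incrementally as the slide suggests):

def best_numeric_split(values, labels):
    # Sort records by attribute value, then try a midpoint cut between
    # each pair of consecutive values; return (gini, split value).
    pairs = sorted(zip(values, labels))
    classes = sorted(set(labels))
    best = None
    for i in range(1, len(pairs)):
        v = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for _, l in pairs[:i]]
        right = [l for _, l in pairs[i:]]
        counts = [[side.count(c) for c in classes] for side in (left, right)]
        score = gini_split(counts)
        if best is None or score < best[0]:
            best = (score, v)
    return best

income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_numeric_split(income, cheat))  # -> (0.3, 97.5): cut between 95 and 100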

50 Alternative splitting criteria based on INFO
Entropy at a given node t:
\( Entropy(t) = -\sum_j p(j|t) \log p(j|t) \)
where p(j|t) is the relative frequency of class j at node t. Measures the homogeneity of a node:
- Maximum (log n_c) when records are equally distributed among all classes: least interesting information.
- Minimum (0.0) when all records belong to one class: most interesting information.
Entropy-based computations are similar to the GINI index computations.

51 Examples for computing Entropy
\( Entropy(t) = -\sum_j p(j|t) \log_2 p(j|t) \)
- C1: 0, C2: 6. P(C1) = 0, P(C2) = 1. Entropy = -0 log₂ 0 - 1 log₂ 1 = 0 - 0 = 0
- C1: 1, C2: 5. P(C1) = 1/6, P(C2) = 5/6. Entropy = -(1/6) log₂(1/6) - (5/6) log₂(5/6) = 0.65
- C1: 2, C2: 4. P(C1) = 2/6, P(C2) = 4/6. Entropy = -(2/6) log₂(2/6) - (4/6) log₂(4/6) = 0.92
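
As with Gini, a small helper (my own sketch, using the convention 0 * log(0) = 0) reproduces the numbers:

import math

def entropy(counts):
    # counts: class counts at a node; uses the convention 0*log(0) = 0
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

print(entropy([0, 6]))            # 0.0
print(round(entropy([1, 5]), 2))  # 0.65
print(round(entropy([2, 4]), 2))  # 0.92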

52 Information Gain: splitting based on INFO...
\( GAIN_{split} = Entropy(p) - \sum_{i=1}^{k} \frac{n_i}{n} Entropy(i) \)
where the parent node p is split into k partitions and n_i is the number of records in partition i.
- Measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN).
- Used in ID3 and C4.5.
- Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.

53 Gain Ratio: splitting based on INFO...
\( GainRATIO_{split} = \frac{GAIN_{split}}{SplitINFO} \), with \( SplitINFO = -\sum_{i=1}^{k} \frac{n_i}{n} \log \frac{n_i}{n} \)
where the parent node p is split into k partitions and n_i is the number of records in partition i.
- Adjusts Information Gain by the entropy of the partitioning (SplitINFO): higher-entropy partitioning (a large number of small partitions) is penalized!
- Used in C4.5; designed to overcome the disadvantage of Information Gain.
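
Both criteria can be sketched on top of the entropy helper above (my own illustration; note that SplitINFO is just the entropy of the partition sizes):

def info_gain(parent_counts, partitions):
    # Entropy reduction of a split; partitions is a list of child count lists.
    n = sum(parent_counts)
    children = sum(sum(p) / n * entropy(p) for p in partitions)
    return entropy(parent_counts) - children

def gain_ratio(parent_counts, partitions):
    # Information gain normalized by the entropy of the partition sizes.
    split_info = entropy([sum(p) for p in partitions])  # -sum (n_i/n) log(n_i/n)
    return info_gain(parent_counts, partitions) / split_info

print(round(info_gain([6, 6], [[5, 2], [1, 4]]), 3))   # ~0.196
print(round(gain_ratio([6, 6], [[5, 2], [1, 4]]), 3))  # ~0.200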

54 Splitting criteria based on classification error
Classification error at a node t:
\( Error(t) = 1 - \max_i P(i|t) \)
Measures the misclassification error made by a node:
- Maximum (1 - 1/n_c) when records are equally distributed among all classes: least interesting information.
- Minimum (0.0) when all records belong to one class: most interesting information.

55 Examples for computing Error
\( Error(t) = 1 - \max_i P(i|t) \)
- C1: 0, C2: 6. P(C1) = 0, P(C2) = 1. Error = 1 - max(0, 1) = 1 - 1 = 0
- C1: 1, C2: 5. P(C1) = 1/6, P(C2) = 5/6. Error = 1 - max(1/6, 5/6) = 1 - 5/6 = 1/6
- C1: 2, C2: 4. P(C1) = 2/6, P(C2) = 4/6. Error = 1 - max(2/6, 4/6) = 1 - 4/6 = 1/3

56 Comparison among splitting criteria for a 2-class problem
(Figure: Gini, entropy, and misclassification error plotted against the fraction p of records in one class; all three vanish at p = 0 and p = 1 and peak at p = 0.5.)

57 Tree induction
Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
Issues:
- Determine how to split the records: how to specify the attribute test condition? How to determine the best split?
- Determine when to stop splitting.

58 Stopping criteria for tree induction
- Stop expanding a node when all the records belong to the same class.
- Stop expanding a node when all the records have similar attribute values.
- Early termination (e.g., when the resulting record set becomes too small).

59 Decision-tree-based classification
Advantages:
- Inexpensive to construct.
- Extremely fast at classifying unknown records.
- Easy to interpret for small-sized trees.
- Accuracy is comparable to other classification techniques for many simple data sets.

60 Model evaluation (ch 4.5)
- Metrics for performance evaluation: how to evaluate the performance of a model?
- Methods for performance evaluation: how to obtain reliable estimates?

61 Metrics for performance evaluation
Focus on the predictive capability of a model, rather than on how fast it classifies or builds models, scalability, etc.
Confusion matrix:

                     PREDICTED CLASS
                     Class=Yes  Class=No
ACTUAL   Class=Yes   a (TP)     b (FN)
CLASS    Class=No    c (FP)     d (TN)

a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)

62 Metrics for performance evaluation
Using the confusion matrix above, the most widely used metric is
\( Accuracy = \frac{a + d}{a + b + c + d} = \frac{TP + TN}{TP + TN + FP + FN} \)
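
A small sketch (my own) of the confusion-matrix bookkeeping behind the accuracy formula:

def confusion(actual, predicted, positive="Yes"):
    # Returns (TP, FN, FP, TN) for a binary problem.
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    return tp, fn, fp, tn

def accuracy(actual, predicted):
    tp, fn, fp, tn = confusion(actual, predicted)
    return (tp + tn) / (tp + tn + fp + fn)

y_true = ["Yes", "No", "Yes", "No", "No"]
y_pred = ["Yes", "No", "No", "No", "Yes"]
print(accuracy(y_true, y_pred))  # 3 correct out of 5 -> 0.6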

63 Limitation of Accuracy
- Consider a 2-class problem: number of Class 0 examples = 9990, number of Class 1 examples = 10.
- If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%.
- Accuracy is misleading because the model does not detect any class 1 example.

64 Cost Matrix
C(i|j): cost of misclassifying a class j example as class i.

                     PREDICTED CLASS
                     Class=Yes    Class=No
ACTUAL   Class=Yes   C(Yes|Yes)   C(No|Yes)
CLASS    Class=No    C(Yes|No)    C(No|No)

65 Errata
Slide 46 had wrong GINI computations when shown in class on Sep 13.
Was:
- Gini(N1) = 1 - (5/6)² - (2/6)² = 0.194
- Gini(N2) = 1 - (1/6)² - (4/6)² = 0.528
Should be:
- Gini(N1) = 1 - (5/7)² - (2/7)² = 0.408
- Gini(N2) = 1 - (1/5)² - (4/5)² = 0.320
\( GINI(t) = 1 - \sum_j p(j|t)^2 \)

66 Binary attributes: computing GINI index
(Slide 46 shown again with the corrected numbers.)
- Splits into two partitions; effect of weighing: large and pure partitions are preferred.
Parent: C1 = 6, C2 = 6, Gini = 0.500. Split on B? into N1 (C1 = 5, C2 = 2) and N2 (C1 = 1, C2 = 4):
- Gini(N1) = 1 - (5/7)² - (2/7)² = 0.408
- Gini(N2) = 1 - (1/5)² - (4/5)² = 0.320
- Gini(children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371

67 Computing Cost of Classification
(The slide shows a cost matrix C(i|j) and the confusion-count matrices of two models, M1 and M2, on the same test data; the numbers were lost.)
- Model M1: Accuracy = 80%, Cost = 3910
- Model M2: Accuracy = 90%, Cost = 4255
The more accurate model is not necessarily the cheaper one.

68 Cost vs Accuracy
Count matrix:

                     PREDICTED CLASS
                     Class=Yes  Class=No
ACTUAL   Class=Yes   a          b
CLASS    Class=No    c          d

Cost matrix:

                     PREDICTED CLASS
                     Class=Yes  Class=No
ACTUAL   Class=Yes   p          q
CLASS    Class=No    q          p

Accuracy is proportional to cost if
1. C(Yes|No) = C(No|Yes) = q
2. C(Yes|Yes) = C(No|No) = p
Proof:
N = a + b + c + d
Accuracy = (a + d)/N
Cost = p(a + d) + q(b + c)
     = p(a + d) + q(N - a - d)
     = qN - (q - p)(a + d)
     = N[q - (q - p) × Accuracy]

69 Alternative performance metrics
- Precision: p = a/(a + c), the proportion of correctly predicted Yes among all predicted Yes.
- Recall: r = a/(a + b), the proportion of correctly predicted Yes among all actual Yes.
- F-measure: F = 2rp/(r + p) = 2a/(2a + b + c), the harmonic mean of precision and recall.
- These generalize to a weighted accuracy:
  \( Weighted\ Accuracy = \frac{w_1 a + w_4 d}{w_1 a + w_2 b + w_3 c + w_4 d} \)

70 Alternative performance metrics
- Sensitivity (true positive rate): TPR = TP/(TP + FN)
- Specificity (true negative rate): TNR = TN/(TN + FP)
- False positive rate: FPR = FP/(TN + FP)
- False negative rate: FNR = FN/(TP + FN)
- Precision: p = TP/(TP + FP)
- Recall: r = TP/(TP + FN) (= TPR)
- F1 = 2rp/(r + p) = 2TP/(2TP + FP + FN)
- Precision is biased towards C(Yes|Yes) & C(Yes|No); recall is biased towards C(Yes|Yes) & C(No|Yes); the F-measure is biased towards all except C(No|No).
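
Continuing the confusion-matrix sketch from slide 62, precision, recall, and F1 from the counts (my own illustration):

def precision_recall_f1(tp, fn, fp):
    # Precision, recall, and their harmonic mean (F1).
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * r * p / (r + p)
    return p, r, f1

# e.g. TP = 8, FN = 2, FP = 4
print(precision_recall_f1(8, 2, 4))  # (0.666..., 0.8, 0.727...)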

71 Methods for Performance Evaluation
How to obtain a reliable estimate of performance? The performance of a model may depend on other factors besides the learning algorithm:
- Class distribution
- Cost of misclassification
- Size of training and test sets

72 Learning Curve
- A learning curve shows how accuracy changes with varying sample size.
- Requires a sampling scheme for creating the learning curve: arithmetic sampling (Langley et al., 1996) or geometric sampling (Provost et al., 1999).
- Effect of small sample size: bias in the estimate; variance of the estimate.

73 Methods of Estimation
- Holdout: reserve 2/3 for training and 1/3 for testing.
- Random subsampling: repeated holdout.
- Cross validation: partition the data into k disjoint subsets; k-fold: train on k-1 partitions, test on the remaining one; leave-one-out: k = n.
- Stratified sampling: partition the population (randomly, or based on attribute values); oversampling: increase the sampling fraction of rare sub-groups.
- Bootstrap: sampling with replacement.
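
A minimal sketch of the k-fold scheme listed above (my own illustration; train_and_classify stands in for any classifier, e.g. the knn_classify sketch from slide 11):

def k_fold_accuracy(records, k, train_and_classify):
    # records: list of (attributes, label); returns mean accuracy over k folds
    folds = [records[i::k] for i in range(k)]  # k disjoint subsets
    accs = []
    for i in range(k):
        test = folds[i]
        train = [r for j, f in enumerate(folds) if j != i for r in f]
        correct = sum(train_and_classify(train, x) == y for x, y in test)
        accs.append(correct / len(test))
    return sum(accs) / k

# Example, plugging in the earlier kNN sketch:
# acc = k_fold_accuracy(P, k=3,
#                       train_and_classify=lambda tr, x: knn_classify(x, tr, 3))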
