Classification & Decision Trees (Statistics 202: Data Mining). Based in part on slides from the textbook and slides of Susan Holmes. October 19.


1 Classification & Decision Trees. Based in part on slides from the textbook and slides of Susan Holmes. October 19.

2 Classification. Problem description: We are given a data matrix X with either continuous or discrete variables, such that each row X_i ∈ F, together with a set of labels Y_i ∈ L. For a k-class problem, #L = k and we can think of L = {1, ..., k}. Our goal is to find a classifier f : F → L that allows us to predict the label of a new observation given a new set of features.

3 Classification. A supervised problem: Classification is a supervised problem. Usually we use a subset of the data, the training set, to learn or estimate the classifier, yielding f̂ = f̂_training. The performance of f̂ is measured by applying it to each case in the test set and computing Σ_{j ∈ test} L(f̂_training(X_j), Y_j).
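
A minimal R sketch of this train/test protocol with 0-1 loss; it is not from the slides and assumes the rpart package (used later in the deck) with the iris data as a stand-in:

    library(rpart)

    set.seed(202)
    n     <- nrow(iris)
    train <- sample(seq_len(n), size = round(2 * n / 3))  # 2/3 training, 1/3 test
    test  <- setdiff(seq_len(n), train)

    fit  <- rpart(Species ~ ., data = iris[train, ], method = "class")
    pred <- predict(fit, newdata = iris[test, ], type = "class")

    # average 0-1 loss of f_hat_training over the test cases
    mean(pred != iris$Species[test])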

4 Classification

5 Classification. Examples of classification tasks: predicting whether a tumor is benign or malignant; classifying credit card transactions as fraudulent or legitimate; predicting the type of a given tumor among several types; categorizing a document or news story as one of {finance, weather, sports, etc.}.

6 Classification. Common techniques: decision-tree-based methods, rule-based methods, discriminant analysis, memory-based reasoning, neural networks, naïve Bayes, support vector machines.

7 Classification trees

8 Classification trees

9-15 Applying a decision tree rule (a sequence of figure slides stepping through the rule)

16 Decision boundary for tree

17 Decision tree for iris data using all features

18 Decision tree for iris data using petal.length, petal.width
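
One way to reproduce a tree like this, and the region plot on the next slide, is sketched below in R; this is my code rather than the course's, and the actual iris column names are Petal.Length and Petal.Width:

    library(rpart)

    fit2 <- rpart(Species ~ Petal.Length + Petal.Width, data = iris,
                  method = "class")
    print(fit2)   # the splits, in text form

    # predict over a grid to visualize the axis-parallel regions the tree induces
    grid <- expand.grid(
      Petal.Length = seq(min(iris$Petal.Length), max(iris$Petal.Length), length.out = 200),
      Petal.Width  = seq(min(iris$Petal.Width),  max(iris$Petal.Width),  length.out = 200))
    grid$class <- predict(fit2, newdata = grid, type = "class")

    plot(grid$Petal.Length, grid$Petal.Width, col = as.integer(grid$class),
         pch = 15, cex = 0.3, xlab = "Petal length", ylab = "Petal width")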

19 Regions in the petal.length, petal.width plane (figure; axes: Petal length vs Petal width)

20 Decision boundary for tree. Figure: trees have trouble capturing structure not parallel to the axes.

21 Learning the tree

22 Learning the tree. Hunt's algorithm (generic structure): Let D_t be the set of training records that reach a node t. If D_t contains records that all belong to the same class y_t, then t is a leaf node labeled as y_t. If D_t = ∅, then t is a leaf node labeled by the default class y_d. If D_t contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset. This splitting procedure is what varies across tree-learning algorithms. (A toy implementation is sketched below.)
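
The following is a toy R implementation of this recursion (mine, a deliberate simplification rather than what tree or rpart actually do): numeric features only, binary splits chosen by the largest decrease in Gini impurity (defined a few slides later), plus the stopping cases above and a minimum node size:

    gini <- function(y) 1 - sum((table(y) / length(y))^2)

    grow <- function(X, y, default = NA_character_, min_n = 5) {
      if (length(y) == 0)         return(list(leaf = TRUE, label = default))  # D_t empty: default class y_d
      if (length(unique(y)) == 1) return(list(leaf = TRUE, label = y[1]))     # all records in one class
      majority <- names(which.max(table(y)))
      if (length(y) < min_n)      return(list(leaf = TRUE, label = majority)) # extra size-based stop
      # attribute test: best binary split x <= s over all numeric columns, by Gini gain
      best <- list(gain = 0)
      for (j in seq_len(ncol(X))) for (s in unique(X[[j]])) {
        left <- X[[j]] <= s
        gain <- gini(y) - (sum(left) * gini(y[left]) + sum(!left) * gini(y[!left])) / length(y)
        if (gain > best$gain) best <- list(gain = gain, var = j, cut = s)
      }
      if (best$gain == 0) return(list(leaf = TRUE, label = majority))
      left <- X[[best$var]] <= best$cut
      list(leaf = FALSE, var = names(X)[best$var], cut = best$cut,
           yes = grow(X[left, , drop = FALSE],  y[left],  majority, min_n),
           no  = grow(X[!left, , drop = FALSE], y[!left], majority, min_n))
    }

    fit <- grow(iris[, 1:4], as.character(iris$Species))
    str(fit, max.level = 2)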

23 Learning the tree

24 Learning the tree. Issues. Greedy strategy: split the records based on an attribute test that optimizes a certain criterion. What is the best split: which criterion do we use? (The previous example chose first to split on Refund.) How do we split the records: binary or multi-way? (The previous example split Taxable Income at 80K.) When do we stop? Should we continue until each node is pure, if possible? (The previous example stopped with all nodes completely homogeneous.)

25 Different splits: ordinal / nominal. Figure: binary or multi-way?

26 Different splits: continuous. Figure: binary or multi-way?

27 Choosing a variable to split on. Figure: which variable should we start splitting on?

28 Learning the tree. Choosing the best split: we need some numerical criterion to choose among possible splits, and the criterion should favor homogeneous or pure nodes. Common cost functions: Gini index; entropy / deviance / information; misclassification error.

29 Choosing a variable to split on

30 Learning the tree. GINI index. Suppose we have k classes and node t has frequencies p_t = (p_{1,t}, ..., p_{k,t}). Criterion: GINI(t) = Σ_{(j,j') ∈ {1,...,k}^2 : j ≠ j'} p_{j,t} p_{j',t} = 1 - Σ_{j=1}^{k} p_{j,t}^2. It is maximized when p_{j,t} = 1/k for all j, with value 1 - 1/k, and minimized when all records belong to a single class. Minimizing GINI will favour pure nodes.
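
As a quick check of these two extremes, a small R helper of mine (not from the slides) for the formula above:

    # Gini index of a node from its class frequencies p_t (assumed to sum to 1)
    gini_index <- function(p) 1 - sum(p^2)

    gini_index(rep(1/3, 3))  # uniform over k = 3 classes: 1 - 1/3 (the maximum)
    gini_index(c(1, 0, 0))   # pure node: 0 (the minimum)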

31 Learning the tree. Gain in GINI index for a potential split. Suppose t is to be split into j new child nodes (t_l)_{1 ≤ l ≤ j}. Each child node has a count n_l and a vector of frequencies (p_{1,t_l}, ..., p_{k,t_l}), hence its own GINI index GINI(t_l). The gain in GINI index for this split is Gain(GINI, t → (t_l)_{1 ≤ l ≤ j}) = GINI(t) - (Σ_{l=1}^{j} n_l GINI(t_l)) / (Σ_{l=1}^{j} n_l). The greedy algorithm chooses the biggest gain in GINI index among a list of possible splits.
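
A direct R transcription of this gain (my helper, with made-up child frequencies for illustration):

    gini_gain <- function(p_parent, child_props, n) {
      gini  <- function(p) 1 - sum(p^2)
      child <- sum(n * apply(child_props, 1, gini)) / sum(n)  # weighted child impurity
      gini(p_parent) - child
    }

    # parent with frequencies (1/3, 1/3, 1/3) split into a pure child of 50 records
    # and a 100-record child with frequencies (0, 1/2, 1/2)
    gini_gain(rep(1/3, 3),
              child_props = rbind(c(1, 0, 0), c(0, 0.5, 0.5)),
              n = c(50, 100))   # gain = 2/3 - 1/3 = 1/3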

32 Decision tree for iris data using all features with GINI

33 Learning the tree. Entropy / deviance / information. Suppose we have k classes and node t has frequencies p_t = (p_{1,t}, ..., p_{k,t}). Criterion: H(t) = - Σ_{j=1}^{k} p_{j,t} log p_{j,t}. It is maximized when p_{j,t} = 1/k for all j, with value log k, and minimized when all records belong to a single class. Minimizing entropy will favour pure nodes.
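
A matching R helper for the entropy criterion (again mine), using the convention 0 · log 0 = 0:

    entropy <- function(p) { p <- p[p > 0]; -sum(p * log(p)) }

    entropy(rep(1/3, 3))  # uniform over k = 3 classes: log(3), the maximum
    entropy(c(1, 0, 0))   # pure node: 0, the minimum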

34 Decision tree for iris data using all features with Entropy
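
In rpart the splitting criterion is chosen through parms$split: "gini" is the default for classification and "information" gives the entropy/deviance criterion. A sketch (not the course's code) of fitting both to iris:

    library(rpart)

    fit_gini <- rpart(Species ~ ., data = iris, method = "class",
                      parms = list(split = "gini"))
    fit_info <- rpart(Species ~ ., data = iris, method = "class",
                      parms = list(split = "information"))
    print(fit_gini)
    print(fit_info)  # for iris the two criteria give very similar trees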

35 Learning the tree. Gain in entropy for a potential split. Suppose t is to be split into j new child nodes (t_l)_{1 ≤ l ≤ j}. Each child node has a count n_l and a vector of frequencies (p_{1,t_l}, ..., p_{k,t_l}), hence its own entropy H(t_l). The gain in entropy for this split is Gain(H, t → (t_l)_{1 ≤ l ≤ j}) = H(t) - (Σ_{l=1}^{j} n_l H(t_l)) / (Σ_{l=1}^{j} n_l). The greedy algorithm chooses the biggest gain in H among a list of possible splits.

36 Learning the tree. Misclassification error. Suppose we have k classes and node t has frequencies p_t = (p_{1,t}, ..., p_{k,t}). The mode is k̂(t) = argmax_j p_{j,t}. Criterion: Misclassification Error(t) = 1 - p_{k̂(t),t}. It is not smooth in p_t the way GINI and H are, so it can be more difficult to optimize numerically.

37 Different criteria: GINI, H, MC

38 Learning the tree. Misclassification error, example: suppose the parent node has 10 cases, {7D, 3R}. A candidate split produces two child nodes, {3D, 0R} and {4D, 3R}. The gain in MC is 0, but the gain in GINI is > 0; similarly, entropy also shows an improvement. (A numerical check follows.)
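
The R check below (mine) reproduces these numbers using the gain formula from the earlier slides:

    mc   <- function(p) 1 - max(p)
    gini <- function(p) 1 - sum(p^2)
    H    <- function(p) { p <- p[p > 0]; -sum(p * log(p)) }

    gain <- function(crit, parent, children, n)
      crit(parent) - sum(n * sapply(children, crit)) / sum(n)

    parent   <- c(D = 0.7, R = 0.3)           # {7D, 3R}
    children <- list(c(1, 0), c(4/7, 3/7))    # {3D, 0R} and {4D, 3R}
    n        <- c(3, 7)

    gain(mc,   parent, children, n)  # 0
    gain(gini, parent, children, n)  # about 0.077
    gain(H,    parent, children, n)  # about 0.13, also > 0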

39 Choosing the split for a continuous variable
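
For a continuous attribute, one simple brute-force approach is to evaluate the Gini gain at every midpoint between sorted distinct values and keep the best, as in this R sketch of mine (real implementations typically use a single sorted scan instead):

    gini <- function(y) 1 - sum((table(y) / length(y))^2)

    best_cut <- function(x, y) {
      v    <- sort(unique(x))
      cuts <- (head(v, -1) + tail(v, -1)) / 2   # midpoints between distinct values
      gains <- sapply(cuts, function(s) {
        left <- x <= s
        gini(y) - (sum(left) * gini(y[left]) + sum(!left) * gini(y[!left])) / length(y)
      })
      c(cut = cuts[which.max(gains)], gain = max(gains))
    }

    best_cut(iris$Petal.Width, iris$Species)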

40 Learning the tree. Stopping training. As trees get deeper, or if splits are multi-way, the number of data points per leaf node drops very quickly. Trees that are too deep tend to overfit the data. A common strategy is to prune the tree by removing some internal nodes.

41 Learning the tree. Figure: underfitting corresponds to the left-hand side, overfitting to the right.

42 Learning the tree. Cost-complexity pruning (tree library). Given a criterion Q like H or GINI, we define the cost-complexity of a tree T with terminal nodes (t_j)_{1 ≤ j ≤ m} as C_α(T) = Σ_{j=1}^{m} n_j Q(t_j) + α m. Given a large tree T_L we can compute C_α(T) for any subtree T of T_L, and the optimal tree is defined as T̂_α = argmin_{T ⊆ T_L} C_α(T). It can be found by weakest-link pruning; see The Elements of Statistical Learning for more.
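
rpart exposes the same idea through its complexity parameter cp, which (as I understand it) plays the role of α rescaled by the root-node error. A sketch of mine, not the course's code, that grows a large tree and prunes back to the subtree with the smallest cross-validated error:

    library(rpart)

    big <- rpart(Species ~ ., data = iris, method = "class",
                 control = rpart.control(cp = 0, minsplit = 2))  # deliberately overgrown
    printcp(big)   # the cost-complexity sequence, one row per candidate subtree

    best_cp <- big$cptable[which.min(big$cptable[, "xerror"]), "CP"]
    pruned  <- prune(big, cp = best_cp)
    pruned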

43 Learning the tree. Pre-pruning (rpart library). These methods stop the algorithm before it grows a fully-grown tree. Examples: stop if all instances belong to the same class (kind of obvious); stop if the number of instances is less than some user-specified threshold (both tree and rpart have rules like this); stop if the class distribution of the instances is independent of the available features (e.g., using a χ² test); stop if expanding the current node does not improve the impurity measure (e.g., Gini or information gain), which relates to cp in rpart. (A short rpart example follows.)
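
For example (my settings, chosen only for illustration), tightening minsplit and cp in rpart.control stops the splitting early:

    library(rpart)

    shallow <- rpart(Species ~ ., data = iris, method = "class",
                     control = rpart.control(minsplit = 40, cp = 0.05))
    shallow  # fewer splits than the default tree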

44 Training and test error as a function of cp

45 Evaluating a classifier. Metrics for performance evaluation: the confusion matrix.

                        PREDICTED Class=Yes   PREDICTED Class=No
    ACTUAL Class=Yes    a (TP)                b (FN)
    ACTUAL Class=No     c (FP)                d (TN)

The most widely-used metric:

46 Evaluating a classifier. Measures of performance. The simplest is accuracy: Accuracy = (TP + TN) / (TP + TN + FP + FN) = SMC(Actual, Predicted) = 1 - Misclassification Rate.
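
In R the whole confusion matrix and the accuracy drop out of table(); a toy illustration of mine with made-up labels:

    actual    <- factor(c("Yes", "Yes", "No", "No", "Yes", "No", "No", "Yes"),
                        levels = c("Yes", "No"))
    predicted <- factor(c("Yes", "No",  "No", "Yes", "Yes", "No", "No", "Yes"),
                        levels = c("Yes", "No"))

    cm <- table(Actual = actual, Predicted = predicted)
    cm                        # the a / b / c / d layout of the previous slide
    sum(diag(cm)) / sum(cm)   # accuracy = (TP + TN) / (TP + TN + FP + FN)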

47 Evaluating a classifier. Accuracy isn't everything. Consider an unbalanced 2-class problem with 10 examples of class 1 and 9990 examples of class 0. Simply labelling everything 0 yields 99.9% accuracy, but this classifier misses every class-1 example.

48 Evaluating a classifier. Cost matrix C(i|j), the cost of misclassifying a class-j example as class i:

                        PREDICTED Class=Yes   PREDICTED Class=No
    ACTUAL Class=Yes    C(Yes|Yes)            C(No|Yes)
    ACTUAL Class=No     C(Yes|No)             C(No|No)

49 Evaluating a classifier. Measures of performance. The classification rule changes to Label(p, C) = argmin_i Σ_j C(i|j) p_j. Accuracy carries the same information as cost when C(Y|Y) = C(N|N) = c_1 and C(Y|N) = C(N|Y) = c_2, since the total cost is then an affine function of the number of correct predictions.
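
A small R illustration of this cost-sensitive rule, with a made-up cost matrix in which missing a Yes is ten times worse than a false alarm:

    costs <- rbind(Yes = c(Yes = 0,  No = 1),
                   No  = c(Yes = 10, No = 0))   # costs[i, j] = cost of predicting i when truth is j
    p <- c(Yes = 0.2, No = 0.8)                 # estimated class probabilities at a node

    expected_cost <- costs %*% p                # one expected cost per candidate label
    rownames(costs)[which.min(expected_cost)]   # "Yes", even though P(Yes) < P(No)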

50 Evaluating a classifier. Computing the cost of classification: an example (from the Tan, Steinbach & Kumar slides) comparing two models fit to the same data under a fixed cost matrix. Model M_1 has accuracy 80% and cost 3910; model M_2 has accuracy 90% and cost 4255. The more accurate model is not necessarily the cheaper one.

51 Evaluating a classifier. Measures of performance. Other common ones: Precision = TP / (TP + FP); Specificity = TN / (TN + FP) = TNR; Sensitivity = Recall = TP / (TP + FN) = TPR; F = 2 · Recall · Precision / (Recall + Precision) = 2 TP / (2 TP + FN + FP).
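
These can be computed directly from the four cells of the confusion matrix; arbitrary counts of mine, just to show the arithmetic:

    TP <- 40; FN <- 10; FP <- 5; TN <- 45

    precision   <- TP / (TP + FP)
    recall      <- TP / (TP + FN)            # = sensitivity = TPR
    specificity <- TN / (TN + FP)            # = TNR
    F1          <- 2 * recall * precision / (recall + precision)

    c(precision = precision, recall = recall, specificity = specificity, F = F1)
    all.equal(F1, 2 * TP / (2 * TP + FN + FP))   # the two F formulas agree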

52 Evaluating a classifier. Measures of performance. Precision emphasizes P(p = Y, a = Y) and P(p = Y, a = N); Recall emphasizes P(p = Y, a = Y) and P(p = N, a = Y). Also, FPR = 1 - TNR and FNR = 1 - TPR.

53 Evaluating a classifier. Measures of performance. We have done some simple training / test splits to see how well our classifier is doing; more accurately, this procedure measures how well our algorithm for learning the classifier is doing. How well this works may depend on: Model (are we using the right type of classifier model?); Cost (is our algorithm sensitive to the cost of misclassification?); Data size (do we have enough data to learn a model?).

54 Evaluating a classifier. Learning curve. A learning curve shows how accuracy changes with varying sample size; it requires a sampling schedule for creating the curve, e.g. arithmetic sampling (Langley et al.) or geometric sampling (Provost et al.). With small samples there is bias in the estimate and variance in the estimate. Figure: as data increases, our estimate of accuracy improves and its variability shrinks.

55 Evaluating a classifier. Estimating performance. Holdout: split into test and training sets (e.g. 1/3 test, 2/3 training). Random subsampling: repeated replicates of holdout, averaging the results. Cross-validation: partition the data into K disjoint subsets; for each subset S_i, train on all but S_i, then test on S_i. Stratified sampling: it may help to sample so the Yes/No classes are roughly balanced in the training data. Bootstrap: combine the training error and the bootstrap error. (A cross-validation sketch follows.)
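
The cross-validation bullet, as an R sketch of mine: partition into K disjoint folds, train on K - 1 of them, test on the held-out fold, and average:

    library(rpart)

    set.seed(202)
    K     <- 5
    folds <- sample(rep(1:K, length.out = nrow(iris)))   # a random disjoint partition

    errs <- sapply(1:K, function(k) {
      fit  <- rpart(Species ~ ., data = iris[folds != k, ], method = "class")
      pred <- predict(fit, newdata = iris[folds == k, ], type = "class")
      mean(pred != iris$Species[folds == k])
    })
    mean(errs)   # cross-validated estimate of the misclassification rate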

