DATA MINING DECISION TREE INDUCTION


2 Classification Techniques
  Linear Models
  Support Vector Machines
  Decision Tree based Methods
  Rule-based Methods
  Memory based reasoning
  Neural Networks
  Naïve Bayes and Bayesian Belief Networks

3 Example of a Decision Tree

Training Data:
  Tid  Refund  Marital Status  Taxable Income  Cheat
  1    Yes     Single          125K            No
  2    No      Married         100K            No
  3    No      Single          70K             No
  4    Yes     Married         120K            No
  5    No      Divorced        95K             Yes
  6    No      Married         60K             No
  7    Yes     Divorced        220K            No
  8    No      Single          85K             Yes
  9    No      Married         75K             No
  10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)
  Refund?
    Yes -> NO
    No  -> MarSt?
             Married          -> NO
             Single, Divorced -> TaxInc?
                                   < 80K -> NO
                                   > 80K -> YES

4 Another Decision Tree Example

Training data: the same ten records as in the previous example.

  MarSt?
    Married          -> NO
    Single, Divorced -> Refund?
                          Yes -> NO
                          No  -> TaxInc?
                                   < 80K -> NO
                                   > 80K -> YES

More than one tree may perfectly fit the data.

5 Decision Tree Classification Task

Training Set:
  Tid  Attrib1  Attrib2  Attrib3  Class
  1    Yes      Large    125K     No
  2    No       Medium   100K     No
  3    No       Small    70K      No
  4    Yes      Medium   120K     No
  5    No       Large    95K      Yes
  6    No       Medium   60K      No
  7    Yes      Large    220K     No
  8    No       Small    85K      Yes
  9    No       Medium   75K      No
  10   No       Small    90K      Yes

Test Set:
  Tid  Attrib1  Attrib2  Attrib3  Class
  11   No       Small    55K      ?
  12   Yes      Medium   80K      ?
  13   Yes      Large    110K     ?
  14   No       Small    95K      ?
  15   No       Large    67K      ?

Induction: Training Set -> Tree Induction algorithm -> Learn Model -> Model (Decision Tree)
Deduction: Model (Decision Tree) -> Apply Model -> Test Set

6 Apply Model to Test Data
Start from the root of the tree.

Test Data:
  Refund  Marital Status  Taxable Income  Cheat
  No      Married         80K             ?

  Refund?
    Yes -> NO
    No  -> MarSt?
             Married          -> NO
             Single, Divorced -> TaxInc?
                                   < 80K -> NO
                                   > 80K -> YES

11 Apply Model to Test Data

Test Data:
  Refund  Marital Status  Taxable Income  Cheat
  No      Married         80K             ?

  Refund?
    Yes -> NO
    No  -> MarSt?
             Married          -> NO
             Single, Divorced -> TaxInc?
                                   < 80K -> NO
                                   > 80K -> YES

The record follows Refund = No, then MarSt = Married, and reaches the leaf NO.
Assign Cheat to "No".
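
The traversal above can be sketched in a few lines of Python. This is only an illustration of the example tree, not a general tree implementation; the function name `classify`, the dictionary keys, and the choice to send an income of exactly 80K down the "> 80K" branch are assumptions (the slide does not say which branch 80K belongs to, and the test record never reaches that node).

```python
def classify(record):
    """Walk the example tree from the root to a leaf; return the predicted Cheat label."""
    if record["refund"] == "Yes":
        return "No"
    if record["marital_status"] == "Married":
        return "No"
    # Single or Divorced: test taxable income (in thousands).
    # Assumption: exactly 80 is treated as the "> 80K" branch.
    return "No" if record["taxable_income"] < 80 else "Yes"


test_record = {"refund": "No", "marital_status": "Married", "taxable_income": 80}
print(classify(test_record))  # No
```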

12 Decision Tree Terminology 12

13 Decision Tree Induction
Many algorithms:
  Hunt's Algorithm (one of the earliest)
  CART
  ID3, C4.5
  SLIQ, SPRINT
John Ross Quinlan is a computer science researcher in data mining and decision theory. He has contributed extensively to the development of decision tree algorithms, including inventing the canonical C4.5 and ID3 algorithms.

14 Decision Tree Classifier (Ross Quinlan)
Features: Antenna Length, Abdomen Length

  Abdomen Length > 7.1?
    yes -> Katydid
    no  -> Antenna Length > 6.0?
             yes -> Katydid
             no  -> Grasshopper

15 Decision trees predate computers

  Antennae shorter than body?
    Yes -> Grasshopper
    No  -> 3 Tarsi?
             Yes -> Cricket
             No  -> Foretibia has ears?
                      Yes -> Katydids
                      No  -> Camel Cricket

16 Definition
A decision tree is a classifier in the form of a tree structure:
  Decision node: specifies a test on a single attribute
  Leaf node: indicates the value of the target attribute
  Arc/edge: one outcome of the split on an attribute
  Path: a conjunction of tests that leads to the final decision
Decision trees classify instances or examples by starting at the root of the tree and moving through it until a leaf node is reached.

17 Decision Tree Classification
Decision tree generation consists of two phases:
  Tree construction
    At start, all the training examples are at the root
    Partition examples recursively based on selected attributes
    This can also be called supervised segmentation, which emphasizes that we are segmenting the instance space
  Tree pruning
    Identify and remove branches that reflect noise or outliers

18 Decision Tree Representation
  Each internal node tests an attribute
  Each branch corresponds to an attribute value
  Each leaf node assigns a classification

  outlook?
    sunny    -> humidity?
                  high   -> no
                  normal -> yes
    overcast -> yes
    rain     -> wind?
                  strong -> no
                  weak   -> yes

19 How do we Construct a Decision Tree?
Basic algorithm (a greedy algorithm):
  Tree is constructed in a top-down recursive divide-and-conquer manner
  At start, all the training examples are at the root
  Examples are partitioned recursively based on selected attributes
  Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
Why do we call this a greedy algorithm? Because it makes locally optimal decisions (at each node).

20 When Do we Stop Partitioning?
  All samples for a node belong to the same class
  No remaining attributes: majority voting is used to assign the class
  No samples left
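
The greedy construction and the three stopping conditions can be sketched as follows. This is a minimal illustration under stated assumptions, not any specific published algorithm: it handles categorical attributes only, selects the attribute with the lowest weighted child entropy (equivalently, the highest information gain), and uses the Refund and Marital Status columns of the earlier training data while leaving out the continuous Taxable Income column. The names `build_tree`, `majority`, and the nested-dictionary tree format are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())


def majority(labels):
    """Most frequent class label (majority vote)."""
    return Counter(labels).most_common(1)[0][0]


def build_tree(rows, labels, attributes, default=None):
    if not rows:                   # stop: no samples left
        return default
    if len(set(labels)) == 1:      # stop: all samples belong to the same class
        return labels[0]
    if not attributes:             # stop: no remaining attributes, so vote
        return majority(labels)

    def weighted_child_entropy(attr):
        total = 0.0
        for value in set(r[attr] for r in rows):
            subset = [l for r, l in zip(rows, labels) if r[attr] == value]
            total += len(subset) / len(rows) * entropy(subset)
        return total

    # Greedy step: pick the locally best attribute for this node only.
    best = min(attributes, key=weighted_child_entropy)
    remaining = [a for a in attributes if a != best]
    node = {"attribute": best, "branches": {}}
    for value in sorted(set(r[best] for r in rows)):
        pairs = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        node["branches"][value] = build_tree(
            [r for r, _ in pairs], [l for _, l in pairs], remaining, majority(labels)
        )
    return node


# (Refund, Marital Status, Cheat) for Tid 1..10 of the training data.
data = [
    ("Yes", "Single", "No"), ("No", "Married", "No"), ("No", "Single", "No"),
    ("Yes", "Married", "No"), ("No", "Divorced", "Yes"), ("No", "Married", "No"),
    ("Yes", "Divorced", "No"), ("No", "Single", "Yes"), ("No", "Married", "No"),
    ("No", "Single", "Yes"),
]
rows = [{"refund": r, "marital_status": m} for r, m, _ in data]
labels = [c for _, _, c in data]
tree = build_tree(rows, labels, ["refund", "marital_status"])
print(tree)
```

Because branches are created only for attribute values that actually occur at a node, the "no samples left" case never fires in this sketch; the check is kept to mirror the list above.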

21 How to Pick Locally Optimal Split
Hunt's algorithm: recursively partition training records into successively purer subsets.
How to measure purity/impurity?
  Entropy and associated information gain
  Gini
  Classification error rate (never used in practice, but good for understanding and simple exercises)

22 How to Determine Best Split
Before splitting: 10 records of class 0, 10 records of class 1.

  Own Car?
    Yes: C0 = 6, C1 = 4
    No:  C0 = 4, C1 = 6
  Car Type?
    Family: C0 = 1, C1 = 3
    Sports: C0 = 8, C1 = 0
    Luxury: C0 = 1, C1 = 7
  Student ID?
    c1 ... c10:  C0 = 1, C1 = 0 (each)
    c11 ... c20: C0 = 0, C1 = 1 (each)

Which test condition is the best? Why is student ID a bad feature to use?

23 How to Determine Best Split
Greedy approach: nodes with homogeneous class distribution are preferred.
Need a measure of node impurity:
  C0 = 5, C1 = 5: non-homogeneous, high degree of impurity
  C0 = 9, C1 = 1: homogeneous, low degree of impurity

24 Information Theory
Think of playing "20 questions": I am thinking of an integer between 1 and 1,000. What is it? What is the first question you would ask, and why?
Entropy measures how much more information you need before you can identify the integer.
Initially, there are 1,000 possible values, which we assume are equally likely. What is the maximum number of questions you need to ask?
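
One way to check the answer is to count yes/no questions directly. The sketch below assumes the questioner always halves the remaining range (asking "is it <= mid?"), which is the usual strategy but is not spelled out on the slide; the function names are hypothetical.

```python
from math import ceil, log2


def questions_needed(n_values):
    """Worst-case number of yes/no questions for n equally likely values."""
    return ceil(log2(n_values))


def guess(secret, low=1, high=1000):
    """Halve the candidate range with each question; return how many were asked."""
    asked = 0
    while low < high:
        mid = (low + high) // 2
        asked += 1            # question: "is it <= mid?"
        if secret <= mid:
            high = mid
        else:
            low = mid + 1
    return asked


print(questions_needed(1000))                  # 10
print(max(guess(s) for s in range(1, 1001)))   # 10
```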

25 Entropy
Entropy (disorder, impurity) of a set of examples, S, relative to a binary classification is:
  $Entropy(S) = -p_1 \log_2(p_1) - p_0 \log_2(p_0)$
where $p_1$ is the fraction of positive examples in S and $p_0$ is the fraction of negatives.
If all examples are in one category, entropy is zero (we define $0 \log(0) = 0$).
If examples are equally mixed ($p_1 = p_0 = 0.5$), entropy is at its maximum of 1.
For multi-class problems with c categories, entropy generalizes to:
  $Entropy(S) = -\sum_{i=1}^{c} p_i \log_2(p_i)$

26 Entropy for Binary Classification The entropy is 0 if the outcome is certain. The entropy is maximum if we have no knowledge of the system (or any outcome is equally possible). Entropy of a 2-class problem with regard to the portion of one of the two groups 26

27 Information Gain in Decision Tree Induction
Information gain is the expected reduction in entropy caused by partitioning the examples according to an attribute.
Assume that using attribute A, the current set will be partitioned into some number of child sets.
The encoding information that would be gained by branching on A is:
  $Gain(A) = E(\text{current set}) - \sum E(\text{all child sets})$
The summation in the above formula is a bit misleading, since when doing the summation we weight each entropy by the fraction of total examples in the particular child set. This applies to GINI and error rate also.

28 Examples for Computing Entropy
  $Entropy(t) = -\sum_j p(j \mid t) \log_2 p(j \mid t)$
NOTE: $p(j \mid t)$ is computed as the relative frequency of class j at node t.

  C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
    $Entropy = -0 \log_2 0 - 1 \log_2 1 = -0 - 0 = 0$
  C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
    $Entropy = -(1/6) \log_2(1/6) - (5/6) \log_2(5/6) = 0.65$
  C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
    $Entropy = -(2/6) \log_2(2/6) - (4/6) \log_2(4/6) = 0.92$
  C1 = 3, C2 = 3: P(C1) = 3/6 = 1/2, P(C2) = 3/6 = 1/2
    $Entropy = -(1/2) \log_2(1/2) - (1/2) \log_2(1/2) = -(1/2)(-1) - (1/2)(-1) = 1/2 + 1/2 = 1$
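
The four node examples can be reproduced with a short helper. This is a sketch; the function name `entropy` and the convention of passing class counts as a tuple are assumptions. Each term is written as p * log2(1/p), which equals -p * log2(p), and empty classes are skipped to implement the convention 0 log 0 = 0.

```python
from math import log2


def entropy(counts):
    """Entropy of a node given its per-class record counts."""
    n = sum(counts)
    # p * log2(1/p) == -p * log2(p); classes with zero records contribute 0.
    return sum((c / n) * log2(n / c) for c in counts if c > 0)


for counts in [(0, 6), (1, 5), (2, 4), (3, 3)]:
    print(counts, round(entropy(counts), 2))
# (0, 6) 0.0
# (1, 5) 0.65
# (2, 4) 0.92
# (3, 3) 1.0
```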

29 How to Calculate $\log_2 x$
Many calculators only have a button for $\log_{10} x$ and $\log_e x$ ("log" typically means $\log_{10}$).
You can calculate the log for any base b as follows:
  $\log_b(x) = \log_k(x) / \log_k(b)$
Thus $\log_2(x) = \log_{10}(x) / \log_{10}(2)$.
Since $\log_{10}(2) = 0.301$, just calculate the log base 10 and divide by 0.301 to get log base 2.
You can use this for HW if needed.

30 Splitting Based on INFO...
Information Gain:
  $GAIN_{split} = Entropy(p) - \sum_{i=1}^{k} \frac{n_i}{n} Entropy(i)$
Parent node p is split into k partitions; $n_i$ is the number of records in partition i and n is the number of records at p.
Uses a weighted average of the child nodes, where the weight is based on the number of examples.
Used in ID3 and C4.5 decision tree learners.
WEKA's J48 is a Java version of C4.5.
Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.
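
The disadvantage can be seen by applying the gain formula to the three candidate splits from the earlier "How to Determine Best Split" slide (Own Car, Car Type, Student ID). The sketch below is illustrative; the function names and the tuple-of-counts representation are assumptions.

```python
from math import log2


def entropy(counts):
    """Entropy of a node given its per-class record counts."""
    n = sum(counts)
    return sum((c / n) * log2(n / c) for c in counts if c > 0)


def information_gain(parent, children):
    """Parent entropy minus the record-weighted average of child entropies."""
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = (10, 10)  # (C0, C1) before splitting
splits = {
    "Own Car": [(6, 4), (4, 6)],
    "Car Type": [(1, 3), (8, 0), (1, 7)],
    # 20 single-record partitions: c1..c10 are class 0, c11..c20 are class 1.
    "Student ID": [(1, 0)] * 10 + [(0, 1)] * 10,
}
gains = {name: information_gain(parent, kids) for name, kids in splits.items()}
for name, gain in gains.items():
    print(f"{name}: {gain:.3f}")
```

Student ID gets the highest possible gain (each partition is a single record and therefore pure), even though an ID cannot help classify any new student; Car Type scores roughly 0.62 and Own Car roughly 0.03.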

31 How to Split on Continuous Attributes?
For continuous attributes:
  Partition the continuous values of attribute A into a discrete set of intervals
  Create a new boolean attribute $A_c$ by looking for a threshold c:
    $A_c$ is true if $A \le c$, false otherwise
  One method is to try all possible splits
How to choose c?
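
"Try all possible splits" can be sketched as an exhaustive threshold search. The example below uses the Taxable Income and Cheat columns from the earlier training data. Several choices here are assumptions rather than something the slides prescribe: candidate thresholds are midpoints between consecutive distinct values, the test is `A <= c`, candidates are scored by information gain, and the name `best_threshold` is hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (c, gain) for the boolean test `value <= c` with the highest gain."""
    pairs = list(zip(values, labels))
    distinct = sorted(set(values))
    # Assumption: candidate cuts are midpoints between neighbouring values.
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    parent = entropy(labels)
    best_c, best_gain = None, -1.0
    for c in candidates:
        left = [l for v, l in pairs if v <= c]
        right = [l for v, l in pairs if v > c]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if parent - weighted > best_gain:
            best_c, best_gain = c, parent - weighted
    return best_c, best_gain


# Taxable Income (in thousands) and Cheat for Tid 1..10.
income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
c, gain = best_threshold(income, cheat)
print(c, round(gain, 4))
```

Over all ten records the search picks c = 97.5. The 80K cut in the example tree is applied lower in the tree, to a subset of the records, so the two thresholds need not agree.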

32
  Person   Hair Length   Weight   Age   Class
  Homer                                 M
  Marge                                 F
  Bart                                  M
  Lisa                                  F
  Maggie                                F
  Abe                                   M
  Selma                                 F
  Otto                                  M
  Krusty                                M
  Comic                                 ?

33 Let us try splitting on Hair Length (test: Hair Length <= 5?)
For a set S with p examples of one class and n of the other:
  $Entropy(S) = -\frac{p}{p+n} \log_2\left(\frac{p}{p+n}\right) - \frac{n}{p+n} \log_2\left(\frac{n}{p+n}\right)$
  $Gain(A) = E(\text{current set}) - \sum E(\text{all child sets})$
  $Entropy(4F, 5M) = -(4/9) \log_2(4/9) - (5/9) \log_2(5/9) = 0.9911$
  $Gain(\text{Hair Length} \le 5) = 0.9911 - (4/9 \cdot E(\text{child}_1) + 5/9 \cdot E(\text{child}_2))$

34 Let us try splitting on Weight (test: Weight <= 160?)
Same $Entropy(S)$ and $Gain(A)$ formulas as above.
  $Entropy(4F, 5M) = -(4/9) \log_2(4/9) - (5/9) \log_2(5/9) = 0.9911$
  $Gain(\text{Weight} \le 160) = 0.9911 - (5/9 \cdot 0.7219 + 4/9 \cdot 0) = 0.5900$

35 Let us try splitting on Age (test: Age <= 40?)
Same $Entropy(S)$ and $Gain(A)$ formulas as above.
  $Entropy(4F, 5M) = -(4/9) \log_2(4/9) - (5/9) \log_2(5/9) = 0.9911$
  $Gain(\text{Age} \le 40) = 0.9911 - (6/9 \cdot 1 + 3/9 \cdot 0.9183) = 0.0183$

36 Of the 3 features we had, Weight was best. But while people who weigh over 160 are perfectly classified (as males), the people at or under 160 are not perfectly classified. So we simply recurse!
This time we find that we can split on Hair Length, and we are done!

  Weight <= 160?
    no  -> all male
    yes -> Hair Length <= 2?
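
The gains for the Weight and Age splits can be checked numerically. The child class counts used below are inferred from the slides rather than read from the data table: the over-160 group has 4 people, all male, so the other Weight child must hold 4 females and 1 male; the 6-person Age child has entropy 1, so it holds 3 females and 3 males, leaving 1 female and 2 males in the other child. The function names are hypothetical.

```python
from math import log2


def entropy(counts):
    """Entropy of a node given its per-class record counts."""
    n = sum(counts)
    return sum((c / n) * log2(n / c) for c in counts if c > 0)


def information_gain(parent, children):
    """Parent entropy minus the record-weighted average of child entropies."""
    n = sum(parent)
    return entropy(parent) - sum(sum(ch) / n * entropy(ch) for ch in children)


parent = (4, 5)  # (females, males)

# Inferred child counts (see the lead-in for how they were derived).
gain_weight = information_gain(parent, [(4, 1), (0, 4)])
gain_age = information_gain(parent, [(3, 3), (1, 2)])

print(round(entropy(parent), 4))  # 0.9911
print(round(gain_weight, 4))      # 0.59
print(round(gain_age, 4))         # 0.0183
```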

37 We don't need to keep the data around, just the test conditions.

  Weight <= 160?
    yes -> Hair Length <= 2?
             yes -> Male
             no  -> Female
    no  -> Male

How would these people be classified?

38 It is trivial to convert Decision Trees to rules

  Weight <= 160?
    yes -> Hair Length <= 2?
             yes -> Male
             no  -> Female
    no  -> Male

Rules to Classify Males/Females:
  If Weight greater than 160, classify as Male
  Elseif Hair Length less than or equal to 2, classify as Male
  Else classify as Female
Note: could avoid use of elseif by specifying all test conditions from root to corresponding leaf.
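
Both forms of the rules translate directly into code. The sketch below is illustrative: `classify` mirrors the if/elseif/else rules, while `classify_with_full_rules` spells out every condition from the root to each leaf, as the note suggests. The function names and the rule-list representation are assumptions.

```python
def classify(weight, hair_length):
    """Rules in if / elseif / else form."""
    if weight > 160:
        return "Male"
    elif hair_length <= 2:
        return "Male"
    else:
        return "Female"


# Each rule lists all test conditions on the path from the root to one leaf,
# so the rules are mutually exclusive and their order does not matter.
FULL_RULES = [
    (lambda w, h: w > 160, "Male"),
    (lambda w, h: w <= 160 and h <= 2, "Male"),
    (lambda w, h: w <= 160 and h > 2, "Female"),
]


def classify_with_full_rules(weight, hair_length):
    for condition, label in FULL_RULES:
        if condition(weight, hair_length):
            return label


print(classify(170, 6))                  # Male
print(classify_with_full_rules(150, 6))  # Female
```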

39 Once we have learned the decision tree, we don't even need a computer! This decision tree is attached to a medical machine, and is designed to help nurses make decisions about what type of doctor to call.
Figure caption: Decision tree for a typical shared-care setting applying the system for the diagnosis of prostatic obstructions.

40 The worked examples we have seen were performed on small datasets. However, with small datasets there is a great danger of overfitting the data. When you have few datapoints, there are many possible splitting rules that perfectly classify the data but will not generalize to future datasets.
For example, the rule "Wears green?" (Yes -> Female, No -> Male) perfectly classifies the data; so does "Mother's name is Jacqueline?", and so does "Has blue shoes?".

41 GINI is Another Measure of Impurity
Gini for a given node t with classes j:
  $GINI(t) = 1 - \sum_j [p(j \mid t)]^2$
NOTE: $p(j \mid t)$ is again computed as the relative frequency of class j at node t.
Compute the best split by finding the partition that yields the lowest GINI, where we again take the weighted average of the children's GINI.
Best GINI = 0.0; worst GINI = 0.5 (for a 2-class problem)
  C1 = 0, C2 = 6: Gini = 0.000
  C1 = 1, C2 = 5: Gini = 0.278
  C1 = 2, C2 = 4: Gini = 0.444
  C1 = 3, C2 = 3: Gini = 0.500
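
A short sketch reproduces the four Gini values and shows the weighted average used to score a split. The function names are hypothetical, and the three-child split passed to `weighted_gini` is just the first three example nodes reused for illustration.

```python
def gini(counts):
    """Gini impurity of a node given its per-class record counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


def weighted_gini(children):
    """Record-weighted average Gini over the child nodes of a split."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)


for counts in [(0, 6), (1, 5), (2, 4), (3, 3)]:
    print(counts, f"{gini(counts):.3f}")
# (0, 6) 0.000
# (1, 5) 0.278
# (2, 4) 0.444
# (3, 3) 0.500

print(f"{weighted_gini([(0, 6), (1, 5), (2, 4)]):.3f}")  # 0.241
```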

42 Splitting Criteria based on Classification Error
Classification error at a node t:
  $Error(t) = 1 - \max_i P(i \mid t)$
Measures the misclassification error made by a node.
  Maximum ($1 - 1/n_c$, where $n_c$ is the number of classes) when records are equally distributed among all classes, implying least interesting information. This is 1/2 for 2-class problems.
  Minimum (0.0) when all records belong to one class, implying most interesting information.

43 Examples for Computing Error
  $Error(t) = 1 - \max_i P(i \mid t)$

  C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
    $Error = 1 - \max(0, 1) = 1 - 1 = 0$
  C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
    $Error = 1 - \max(1/6, 5/6) = 1 - 5/6 = 1/6$
  C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
    $Error = 1 - \max(2/6, 4/6) = 1 - 4/6 = 1/3$
Equivalently, predict the majority class and determine the fraction of errors.

44 Complete Example using Error Rate
Initial sample has 3 C1 and 15 C2.
Based on one 3-way split you get these 3 child nodes:
  C1 = 0, C2 = 6
  C1 = 1, C2 = 5
  C1 = 2, C2 = 4
What is the decrease in error rate? What is the error rate initially? What is it afterwards?
As usual you need to take the weighted average (but there is a shortcut).

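45 Error Rate Example Continued
Child nodes: (C1 = 0, C2 = 6), (C1 = 1, C2 = 5), (C1 = 2, C2 = 4)
Error rate before: 3/18
Error rate after:
  Shortcut: number of errors = 0 + 1 + 2 = 3 out of 18 examples, so error rate = 3/18
  Weighted average method: 6/18 x 0 + 6/18 x 1/6 + 6/18 x 2/6, which simplifies to 1/18 + 2/18 = 3/18
The error rate is 3/18 both before and after, so this split does not decrease it.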
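
Both the shortcut and the weighted-average method can be checked with exact fractions. This is a sketch; the function name `error_rate` and the tuple-of-counts representation are assumptions.

```python
from fractions import Fraction


def error_rate(counts):
    """Classification error of a node: 1 - (majority class fraction)."""
    return 1 - Fraction(max(counts), sum(counts))


parent = (3, 15)                      # (C1, C2) before the split
children = [(0, 6), (1, 5), (2, 4)]   # the three child nodes
n = sum(parent)

before = error_rate(parent)

# Weighted average method: weight each child's error by its share of records.
after = sum(Fraction(sum(c), n) * error_rate(c) for c in children)

# Shortcut: count the minority-class records in each child, divide by the total.
shortcut = Fraction(sum(sum(c) - max(c) for c in children), n)

print(before, after, shortcut)  # 1/6 1/6 1/6
```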

46 Comparison among Splitting Criteria For a 2-class problem: 46

47 Discussion Error rate is often the metric used to evaluate a classifier (but not always) So it seems reasonable to use error rate to determine the best split That is, why not just use a splitting metric that matches the ultimate evaluation metric? But this is wrong! The reason is related to the fact that decision trees use a greedy strategy, so we need to use a splitting metric that leads to globally better results The other metrics will empirically outperform error rate, although there is no proof for this. 47
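
One commonly cited illustration of the difference, offered here as an assumption rather than as the slide's own argument, is that error rate can report no improvement for a split that entropy and Gini both reward. The sketch below scores the same 3-way split from the error rate example (parent with 3 C1 and 15 C2) under all three measures; the function names are hypothetical.

```python
from math import log2


def entropy(counts):
    n = sum(counts)
    return sum((c / n) * log2(n / c) for c in counts if c > 0)


def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


def error(counts):
    return 1.0 - max(counts) / sum(counts)


def impurity_decrease(measure, parent, children):
    """Parent impurity minus the record-weighted average child impurity."""
    n = sum(parent)
    return measure(parent) - sum(sum(c) / n * measure(c) for c in children)


parent = (3, 15)
children = [(0, 6), (1, 5), (2, 4)]

decreases = {
    name: impurity_decrease(measure, parent, children)
    for name, measure in [("error", error), ("entropy", entropy), ("gini", gini)]
}
for name, value in decreases.items():
    print(f"{name}: {value:.4f}")
```

Error rate sees no change (up to floating-point rounding), while entropy drops by roughly 0.13 and Gini by roughly 0.04, so only the latter two would credit this split with creating one pure child.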

48 How to Specify Test Condition?
Depends on attribute types:
  Nominal
  Ordinal
  Continuous
Depends on number of ways to split:
  2-way split
  Multi-way split

49 Splitting Based on Nominal Attributes
Multi-way split: use as many partitions as distinct values.
  CarType -> Family | Sports | Luxury
Binary split: divides values into two subsets. Need to find optimal partitioning.
  CarType -> {Sports, Luxury} | {Family}
  OR
  CarType -> {Family, Luxury} | {Sports}

50 Splitting Based on Ordinal Attributes
Multi-way split: use as many partitions as distinct values.
  Size -> Small | Medium | Large
Binary split: divides values into two subsets. Need to find optimal partitioning.
  Size -> {Small, Medium} | {Large}
  OR
  Size -> {Medium, Large} | {Small}
What about this split?
  Size -> {Small, Large} | {Medium}

51 Splitting Based on Continuous Attributes
Different ways of handling:
  Discretization to form an ordinal categorical attribute
    Static: discretize once at the beginning
    Dynamic: ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering
  Binary decision: (A < v) or (A ≥ v)
    Consider all possible splits and find the best cut
    Can be more compute intensive

52 Splitting Based on Continuous Attributes
(i) Binary split:
  Taxable Income > 80K? -> Yes | No
(ii) Multi-way split:
  Taxable Income? -> < 10K | [10K,25K) | [25K,50K) | [50K,80K) | > 80K

53 Data Fragmentation
  Number of instances gets smaller as you traverse down the tree
  Number of instances at the leaf nodes could be too small to make a statistically significant decision
  Decision trees can suffer from data fragmentation, especially if there are many features and not too many examples
True or False: All classification methods may suffer data fragmentation.
False: not logistic regression or instance-based learning. It only applies to divide-and-conquer methods.

54 Expressiveness
Expressiveness relates to the flexibility of the classifier in forming decision boundaries.
  Linear models are not that expressive since they can only form linear boundaries
  Decision tree models can form rectangular regions
Which is more expressive and why? Decision trees, because they can form many regions, but DTs do have the limitation of only forming axis-parallel boundaries.
Decision trees do not generalize well to certain types of functions (like parity, which depends on all features): for accurate modeling, the tree must be complete.
Not expressive enough for modeling continuous variables, especially when more than one variable at a time is involved.

55 Decision Boundary
[Figure: scatter plot of two classes over x and y, partitioned by the tree x < 0.43? (Yes: y < 0.47?; No: y < 0.33?)]
The border line between two neighboring regions of different classes is known as the decision boundary.
The decision boundary is parallel to the axes because each test condition involves a single attribute at a time.

56 Oblique Decision Trees
[Figure: a single oblique test condition, x + y < 1, separating the two classes]
This special type of decision tree avoids some weaknesses and increases the expressiveness of decision trees.
This is not what we mean when we refer to decision trees (e.g., on an exam).

57 Tree Replication
[Figure: decision tree over attributes P, Q, R, S in which the same subtree is replicated in more than one branch]
This can be viewed as a weakness of decision trees, but this is really a minor issue.

58 Pros and Cons of Decision Trees
Advantages:
  Easy to understand
    Can get a global view of what is going on and also explain individual decisions
    Can generate rules from them
  Fast to build and apply
  Can handle redundant and irrelevant features and missing values
Disadvantages:
  Limited expressive power
  May suffer from overfitting; a validation set may be necessary to avoid it

59 More to Come on Decision Trees We have covered most of the essential aspects of decision trees except pruning We will cover pruning next and, more generally, overfitting avoidance We will also cover evaluation, which applies to decision trees but also to all predictive models 59