Classification 1. Jun Du The University of Western Ontario


1 Classification 1 Jun Du The University of Western Ontario jdu43@uwo.ca

2 Outline
- Supervised Learning: Classification vs Regression
- Decision Tree (Classification)
- Naïve Bayes (Classification)
- Instance-Based Classifiers: KNN (Classification)

3 Supervised Learning
- Given: a collection of examples/instances (the training set). Each example contains a set of attributes/features; one of the attributes is the class/label.
- Task: find a model for the class attribute as a function of the values of the other attributes.
- Goal: previously unseen examples should be assigned a class as accurately as possible.
- A test set is used to determine the accuracy of the model. The given data set is divided into training and test sets: the training set is used to build the model and the test set to validate it.

4 Illustrating Supervised Learning Task
Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes
Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Induction: Training Set -> Learning algorithm -> Learn Model -> Model
Deduction: Test Set -> Apply Model -> predicted Class

5 Two Types of Supervised Learning
- There are usually two types of supervised learning tasks: classification and regression.
- Classification: predict a discrete/nominal value.
- Regression: predict a continuous/numeric value.
- Although the difference seems insignificant, the models (and the techniques used to build them) are totally different.

6 Recall Illustration
What makes the difference between classification and regression in the illustration?
(Same training set, Tid 1-10, and test set, Tid 11-15, as in slide 4.)

7 Examples of Different Tasks
- Predict tumor cells as benign or malignant (classification)
- Predict tumor size (regression)
- Predict credit card transactions as legitimate or fraudulent (classification)
- Predict credit score (regression)
- Predict whether a cell phone customer will switch to another telecommunication company (classification)
- Predict how much profit a cell phone customer can bring (regression)

8 Algorithms
- Classification: Decision Tree, Naïve Bayes, K Nearest Neighbor, Neural Networks, Support Vector Machines, Ensemble Methods
- Regression: Linear Regression

10 Example of Decision Tree
Training Data:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)
Refund?
  Yes -> NO
  No  -> MarSt?
           Married          -> NO
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES

11 Another Example
Same training data as in slide 10, fitted by a different tree:
MarSt?
  Married          -> NO
  Single, Divorced -> Refund?
                        Yes -> NO
                        No  -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES
There could be more than one tree that fits the same data!

12 Decision Tree Classification Task
Same training set (Tid 1-10) and test set (Tid 11-15) as in slide 4; here the learning algorithm is a tree induction algorithm and the model is a decision tree.
Induction: Training Set -> Tree Induction algorithm -> Learn Model -> Decision Tree
Deduction: Test Set -> Apply Model (the Decision Tree) -> predicted Class

13 Apply Model to Test Data (1)
Basic idea: start from the root of the tree, follow the branches that match the record's attribute values, and stop when an external (leaf) node is reached.
Test Data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Refund?
  Yes -> NO
  No  -> MarSt?
           Married          -> NO
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES

14 Apply Model to Test Data (2)
Test Data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Tree: same as in slide 13 (Refund at the root, then MarSt, then TaxInc).

18 Apply Model to Test Data (6)
Test Data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Path through the tree: Refund = No -> MarSt = Married -> leaf NO.
Assign Cheat to "No".
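
The walk from the root to a leaf can be sketched in a few lines of Python. This is only an illustration of the example tree: the function name `classify` and the dictionary keys are made up here, and a taxable income of exactly 80K is assumed to fall into the upper branch.

```python
def classify(record):
    """Walk the example tree from the root to a leaf and return the Cheat label."""
    # Root node: Refund
    if record["Refund"] == "Yes":
        return "No"
    # Refund = No: test Marital Status
    if record["Marital Status"] == "Married":
        return "No"
    # Single or Divorced: test Taxable Income (in thousands).
    # Assumption: exactly 80K is treated as the upper branch.
    return "No" if record["Taxable Income"] < 80 else "Yes"


test_record = {"Refund": "No", "Marital Status": "Married", "Taxable Income": 80}
print(classify(test_record))  # No
```

The test record leaves the tree at the Married leaf, so the Taxable Income test is never reached for it.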

19 Decision Tree Classification Task
Same training set (Tid 1-10) and test set (Tid 11-15) as in slide 4: a tree induction algorithm learns a decision tree from the training set (induction), and the tree is then applied to the test set (deduction).

20 Decision Tree Induction
Many algorithms: ID3, C4.5, C5, CART

21 Basic Algorithm
The tree is grown step by step on the training data of slide 10 (Y = number of Cheat = Yes, N = number of Cheat = No):
1. A single node holding all 10 records (Y: 3, N: 7), labeled Don't Cheat.
2. Split on Refund: Yes -> Don't Cheat (Y: 0, N: 3); No -> Don't Cheat (Y: 3, N: 4).
3. Under Refund = No, split on Marital Status: Married -> Don't Cheat (Y: 0, N: 3); Single, Divorced -> Cheat (Y: 3, N: 1).
4. Under Single, Divorced, split on Taxable Income: < 80K -> Don't Cheat; >= 80K -> Cheat.

22 Tree Induction
- Key issue: determine which feature to select to build the tree.
- Short answer: design a tree splitting criterion, and select the feature that best meets the criterion to expand the tree.
- Greedy strategy: based on a certain criterion, greedily select features to split the examples.

23 Which Attribute to Select: Intuition
Before splitting: 10 examples of class 0 and 10 examples of class 1.
Which split is better? Why?

24 Splitting Criterion
- Greedy approach: nodes with a homogeneous (pure) class distribution are preferred.
- Need a measure of node purity:
  - C0: 5, C1: 5 is non-homogeneous (low degree of purity)
  - C0: 9, C1: 1 is homogeneous (high degree of purity)

25 Which Attribute to Select --- Intuition 24

26 Measures of Node Purity
- Entropy (ID3 / C4.5)
  - Information Gain
  - Gain Ratio
- Gini Index (CART)
- Misclassification Error

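27 Computing Entropy (1)
Formula for computing the entropy (logarithms are base 2, so entropy is measured in bits):
$\mathrm{entropy}(p_1, p_2, \ldots, p_n) = -p_1 \log p_1 - p_2 \log p_2 - \cdots - p_n \log p_n$
Example: p(yes) = 0.5; p(no) = 0.5
$\mathrm{entropy}(p(\text{yes}), p(\text{no})) = -0.5 \log(0.5) - 0.5 \log(0.5) = 1$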

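28 Computing Entropy (2)
Formula for computing the entropy:
$\mathrm{entropy}(p_1, p_2, \ldots, p_n) = -p_1 \log p_1 - p_2 \log p_2 - \cdots - p_n \log p_n$
Example: p(yes) = 1; p(no) = 0
$\mathrm{entropy}(p(\text{yes}), p(\text{no})) = -1 \log(1) - 0 \log(0) = 0$ (taking $0 \log 0 = 0$ by convention)
- Pure node -> low entropy -> good
- Impure node -> high entropy -> bad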
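
As a quick check of the two examples, the entropy formula can be written as a small Python function. This is a minimal sketch: the name `entropy` is chosen here for illustration, base-2 logarithms are assumed, and terms with p = 0 are skipped so that 0 log 0 counts as 0.

```python
import math


def entropy(*probs):
    """Entropy in bits of a discrete distribution; terms with p == 0 contribute 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


print(entropy(0.5, 0.5))  # 1.0 (maximally impure two-class node)
print(entropy(1, 0))      # 0.0 (pure node)
```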

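29 Example: Attribute Outlook
Class counts are written as [yes, no].
- Outlook = Sunny: $\mathrm{info}([2,3]) = \mathrm{entropy}(2/5, 3/5) = 0.971$ bits
- Outlook = Overcast: $\mathrm{info}([4,0]) = \mathrm{entropy}(1, 0) = 0$ bits
- Outlook = Rainy: $\mathrm{info}([3,2]) = \mathrm{entropy}(3/5, 2/5) = 0.971$ bits
Expected info for Outlook (weighted sum):
$\mathrm{info}([2,3],[4,0],[3,2]) = (5/14) \times 0.971 + (4/14) \times 0 + (5/14) \times 0.971 = 0.693$ bits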

30 Computing Information Gain
Information gain = (information before split) - (information after split)
$\mathrm{gain}(\text{"Outlook"}) = \mathrm{info}([9,5]) - \mathrm{info}([2,3],[4,0],[3,2]) = 0.940 - 0.693 = 0.247$ bits
Intuitively, information gain is how much information is gained by selecting the corresponding attribute to split the data.
- High information gain -> good
- Low information gain -> bad
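
The subtraction can be reproduced directly from the class counts. The sketch below is illustrative only: the helper names `info` and `gain` mirror the slide notation but are defined here, and each branch is passed as a [yes, no] count list.

```python
import math


def info(*branches):
    """Weighted average entropy (bits) over branches given as class-count lists."""
    total = sum(sum(counts) for counts in branches)
    result = 0.0
    for counts in branches:
        n = sum(counts)
        if n == 0:
            continue  # an empty branch contributes nothing
        h = sum(-(c / n) * math.log2(c / n) for c in counts if c > 0)
        result += (n / total) * h
    return result


def gain(before, *after):
    """Information before the split minus expected information after it."""
    return info(before) - info(*after)


print(round(info([9, 5]), 3))                          # 0.94
print(round(gain([9, 5], [2, 3], [4, 0], [3, 2]), 3))  # 0.247 (Outlook)
print(round(gain([9, 5], [3, 4], [6, 1]), 3))          # 0.152 (Humidity)
```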

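31 Example: Attribute Humidity
- Humidity = High: $\mathrm{info}([3,4]) = \mathrm{entropy}(3/7, 4/7) = 0.985$ bits
- Humidity = Normal: $\mathrm{info}([6,1]) = \mathrm{entropy}(6/7, 1/7) = 0.592$ bits
Expected information for Humidity:
$\mathrm{info}([3,4],[6,1]) = (7/14) \times 0.985 + (7/14) \times 0.592 = 0.788$ bits
Information gain:
$\mathrm{info}([9,5]) - \mathrm{info}([3,4],[6,1]) = 0.940 - 0.788 = 0.152$ bits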

32 Select the Attribute
Information gain for all attributes from the weather data:
- gain("Outlook") = 0.247 bits
- gain("Humidity") = 0.152 bits
- gain("Temperature") = 0.029 bits
- gain("Windy") = 0.048 bits
Outlook has the highest gain, so it is selected to build the tree. What's next?

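33 Continuing to Split
Within the Outlook = Sunny branch (2 yes, 3 no), the gains computed from the weather data are:
- gain("Temperature") = 0.571 bits
- gain("Windy") = 0.020 bits
- gain("Humidity") = 0.971 bits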

34 Final Decision Tree
Splitting stops when the data can't be split any further:
- All examples in the same node belong to the same class
- No new attribute can be used to split the data (when could this happen?)
- (or early termination; to be discussed later)

35 Not Done Yet
Example:
ID  Outlook   Temperature  Humidity  Windy  Play?
A   sunny     hot          high      false  No
B   sunny     hot          high      true   No
C   overcast  hot          high      false  Yes
D   rain      mild         high      false  Yes
E   rain      cool         normal    false  Yes
F   rain      cool         normal    true   No
G   overcast  cool         normal    true   Yes
H   sunny     mild         high      false  No
I   sunny     cool         normal    false  Yes
J   rain      mild         normal    false  Yes
K   sunny     mild         normal    true   Yes
L   overcast  mild         high      true   Yes
M   overcast  hot          normal    false  Yes
N   rain      mild         high      true   No

36 Split for ID Code Attribute
- Entropy of each leaf node = 0, since each leaf node is pure (it holds only one case).
- Information gain is therefore maximized for ID Code: info([9,5]) - 0 = 0.940 bits.
- The final tree is a single split on ID Code with one leaf per example. Is it good? Why?
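
To see the effect numerically, the information gain of every column, including the ID code, can be computed from the 14 weather examples. This is a sketch for illustration; the names `DATA`, `entropy` and `gain` are introduced here and are not part of the lecture material.

```python
import math
from collections import Counter, defaultdict

COLUMNS = ["ID", "Outlook", "Temperature", "Humidity", "Windy", "Play"]
RAW = """
A sunny hot high false No
B sunny hot high true No
C overcast hot high false Yes
D rain mild high false Yes
E rain cool normal false Yes
F rain cool normal true No
G overcast cool normal true Yes
H sunny mild high false No
I sunny cool normal false Yes
J rain mild normal false Yes
K sunny mild normal true Yes
L overcast mild high true Yes
M overcast hot normal false Yes
N rain mild high true No
"""
DATA = [dict(zip(COLUMNS, line.split())) for line in RAW.strip().splitlines()]


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain(rows, attribute):
    """Information gain of splitting rows on the given attribute."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row["Play"])
    after = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row["Play"] for row in rows]) - after


gains = {a: round(gain(DATA, a), 3) for a in COLUMNS[:-1]}
print(gains)
# {'ID': 0.94, 'Outlook': 0.247, 'Temperature': 0.029, 'Humidity': 0.152, 'Windy': 0.048}
```

The ID column wins by a wide margin even though it says nothing about unseen examples.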

37 Limitation of Information Gain
- Problematic: attributes with a large number of values (extreme case: ID code)
- Subsets are more likely to be pure if there is a large number of values
- Information gain is biased towards choosing attributes with a large number of values
- Solution: Gain Ratio

38 Gain Ratio
- Gain ratio: a modification of the information gain that reduces its bias towards high-branch attributes
- Gain ratio takes the number and size of branches into account when choosing an attribute
- It corrects the information gain by taking the Split-Info of a split into account

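39 Computing Intrinsic Information
Split-Info: entropy of the distribution of instances into branches.
Example: ID Code splits the 14 instances into 14 branches of one instance each.
Split_Info for ID code:
$\mathrm{split\_info}(\text{"ID\_code"}) = \mathrm{entropy}(1/14, 1/14, \ldots, 1/14) = 14 \times \left(-\tfrac{1}{14} \log \tfrac{1}{14}\right) = \log 14 = 3.807$ bits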

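40 Computing Gain Ratio
Gain ratio formula:
$\mathrm{gain\_ratio}(\text{"Attribute"}) = \dfrac{\mathrm{gain}(\text{"Attribute"})}{\mathrm{split\_info}(\text{"Attribute"})}$
Example: gain ratio for ID code:
$\mathrm{gain\_ratio}(\text{"ID\_code"}) = \dfrac{\mathrm{gain}(\text{"ID\_code"})}{\mathrm{split\_info}(\text{"ID\_code"})} = \dfrac{0.940}{3.807} \approx 0.247$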

41 Gain Ratios for Weather Data
Attribute    Info   Gain   Split info             Gain ratio
Outlook      0.693  0.247  info([5,4,5]) = 1.577  0.247 / 1.577 = 0.156
Temperature  0.911  0.029  info([4,6,4]) = 1.557  0.029 / 1.557 = 0.019
Humidity     0.788  0.152  info([7,7]) = 1.000    0.152 / 1.000 = 0.152
Windy        0.892  0.048  info([8,6]) = 0.985    0.048 / 0.985 = 0.049
(All values in bits; gain ratios are rounded from the unrounded gain and split info.)
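
The gain ratios can be recomputed from per-branch class counts. In this sketch the [yes, no] counts for each branch are tallied from the 14 weather examples, and the function names are illustrative rather than taken from any library.

```python
import math


def entropy_of_counts(counts):
    """Entropy in bits of a distribution given as a list of counts."""
    n = sum(counts)
    return sum(-(c / n) * math.log2(c / n) for c in counts if c > 0)


def gain_ratio(before, branches):
    """Information gain divided by the split info of the branch sizes."""
    total = sum(before)
    after = sum(sum(b) / total * entropy_of_counts(b) for b in branches)
    gain = entropy_of_counts(before) - after
    split_info = entropy_of_counts([sum(b) for b in branches])
    return gain / split_info


# [yes, no] counts per branch, tallied from the weather data
weather = {
    "Outlook": [[2, 3], [4, 0], [3, 2]],
    "Temperature": [[2, 2], [4, 2], [3, 1]],
    "Humidity": [[3, 4], [6, 1]],
    "Windy": [[6, 2], [3, 3]],
    "ID code": [[1, 0]] * 9 + [[0, 1]] * 5,
}
ratios = {name: round(gain_ratio([9, 5], b), 3) for name, b in weather.items()}
print(ratios)
# {'Outlook': 0.156, 'Temperature': 0.019, 'Humidity': 0.152, 'Windy': 0.049, 'ID code': 0.247}
```

Dividing by the split info shrinks the ID code's advantage, but its ratio is still the largest of the five.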

42 More on Gain Ratio
- Outlook comes out top (among the 4 attributes)
- However, ID code still has a greater gain ratio
- Gain ratio alleviates the bias, but would still select ID code
- Standard fix: an ad hoc test to prevent splitting on that type of attribute

43 Recall Measures of Node Purity
- Entropy (ID3 / C4.5)
  - Information Gain
  - Gain Ratio
- Gini Index (CART)
- Misclassification Error

44 Comparison among Splitting Criteria For a 2-class problem: 43
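
For a 2-class node in which a fraction p of the examples belongs to one class, the three purity measures can be tabulated side by side. The sketch below assumes the standard definitions (entropy in bits, Gini index 1 - sum of squared class fractions, misclassification error 1 - largest class fraction); these formulas are not spelled out in the slides.

```python
import math


def entropy(p):
    """Entropy in bits of a two-class node with class fractions p and 1 - p."""
    return sum(-q * math.log2(q) for q in (p, 1 - p) if q > 0)


def gini(p):
    """Gini index of a two-class node."""
    return 1 - p ** 2 - (1 - p) ** 2


def misclassification_error(p):
    """Fraction of examples not in the majority class."""
    return 1 - max(p, 1 - p)


print("p     entropy  gini   error")
for p in (0.0, 0.1, 0.2, 0.3, 0.4, 0.5):
    print(f"{p:<5} {entropy(p):<8.3f} {gini(p):<6.3f} {misclassification_error(p):.3f}")
```

All three measures are zero for a pure node (p = 0 or p = 1) and reach their maximum at p = 0.5, where entropy is 1 and the other two are 0.5.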

45 Summary So Far
- We know what a decision tree model is.
- We know how to make predictions by using a decision tree model.
- We know how to build a decision tree model (based on different splitting criteria):
  - Information gain
  - Gain ratio

46 Questions
- Is it possible for a decision tree to have 100% accuracy on the training data?
  - If it is possible, in what situations will it happen, and in what situations won't it?
  - If it is not possible, why?
- Is that good for a predictive model?

47 Recall: When to Stop Splitting?
Splitting stops when the data can't be split any further:
- All examples in the same node belong to the same class
- No new attribute can be used to split the data
A tree grown this far overfits the training data (including noise) and won't perform well in predicting new data. Overfitting!

48 Pre-Pruning (Early Stopping)
- Stop the algorithm before it becomes a fully-grown tree.
- More restrictive stopping conditions:
  - Stop if the number of instances is less than some user-specified threshold
  - Stop if expanding the current node does not improve the impurity measure (e.g., Gini or information gain)
  - Other statistical methods (e.g., using the χ² test)
- In practice, pre-pruning is usually not preferred, because it tends to stop too early.

49 Post-Pruning
- Grow the decision tree to its entirety
- Trim the nodes of the decision tree in a bottom-up fashion
- The class label of a leaf node is determined from the majority class of its instances
- Different strategies to prune the tree (not discussed)
- Post-pruning usually has better predictive performance, but is more complex and time-consuming.

50 Summary on Decision Tree
- Other practical issues:
  - Handling numeric values
  - Handling missing values
- Advantages:
  - Simple
  - Inexpensive to construct
  - Extremely fast at classifying unknown records
  - Easy to interpret for small-sized trees
  - Accuracy is comparable to other classification techniques for many simple data sets

51 Demonstration
- Id3:
  - \Weka\contact_lenses
  - Id3 limitations
- J48:
  - \Weka\contact_lenses (vs Id3)
  - \UCI\autos (numeric value, missing value)
  - \UCI\kr-vs-kp (vs Id3)
  - \UCI\splice (vs Id3)

53 Bayesian Classifier
A probabilistic framework for solving classification problems, based on Bayes' theorem.
Recall:
- Conditional probability: $P(C \mid A) = \dfrac{P(A, C)}{P(A)}$; $P(A \mid C) = \dfrac{P(A, C)}{P(C)}$
- Bayes' theorem: $P(C \mid A) = \dfrac{P(A \mid C)\,P(C)}{P(A)}$

54 Example of Bayes' Theorem
Given:
- A doctor knows that meningitis causes stiff neck 50% of the time
- The prior probability of any patient having meningitis is 1/50,000
- The prior probability of any patient having stiff neck is 1/20
If a patient has stiff neck, what's the probability that he/she has meningitis?
$P(M \mid S) = \dfrac{P(S \mid M)\,P(M)}{P(S)} = \dfrac{0.5 \times 1/50000}{1/20} = 0.0002$

55 Bayesian Classifiers (1)
- Given a new example with attribute values $(a_1, a_2, \ldots, a_n)$
- The goal is to predict the class C, e.g., $C = \{c_1, c_2\}$
- Estimate $P(C \mid a_1, a_2, \ldots, a_n)$, e.g., $P(c_1 \mid a_1, a_2, \ldots, a_n) = 0.9$; $P(c_2 \mid a_1, a_2, \ldots, a_n) = 0.1$
- Find the value of C that maximizes $P(C \mid a_1, a_2, \ldots, a_n)$: since $P(c_1 \mid a_1, a_2, \ldots, a_n) > P(c_2 \mid a_1, a_2, \ldots, a_n)$, predict $C = c_1$
- Can we estimate $P(C \mid a_1, a_2, \ldots, a_n)$ from training data?

56 Bayesian Classifiers (2)
Approach:
- Compute the posterior probability $P(C \mid a_1, a_2, \ldots, a_n)$ for all values of C using Bayes' theorem:
  $P(C \mid a_1 a_2 \ldots a_n) = \dfrac{P(a_1 a_2 \ldots a_n \mid C)\,P(C)}{P(a_1 a_2 \ldots a_n)}$
- Choose the value of C that maximizes $P(C \mid a_1, a_2, \ldots, a_n)$
- This is equivalent to choosing the value of C that maximizes $P(a_1, a_2, \ldots, a_n \mid C)\,P(C)$, because the denominator does not depend on C
- How to estimate $P(a_1, a_2, \ldots, a_n \mid C)$ and $P(C)$?

57 Estimating P(C)
Given n training examples and $C = \{c_1, c_2, \ldots, c_j, \ldots\}$:
$p(c_j) = \dfrac{\#\,\text{examples}(C = c_j)}{n}$
Example: C = {yes, no}; P(C = yes) = 9/14 = 0.643; P(C = no) = 5/14 = 0.357
Training data:
age     income  student  credit_rating  Class
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no

58 Estimating $P(a_1, a_2, \ldots, a_n \mid C)$
- Assumption: given the class value, attributes are statistically independent; i.e., if the class is known, knowing the value of one attribute says nothing about the value of another.
- Mathematically: $p(a_1, a_2, \ldots, a_n \mid c_j) = p(a_1 \mid c_j)\,p(a_2 \mid c_j) \cdots p(a_n \mid c_j)$
- Estimating $p(a_i \mid c_j)$ from the given training data: $p(a_i \mid c_j) = \dfrac{\#\,\text{examples}(A_i = a_i, C = c_j)}{\#\,\text{examples}(C = c_j)}$

59 Example: Estimating $P(a_1, a_2, \ldots, a_n \mid C)$
Estimating P(age <= 30, income = medium, student = yes, credit = fair | class = yes) from the training data of slide 57:
- P(age <= 30 | class = yes) = 2/9
- P(income = medium | class = yes) = 4/9
- P(student = yes | class = yes) = 6/9
- P(credit = fair | class = yes) = 6/9
P(age <= 30, income = medium, student = yes, credit = fair | class = yes) = 2/9 × 4/9 × 6/9 × 6/9 = 0.044

60 Example: Naïve Bayes Classifier
New example to be classified: X = (age <= 30, income = medium, student = yes, credit = fair), using the training data of slide 57.
- Estimating P(C): P(class = yes) = 0.643, P(class = no) = 0.357
- Estimating $P(a_1, a_2, \ldots, a_n \mid C)$:
  - P(X | class = yes) = 2/9 × 4/9 × 6/9 × 6/9 = 0.044
  - P(X | class = no) = 3/5 × 2/5 × 1/5 × 2/5 = 0.019
- Calculating $P(C)\,P(a_1, a_2, \ldots, a_n \mid C)$:
  - P(X | class = yes) × P(class = yes) = 0.044 × 0.643 = 0.028
  - P(X | class = no) × P(class = no) = 0.019 × 0.357 = 0.007
Therefore, X belongs to class YES.
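
The same calculation can be scripted by counting matches in the training table. This is a minimal sketch under two assumptions: the function name `naive_bayes_scores` is made up for illustration, and the four rows whose age value is missing in this transcript are given the label "31..40".

```python
from collections import Counter

# Columns: age, income, student, credit_rating, class.
# Assumption: rows with no age value in the transcript belong to a third age group "31..40".
ROWS = [line.split() for line in """
<=30 high no fair no
<=30 high no excellent no
31..40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31..40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31..40 medium no excellent yes
31..40 high yes fair yes
>40 medium no excellent no
""".strip().splitlines()]


def naive_bayes_scores(rows, x):
    """Return P(c) * product of p(a_i | c) for every class c."""
    class_counts = Counter(row[-1] for row in rows)
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / len(rows)  # P(C = c)
        for i, value in enumerate(x):
            matches = sum(1 for row in rows if row[-1] == c and row[i] == value)
            score *= matches / n_c  # p(a_i | c), estimated by counting
        scores[c] = score
    return scores


scores = naive_bayes_scores(ROWS, ["<=30", "medium", "yes", "fair"])
print({c: round(s, 3) for c, s in scores.items()})  # {'no': 0.007, 'yes': 0.028}
print(max(scores, key=scores.get))                  # yes
```

A count of zero for any attribute value makes the whole product zero, which is one reason practical implementations smooth the estimates; that refinement is not shown here.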

61 Exercise: Naïve Bayes Classifier
Training data:
Name           Give Birth  Can Fly  Live in Water  Have Legs  Class
human          yes         no       no             yes        mammals
python         no          no       no             no         non-mammals
salmon         no          no       yes            no         non-mammals
whale          yes         no       yes            no         mammals
frog           no          no       sometimes      yes        non-mammals
komodo         no          no       no             yes        non-mammals
bat            yes         yes      no             yes        mammals
pigeon         no          yes      no             yes        non-mammals
cat            yes         no       no             yes        mammals
leopard shark  yes         no       yes            no         non-mammals
turtle         no          no       sometimes      yes        non-mammals
penguin        no          no       sometimes      yes        non-mammals
porcupine      yes         no       no             yes        mammals
eel            no          no       yes            no         non-mammals
salamander     no          no       sometimes      yes        non-mammals
gila monster   no          no       no             yes        non-mammals
platypus       no          no       no             yes        mammals
owl            no          yes      no             yes        non-mammals
dolphin        yes         no       yes            no         mammals
eagle          no          yes      no             yes        non-mammals
Example to classify:
Give Birth  Can Fly  Live in Water  Have Legs  Class
yes         no       yes            no         ?

62 Exercise: Answer
Using the training data of slide 61, with A = attributes (Give Birth = yes, Can Fly = no, Live in Water = yes, Have Legs = no), M = mammals, N = non-mammals:
$P(M) = 7/20$, $P(N) = 13/20$
$P(A \mid M) = \tfrac{6}{7} \times \tfrac{6}{7} \times \tfrac{2}{7} \times \tfrac{2}{7} = 0.06$
$P(A \mid N) = \tfrac{1}{13} \times \tfrac{10}{13} \times \tfrac{3}{13} \times \tfrac{4}{13} = 0.0042$
$P(A \mid M)\,P(M) = 0.06 \times \tfrac{7}{20} = 0.021$
$P(A \mid N)\,P(N) = 0.0042 \times \tfrac{13}{20} = 0.0027$
$P(A \mid M)\,P(M) > P(A \mid N)\,P(N)$ => Mammals

63 Summary: Naïve Bayes
- It is called naïve because of the strong independence assumption (which makes the estimation much easier).
- The independence assumption is almost never correct, but the scheme works well in practice.
- Extensions:
  - Handling missing values
  - Handling numeric values

64 Demonstration
Naïve Bayes:
- \Weka\contact_lenses (vs Id3)
- \UCI\autos (vs J48)
- \UCI\kr-vs-kp (vs Id3, J48)
- \UCI\splice (vs Id3, J48)

66 Instance-Based Classifiers
- Store the training records (a set of stored cases, each with attributes Atr1 ... AtrN and a Class label)
- Use the training records to predict the class labels of unseen cases (attributes Atr1 ... AtrN only), without building an explicit model; such a classifier is called a "lazy learner"

67 Examples
- Rote-learner: memorizes the entire training data and performs classification only if the new example is identical to one of the training examples
- Nearest neighbor: identifies the k closest points (nearest neighbors) and performs classification based on them (majority vote, etc.)

68 Intuition
If it looks like a duck, walks like a duck, and quacks like a duck, then it's probably a duck.
- Compute the distance from the test record to the training records
- Choose the k nearest records

69 Basic Process
- Building model: no explicit model
- Set-up:
  - Distance metric (to identify the nearest neighbors)
  - The value of k
- Making a prediction for an unknown record:
  - Compute the distance to the training examples
  - Identify the k nearest neighbors
  - Take a majority vote to determine the class label
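
The three prediction steps map onto a few lines of Python. The sketch below assumes numeric features and Euclidean distance; the function name `knn_predict` and the toy training points are hypothetical and only serve to make the example runnable.

```python
import math
from collections import Counter


def knn_predict(training, query, k):
    """Return the majority label among the k training examples nearest to query.

    training: list of (feature_vector, label) pairs with numeric features.
    """
    # 1. Compute the distance to every training example and sort by it
    by_distance = sorted(training, key=lambda item: math.dist(item[0], query))
    # 2. Keep the k nearest neighbors; 3. take a majority vote on their labels
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups
training = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((3.0, 3.2), "B"), ((3.1, 2.9), "B"), ((2.9, 3.0), "B"),
]
print(knn_predict(training, (1.1, 1.0), k=3))  # A
print(knn_predict(training, (3.0, 3.0), k=3))  # B
```

There is no training step: all the work happens at prediction time, when every stored example is compared with the query.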

70 Distance Metric
- Most instance-based algorithms use Euclidean distance. For two instances $a^{(1)}$ and $a^{(2)}$ with n attributes:
  $\sqrt{(a_1^{(1)} - a_1^{(2)})^2 + (a_2^{(1)} - a_2^{(2)})^2 + \cdots + (a_n^{(1)} - a_n^{(2)})^2}$
- Example: $a^{(1)} = [2, 9, 7, \ldots]$; $a^{(2)} = [3, 2, 6, \ldots]$
  $\mathrm{Distance}(a^{(1)}, a^{(2)}) = \sqrt{(2 - 3)^2 + (9 - 2)^2 + (7 - 6)^2 + \cdots}$
- Other popular metric: Manhattan metric
- Distance for nominal features: 1 if the values are different, 0 if they are equal
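
The metrics mentioned here can be written out as follows. This is an illustrative sketch: the function names are chosen for this example, and only the three components listed in the example vectors are used.

```python
import math


def euclidean(a, b):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def nominal(x, y):
    """Distance for a nominal feature: 0 if the values are equal, 1 otherwise."""
    return 0 if x == y else 1


# Assumption: only the first three attributes of the example are considered
a1, a2 = [2, 9, 7], [3, 2, 6]
print(round(euclidean(a1, a2), 3))  # 7.141, i.e. sqrt(1 + 49 + 1)
print(manhattan(a1, a2))            # 9
print(nominal("sunny", "rain"))     # 1
```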

71 Distance Metric: Normalization
- Example: 3 features (height, weight, income); x1 = [1.6 (m), 110 (lb), 30,000 ($)]; x2 = [1.9 (m), 300 (lb), 35,000 ($)]; Distance(x1, x2) = ?
- Normalization: attributes may have to be scaled (normalized) to prevent distance measures from being dominated by one of the attributes:
  $a_i = \dfrac{v_i - \min v_i}{\max v_i - \min v_i}$ ($v_i$: the actual value of attribute i)
- E.g., suppose height ranges over [1.5, 2]:
  - Height(x1) = (1.6 - 1.5) / (2 - 1.5) = 0.2
  - Height(x2) = (1.9 - 1.5) / (2 - 1.5) = 0.8
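
A short sketch shows both the problem and the fix. The helper name `min_max` is illustrative; only the height range [1.5, 2] is taken from the slide, so weight and income are left unscaled here rather than guessing their ranges.

```python
import math


def min_max(value, lowest, highest):
    """Scale value into [0, 1] using the attribute's minimum and maximum."""
    return (value - lowest) / (highest - lowest)


x1 = [1.6, 110, 30_000]  # height (m), weight (lb), income ($)
x2 = [1.9, 300, 35_000]

# Without normalization the income difference (5,000) swamps the other attributes
raw_distance = math.dist(x1, x2)
print(raw_distance > 5000)

# Height scaled with the range [1.5, 2] given on the slide
print(round(min_max(1.6, 1.5, 2.0), 2))  # 0.2
print(round(min_max(1.9, 1.5, 2.0), 2))  # 0.8
```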

72 Choosing K
- Figure: (a) the 1-nearest neighbor, (b) the 2-nearest neighbors and (c) the 3-nearest neighbors of a test point X
- K is usually chosen from odd numbers (1, 3, 5, 7, 9, …), so that ties don't have to be broken (with two classes)
- K is usually set manually, according to the performance of the algorithm on a separate validation set

73 Voronoi Diagram of 1NN 72

74 Summary
- K-NN classifiers are lazy learners
  - They do not build models explicitly
  - Unlike eager learners such as decision trees
- Lazy learners vs eager learners
  - Lazy learners (such as KNN): no explicit model; no training; slow testing/predicting
  - Eager learners (such as decision trees): explicit model; (relatively) slow training; fast testing/predicting

75 Demonstration
IB1, IBk:
- \Weka\contact_lenses (vs J48)
- \UCI\autos (vs J48)
- \UCI\breast-w (vs J48)