Information Retrieval and Data Mining. Summer Semester 2015 TU Kaiserslautern

Transcription

1 Information Retrieval and Data Mining, Summer Semester 2015, TU Kaiserslautern. Prof. Dr.-Ing. Sebastian Michel, Databases and Information Systems Group (AG DBIS)

2 Chapter VI: Classification
1. Motivation and Definitions
2. Decision Trees
3. Bayes Classifier
4. Support Vector Machines (only as teaser)
Tan, Steinbach & Kumar, Chapter 8

3 1. Classification: Example
Classifier:
age?
  youth -> student?
    no -> no
    yes -> yes
  middle_aged -> yes
  senior -> credit_rating?
    fair -> no
    excellent -> yes
A decision tree for the concept buys_computer, indicating whether a customer at an electronics shop is likely to purchase a computer. Source: Han & Kamber

4 Classification: Definition
Given a collection of records (training set), where each record contains a set of attributes and one of the attributes is the class:
- Find a model for the class attribute as a function of the values of the other attributes.
- Goal: previously unseen records should be assigned a class as accurately as possible.
- A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

5 Examples of Classification Tasks
- Predicting tumor cells as benign or malignant
- Classifying credit card transactions as legitimate or fraudulent
- Categorizing news stories as finance, weather, entertainment, sports, etc.
- Classifying persons into tax evaders and tax payers

6 Illustrating Classification Task
Training set (induction: a learning algorithm learns a model from it):
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes
Test set (deduction: the learned model is applied to it):
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

7 Classification Model Evaluation
Much the same measures as with IR methods. Focus on accuracy and error rate.
                         Predicted class
                         Class = 1   Class = 0
Actual class  Class = 1  f11         f10
              Class = 0  f01         f00
But also precision, recall, F-scores, ...
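
The measures named on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming that f_ij counts the records of actual class i that were predicted as class j (the usual reading of such a matrix). The function names and the example counts are made up for illustration and do not come from the lecture.

```python
def accuracy(f11, f10, f01, f00):
    """Fraction of records whose predicted class equals the actual class."""
    return (f11 + f00) / (f11 + f10 + f01 + f00)


def error_rate(f11, f10, f01, f00):
    """Fraction of misclassified records; always 1 - accuracy."""
    return (f10 + f01) / (f11 + f10 + f01 + f00)


def precision(f11, f01):
    """Of all records predicted as class 1, the fraction that really is class 1."""
    return f11 / (f11 + f01)


def recall(f11, f10):
    """Of all records that really are class 1, the fraction predicted as class 1."""
    return f11 / (f11 + f10)


# Hypothetical counts, chosen only to exercise the functions.
f11, f10, f01, f00 = 40, 10, 5, 45
print(accuracy(f11, f10, f01, f00), error_rate(f11, f10, f01, f00))
print(precision(f11, f01), recall(f11, f10))
```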

8 Overview: Classification Techniques
- Decision-tree-based methods
- Rule-based methods
- Naïve Bayes
- Support Vector Machines

9 Example of a Decision Tree
Training data:
Tid  Refund  Marital Status  Taxable Income  Class
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Model: decision tree (splitting attributes: Refund, MarSt, TaxInc):
Refund?
  Yes -> NO
  No -> MarSt?
    Single, Divorced -> TaxInc?
      < 80K -> NO
      > 80K -> YES
    Married -> NO

10 2. Decision Trees
Training data: the same ten tax records (Tid 1-10: Refund, Marital Status, Taxable Income, class) as on the previous slide.
Another tree for this data:
MarSt?
  Married -> NO
  Single, Divorced -> Refund?
    Yes -> NO
    No -> TaxInc?
      < 80K -> NO
      > 80K -> YES
There could be more than one tree that fits the same data!

11 Decision Tree Classification Task
Same setting as the classification task illustrated on slide 6 (training set Tid 1-10, test set Tid 11-15), now with a tree induction algorithm as the learning algorithm and a decision tree as the model.
Induction: Training Set -> Tree Induction algorithm -> Learn Model -> Model (Decision Tree)
Deduction: Test Set -> Apply Model

12 Apply Model to Test Data
Test data: Refund = No, Marital Status = Married, Taxable Income = 80K, class = ?
Start from the root of the tree:
Refund?
  Yes -> NO
  No -> MarSt?
    Single, Divorced -> TaxInc?
      < 80K -> NO
      > 80K -> YES
    Married -> NO

17 Apply Model to Test Data
Test data: Refund = No, Marital Status = Married, Taxable Income = 80K, class = ?
Path through the tree: Refund = No -> MarSt = Married -> leaf NO
Assign class No to the test record.

18 Classifying a Record with a Decision Tree
Given a decision tree, how do we classify a test record?
- Start at the root node, apply its test condition to the record, and follow the appropriate branch.
- If this leads to an internal node, again apply the test condition and follow the branch.
- Otherwise, if at a leaf node, assign the class of the leaf node to the record.
- Repeat until a leaf node is reached.
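
The procedure above can be sketched as a short loop. This is a minimal illustration using the Refund / MarSt / TaxInc tree from the earlier slides. The nested-dict representation, the names `TREE` and `classify`, and the choice to send an income of exactly 80K down the upper branch are assumptions made here, not part of the lecture.

```python
# Internal nodes are dicts with a test function and one child per outcome;
# leaves are plain class labels ("Yes" / "No").
TREE = {
    "test": lambda r: r["Refund"],
    "children": {
        "Yes": "No",
        "No": {
            "test": lambda r: "Married" if r["Marital Status"] == "Married"
                              else "Single, Divorced",
            "children": {
                "Married": "No",
                "Single, Divorced": {
                    # Assumption: exactly 80K goes to the ">= 80K" branch.
                    "test": lambda r: "< 80K" if r["Taxable Income"] < 80 else ">= 80K",
                    "children": {"< 80K": "No", ">= 80K": "Yes"},
                },
            },
        },
    },
}


def classify(node, record):
    """Follow test outcomes from the root until a leaf (class label) is reached."""
    while isinstance(node, dict):
        outcome = node["test"](record)
        node = node["children"][outcome]
    return node


# The test record from the slides: Refund = No, Married, 80K.
record = {"Refund": "No", "Marital Status": "Married", "Taxable Income": 80}
print(classify(TREE, record))
```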

19 Decision Tree Classification Task
Same setting as on slide 11 (training set Tid 1-10, test set Tid 11-15).
Induction: Training Set -> Tree Induction algorithm -> Learn Model -> Model (Decision Tree)
Deduction: Test Set -> Apply Model

20 Constructing a Decision Tree
- There are exponentially many decision trees for the training data.
- Finding the optimal tree is computationally infeasible.
- Instead, use greedy algorithms: a series of local split operations grows the tree.
- Not optimal, but there are efficient algorithms that create sufficiently accurate trees.

21 General Structure of Hunt's Algorithm
Let D_t be the set of training records that reach a node t.
General procedure:
- If D_t contains only records that belong to the same class y_t, then t is a leaf node labeled as y_t.
- If D_t is an empty set, then t is a leaf node labeled by the default class y_d.
- If D_t contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.
Training data: the ten tax records (Tid 1-10: Refund, Marital Status, Taxable Income, class) from the decision tree example; at the root, D_t is the whole training set.
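
The general procedure can be sketched recursively. Hunt's algorithm itself does not fix how the attribute test is chosen, so the code below makes several assumptions that are not from the lecture: Taxable Income is pre-binned at 80K so that all attributes are categorical, every split is multi-way, and the attribute with the lowest weighted Gini index is picked. All names (`hunt`, `weighted_gini`, `DOMAIN`, ...) are illustrative.

```python
from collections import Counter

ROWS = [  # (Refund, Marital Status, Taxable Income in K, class) from the slides
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]
# Assumption: income is binned at 80K so every attribute is categorical.
RECORDS = [{"Refund": r, "Marital Status": m,
            "Income": "<80K" if inc < 80 else ">=80K", "class": c}
           for r, m, inc, c in ROWS]
ATTRS = ["Refund", "Marital Status", "Income"]
DOMAIN = {a: sorted({rec[a] for rec in RECORDS}) for a in ATTRS}


def majority(records):
    return Counter(r["class"] for r in records).most_common(1)[0][0]


def gini(records):
    counts = Counter(r["class"] for r in records)
    return 1.0 - sum((c / len(records)) ** 2 for c in counts.values())


def weighted_gini(records, attr):
    """Weighted average Gini index of the non-empty children of a split on attr."""
    total = 0.0
    for value in DOMAIN[attr]:
        subset = [r for r in records if r[attr] == value]
        if subset:
            total += len(subset) / len(records) * gini(subset)
    return total


def hunt(records, attrs, default):
    if not records:                      # D_t is empty -> default class
        return default
    classes = {r["class"] for r in records}
    if len(classes) == 1:                # D_t is pure -> leaf with that class
        return classes.pop()
    if not attrs:                        # no test left -> majority class
        return majority(records)
    best = min(attrs, key=lambda a: weighted_gini(records, a))
    rest = [a for a in attrs if a != best]
    return {"attr": best,
            "children": {v: hunt([r for r in records if r[best] == v],
                                 rest, majority(records))
                         for v in DOMAIN[best]}}


def classify(tree, record):
    while isinstance(tree, dict):
        tree = tree["children"][record[tree["attr"]]]
    return tree


tree = hunt(RECORDS, ATTRS, majority(RECORDS))
print(tree["attr"])
```

With this particular selection rule the root test comes out as Marital Status rather than Refund, which matches the earlier remark that more than one tree can fit the same data.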

22 Hunt's Algorithm
Start with the most frequent class as the default class: a single leaf labeled Don't (class No, which 7 of the 10 training records have).

23 Hunt's Algorithm (2)
The single leaf Don't is replaced by a split on Refund:
Refund?
  Yes -> Don't
  No -> Don't

24 Hunt's Algorithm (3)
The Refund = No branch is split on Marital Status:
Refund?
  Yes -> Don't
  No -> Marital Status?
    Single, Divorced -> (leaf label not preserved in the transcript)
    Married -> Don't

25 Hunt's Algorithm (4)
The Single, Divorced branch is split on Taxable Income:
Refund?
  Yes -> Don't
  No -> Marital Status?
    Single, Divorced -> Taxable Income?
      < 80K -> Don't
      >= 80K -> (leaf label not preserved in the transcript)
    Married -> Don't

26 Tree Induction
Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
Issues:
- Determine how to split the records
  - How to specify the attribute test condition?
  - How to determine the best split?
- Determine when to stop splitting

27 How to Specify Test Condition?
- Depends on attribute types: nominal, ordinal, continuous
- Depends on number of ways to split: 2-way split, multi-way split

28 Splitting Based on Nominal Attributes
- Multi-way split: use as many partitions as distinct values.
  CarType -> Family | Sports | Luxury
- Binary split: divides values into two subsets. Need to find the optimal partitioning.
  CarType -> {Sports, Luxury} | {Family}   OR   CarType -> {Family, Luxury} | {Sports}

29 Splitting Based on Continuous Attributes
Different ways of handling continuous attributes:
- Discretization to form an ordinal categorical attribute
  - Static: discretize once at the beginning
  - Dynamic: ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering
- Binary decision: (A < v) or (A >= v)
  - Consider all possible splits and find the best cut
  - Can be more compute-intensive

30 Splitting Based on Continuous Attributes
(i) Binary split: Taxable Income > 80K? -> Yes | No
(ii) Multi-way split: Taxable Income? -> < 10K | [10K, 25K) | [25K, 50K) | [50K, 80K) | > 80K

32 How to Determine the Best Split
Before splitting: 10 records of class 0 (C0), 10 records of class 1 (C1)
Own Car?
  Yes -> C0: 6, C1: 4
  No -> C0: 4, C1: 6
Car Type?
  Family -> C0: 1, C1: 3
  Sports -> C0: 8, C1: 0
  Luxury -> C0: 1, C1: 7
Student ID?
  c1 -> C0: 1, C1: 0 ... c10 -> C0: 1, C1: 0
  c11 -> C0: 0, C1: 1 ... c20 -> C0: 0, C1: 1
Which test condition is the best?

33 How to Determine the Best Split
Greedy approach: nodes with a homogeneous class distribution are preferred.
Need a measure of node impurity:
- C0: 5, C1: 5 is non-homogeneous, high degree of impurity
- C0: 9, C1: 1 is homogeneous, low degree of impurity

34 Selecting the Best Split
Let p(i|t) be the fraction of records belonging to class i at node t.
The best split is selected based on the degree of impurity of the child nodes:
- A node with p(0|t) = 0 and p(1|t) = 1 has high purity.
- A node with p(0|t) = 1/2 and p(1|t) = 1/2 has the lowest purity (highest impurity).
Intuition: high purity => small value of the impurity measure => better split

35 Example of Purity
[Figure: one example with high impurity, one with high purity]

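36 Impurity Measures
The two measures used in the examples on the following slides, for a node t with class fractions p(i|t):
$\text{Entropy}(t) = -\sum_i p(i \mid t) \log_2 p(i \mid t)$, with $0 \log_2 0 = 0$
$\text{Gini}(t) = 1 - \sum_i p(i \mid t)^2$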

37 Examples for Computing Entropy
C1: 0, C2: 6. P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
  Entropy = -0 log_2 0 - 1 log_2 1 = -0 - 0 = 0
C1: 1, C2: 5. P(C1) = 1/6, P(C2) = 5/6
  Entropy = -(1/6) log_2 (1/6) - (5/6) log_2 (5/6) = 0.65
C1: 2, C2: 4. P(C1) = 2/6, P(C2) = 4/6
  Entropy = -(2/6) log_2 (2/6) - (4/6) log_2 (4/6) = 0.92

38 Examples for Computing Gini
C1: 0, C2: 6. P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
  Gini = 1 - P(C1)^2 - P(C2)^2 = 1 - 0 - 1 = 0
C1: 1, C2: 5. P(C1) = 1/6, P(C2) = 5/6
  Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278
C1: 2, C2: 4. P(C1) = 2/6, P(C2) = 4/6
  Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444
Information Retrieval and Data Mining, SoSe 2015, S. Michel
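
The entropy and Gini examples can be checked with a small sketch. It assumes a node is given by its per-class record counts, and it uses the convention 0 log 0 = 0 by skipping empty classes. The function names are illustrative.

```python
from math import log2


def entropy(counts):
    """Entropy of a node given per-class record counts (0 log 0 treated as 0)."""
    n = sum(counts)
    # p * log2(1/p) is the same as -p * log2(p) and avoids a negative zero.
    return sum((c / n) * log2(n / c) for c in counts if c > 0)


def gini(counts):
    """Gini index of a node given per-class record counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


# The three nodes from the slides: (C1, C2) record counts.
for counts in [(0, 6), (1, 5), (2, 4)]:
    print(counts, round(entropy(counts), 3), round(gini(counts), 3))
```

Both measures are 0 for a pure node and reach their maximum for a uniform class distribution, which for two classes is 1 for entropy and 0.5 for Gini.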

39 Comparing Conditions
The quality of a split is the change in impurity, called the gain of the test condition:
$\Delta = I(p) - \sum_{j=1}^{k} \frac{N(v_j)}{N} I(v_j)$
- I(.) is the impurity measure
- k is the number of attribute values
- p is the parent node, v_j is the j-th child node
- N is the total number of records at the parent node
- N(v_j) is the number of records associated with the child node v_j
Maximizing the gain is equivalent to minimizing the weighted average impurity of the child nodes.
If I(.) = Entropy(.), then Δ = Δ_info is called information gain.
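
The gain of a test condition can be sketched as the parent impurity minus the weighted average impurity of the children. The example below applies it, with the Gini index as impurity measure, to the three candidate splits from the "How to determine the best split" slide (Own Car, Car Type, Student ID). The function names and the count-tuple representation are assumptions made for illustration.

```python
def gini(counts):
    """Gini index of a node given per-class record counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


def gain(parent, children, impurity=gini):
    """Impurity of the parent minus the weighted average impurity of the children."""
    n = sum(parent)
    weighted = sum(sum(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted


parent = (10, 10)                                   # (C0, C1) before splitting
own_car = [(6, 4), (4, 6)]                          # Yes, No
car_type = [(1, 3), (8, 0), (1, 7)]                 # Family, Sports, Luxury
student_id = [(1, 0)] * 10 + [(0, 1)] * 10          # one record per child

for name, split in [("Own Car", own_car), ("Car Type", car_type),
                    ("Student ID", student_id)]:
    print(name, round(gain(parent, split), 4))
```

Student ID obtains the largest gain even though a split into 20 single-record partitions is useless for prediction.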

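40 How to Find the Best Split
Before splitting: class counts C0: N00, C1: N01, impurity M0
Split on A?
  Yes -> Node N1 (C0: N10, C1: N11), impurity M1
  No -> Node N2 (C0: N20, C1: N21), impurity M2
  Combined impurity of N1 and N2: M12
Split on B?
  Yes -> Node N3 (C0: N30, C1: N31), impurity M3
  No -> Node N4 (C0: N40, C1: N41), impurity M4
  Combined impurity of N3 and N4: M34
Gain = M0 - M12 vs. M0 - M34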

41 Problems of Maximizing Δ
[Figure: only the label "Higher purity" is preserved]

42 Problems of Maximizing Δ
- Impurity measures favor attributes with a large number of values.
- A test condition with a large number of outcomes might not be desirable: the number of records in each partition is too small to make predictions.
Solution 1: gain ratio = Δ_info / SplitInfo, with
$\text{SplitInfo} = -\sum_{i=1}^{k} P(v_i) \log_2 P(v_i)$
where P(v_i) is the fraction of records at child v_i and k is the total number of splits.
Solution 2: restrict the splits to binary.
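
The effect of the gain ratio can be sketched on the Car Type and Student ID splits from the earlier slide. SplitInfo is taken here as the entropy of the partition sizes, which is the standard definition. The function names and the count-tuple representation are assumptions for illustration.

```python
from math import log2


def entropy(counts):
    """Entropy from per-class counts (0 log 0 treated as 0)."""
    n = sum(counts)
    return sum((c / n) * log2(n / c) for c in counts if c > 0)


def info_gain(parent, children):
    """Entropy of the parent minus the weighted average entropy of the children."""
    n = sum(parent)
    return entropy(parent) - sum(sum(ch) / n * entropy(ch) for ch in children)


def split_info(children):
    """Entropy of the partition sizes: -sum P(v_i) log2 P(v_i)."""
    return entropy([sum(ch) for ch in children])


def gain_ratio(parent, children):
    return info_gain(parent, children) / split_info(children)


parent = (10, 10)
car_type = [(1, 3), (8, 0), (1, 7)]
student_id = [(1, 0)] * 10 + [(0, 1)] * 10

for name, split in [("Car Type", car_type), ("Student ID", student_id)]:
    print(name, round(info_gain(parent, split), 3), round(gain_ratio(parent, split), 3))
```

Student ID has the larger information gain, but its large SplitInfo (log2 20) pushes its gain ratio below that of Car Type.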

44 Stopping Criteria for Tree Induction
- Stop expanding a node when all the records belong to the same class.
- Stop expanding a node when all the records have the same or similar attribute values. In this case, the majority class wins.

45 Overfitting and Tree Pruning
A common problem with decision trees is that the tree might be too tightly tailored to the training data (and thus possibly to noise in the data).
- Good: the error on the training data might be very low.
- But what about previously unseen test data?
Idea: avoid the tree becoming too fine-grained.
- Solution 1: stop splitting nodes early (i.e., preprocessing)
- Solution 2: build the tree regularly and then prune parts of it (i.e., postprocessing)

46 Example: Training Data
Example of overfitting due to noisy training data. (*: record with wrong class)

47 Example: Two Different Decision Trees

48 Example: Test Data
Let's see how the trees M1 and M2 perform on test and training data.
- M1: 0% error on training data, but 30% error on test data (errors marked with 1)
- M2: 20% error on training data, but 10% error on test data (errors marked with 2)
Table source: Tan, Steinbach, Kumar

49 2. (Naive) Bayes Classifier
A probabilistic framework for solving classification problems.
Conditional probability:
$P(C \mid A) = \frac{P(A, C)}{P(A)}$, $P(A \mid C) = \frac{P(A, C)}{P(C)}$
Bayes theorem:
$P(C \mid A) = \frac{P(A \mid C)\, P(C)}{P(A)}$
Source: Tan, Steinbach, Kumar

50 Example of Bayes Theorem
Given:
- A doctor knows that meningitis causes stiff neck 50% of the time
- Prior probability of any patient having meningitis is 1/50,000
- Prior probability of any patient having stiff neck is 1/20
If a patient has stiff neck, what's the probability he/she has meningitis?
$P(M \mid S) = \frac{P(S \mid M)\, P(M)}{P(S)} = \frac{0.5 \times 1/50000}{1/20} = 0.0002$
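
The meningitis example can be reproduced with exact fractions. This is a small sketch; the function name `posterior` and its parameter names are chosen here for illustration.

```python
from fractions import Fraction


def posterior(likelihood, prior, evidence):
    """Bayes theorem: P(C | A) = P(A | C) * P(C) / P(A)."""
    return likelihood * prior / evidence


p_s_given_m = Fraction(1, 2)       # P(S | M): stiff neck given meningitis
p_m = Fraction(1, 50000)           # P(M): prior for meningitis
p_s = Fraction(1, 20)              # P(S): prior for stiff neck

p_m_given_s = posterior(p_s_given_m, p_m, p_s)
print(p_m_given_s, float(p_m_given_s))
```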

51 Bayesian Classifiers
Consider each attribute and the class label as random variables.
Given a record with attributes $(A_1, A_2, \ldots, A_n)$:
- The goal is to predict class C.
- Specifically, we want to find the value of C that maximizes $P(C \mid A_1, A_2, \ldots, A_n)$.
Can we estimate $P(C \mid A_1, A_2, \ldots, A_n)$ directly from data?

52 Bayesian Classifiers
Approach: compute the posterior probability $P(C \mid A_1, A_2, \ldots, A_n)$ for all values of C using the Bayes theorem:
$P(C \mid A_1 A_2 \ldots A_n) = \frac{P(A_1 A_2 \ldots A_n \mid C)\, P(C)}{P(A_1 A_2 \ldots A_n)}$
Choose the value of C that maximizes $P(C \mid A_1, A_2, \ldots, A_n)$.
This is equivalent to choosing the value of C that maximizes $P(A_1, A_2, \ldots, A_n \mid C)\, P(C)$.
How to estimate $P(A_1, A_2, \ldots, A_n \mid C)$?

53 Naïve Bayes Classifier
Assume independence among the attributes $A_i$ when the class is given:
$P(A_1, A_2, \ldots, A_n \mid C_j) = P(A_1 \mid C_j)\, P(A_2 \mid C_j) \cdots P(A_n \mid C_j)$
Can estimate $P(A_i \mid C_j)$ for all $A_i$ and $C_j$.
A new point is classified to $C_j$ if $P(C_j) \prod_i P(A_i \mid C_j)$ is maximal.

54 How to Estimate Probabilities from Data?
Training data: the ten tax records (Tid 1-10) with Refund (categorical), Marital Status (categorical), Taxable Income (continuous), and Evade (class).
Class: $P(C) = N_c / N$, e.g., P(No) = 7/10, P(Yes) = 3/10
For discrete attributes: $P(A_i \mid C_k) = |A_{ik}| / N_{c_k}$, where $|A_{ik}|$ is the number of instances that have attribute value $A_i$ and belong to class $C_k$
Examples: P(Status=Married | No) = 4/7, P(Refund=Yes | Yes) = 0

55 How to Estimate Probabilities from Data?
For continuous attributes:
- Discretize the range into bins
  - One ordinal attribute per bin
  - Violates the independence assumption
- Two-way split: (A < v) or (A > v)
  - Choose only one of the two splits as the new attribute
- Probability density estimation
  - Assume the attribute follows a normal distribution
  - Use the data to estimate the parameters of the distribution (e.g., mean and standard deviation)
  - Once the probability distribution is known, it can be used to estimate the conditional probability $P(A_i \mid c_k)$

56 How to Estimate Probabilities from Data?
Training data: the ten tax records (Tid 1-10) with Refund (categorical), Marital Status (categorical), Taxable Income (continuous), and Evade (class).
Normal distribution:
$P(A_i \mid c_j) = \frac{1}{\sqrt{2\pi\sigma_{ij}^2}} \exp\left(-\frac{(A_i - \mu_{ij})^2}{2\sigma_{ij}^2}\right)$
One distribution for each $(A_i, c_j)$ pair.
For (Income, Class=No): sample mean = 110, sample variance = 2975
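
The estimate for (Income, Class=No) can be sketched as follows: compute the sample mean and sample variance of the incomes of the class No records, then plug them into the normal density. The incomes are read off the training table (in thousands); the function name `gaussian_density` is an illustrative choice. Note that `statistics.variance` is the sample variance (division by n - 1).

```python
from math import exp, pi, sqrt
from statistics import mean, variance


def gaussian_density(x, mu, var):
    """Density of a normal distribution with mean mu and variance var at x."""
    return exp(-(x - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)


# Taxable incomes (in K) of the seven training records with class No.
income_no = [125, 100, 70, 120, 60, 220, 75]

mu = mean(income_no)           # sample mean
var = variance(income_no)      # sample variance (n - 1 in the denominator)
density = gaussian_density(120, mu, var)
print(mu, var, round(density, 4))
```

The resulting value is a density, not a probability; naive Bayes only uses it to compare classes, so the missing interval width cancels out.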

57 Example of Naïve Bayes Classifier
Given a test record: X = (Refund=No, Married, Income=120K)
Naive Bayes classifier:
P(Refund=Yes | No) = 3/7
P(Refund=No | No) = 4/7
P(Refund=Yes | Yes) = 0
P(Refund=No | Yes) = 1
P(Marital Status=Single | No) = 2/7
P(Marital Status=Divorced | No) = 1/7
P(Marital Status=Married | No) = 4/7
P(Marital Status=Single | Yes) = 2/3
P(Marital Status=Divorced | Yes) = 1/3
P(Marital Status=Married | Yes) = 0
For taxable income:
If class=No: sample mean = 110, sample variance = 2975
If class=Yes: sample mean = 90, sample variance = 25
P(X | Class=No) = P(Refund=No | Class=No) × P(Married | Class=No) × P(Income=120K | Class=No) = 4/7 × 4/7 × 0.0072 ≈ 0.00235
P(X | Class=Yes) = P(Refund=No | Class=Yes) × P(Married | Class=Yes) × P(Income=120K | Class=Yes) = 1 × 0 × P(Income=120K | Class=Yes) = 0
Since P(X | No) P(No) > P(X | Yes) P(Yes), we have P(No | X) > P(Yes | X) => Class = No
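
The whole example can be sketched end to end: priors and categorical conditionals are relative frequencies, the income conditional is a normal density with per-class sample mean and variance. This is a minimal illustration on the ten training records, without smoothing of zero counts. The names `DATA`, `score`, and `gaussian_density` are chosen here and are not from the lecture.

```python
from math import exp, pi, sqrt
from statistics import mean, variance

DATA = [  # (Refund, Marital Status, Taxable Income in K, Evade)
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]


def gaussian_density(x, mu, var):
    return exp(-(x - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)


def score(refund, status, income, label):
    """P(label) * P(refund | label) * P(status | label) * density(income | label)."""
    rows = [r for r in DATA if r[3] == label]
    prior = len(rows) / len(DATA)
    p_refund = sum(r[0] == refund for r in rows) / len(rows)
    p_status = sum(r[1] == status for r in rows) / len(rows)
    incomes = [r[2] for r in rows]
    p_income = gaussian_density(income, mean(incomes), variance(incomes))
    return prior * p_refund * p_status * p_income


# Test record X = (Refund=No, Married, Income=120K)
scores = {label: score("No", "Married", 120, label) for label in ("No", "Yes")}
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

The score for class Yes is exactly 0 because no training record of class Yes is married; a single zero factor wipes out the whole product.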

58 3. Support Vector Machines
Idea: find a linear hyperplane (decision boundary) that will separate the data.

59 Support Vector Machines
One possible solution: B_1

60 Support Vector Machines
Another possible solution: B_2

61 Support Vector Machines
Other possible solutions

62 Support Vector Machines
Which one is better, B_1 or B_2? How do you define better?

63 Support Vector Machines
[Figure: hyperplanes B_1 and B_2 with their margin boundaries b_11, b_12 and b_21, b_22]
Find the hyperplane that maximizes the margin => B_1 is better than B_2

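64 Support Vector Machines
Decision boundary B_1: $\vec{w} \cdot \vec{x} + b = 0$
Margin boundaries b_11 and b_12: $\vec{w} \cdot \vec{x} + b = 1$ and $\vec{w} \cdot \vec{x} + b = -1$
$f(\vec{x}) = \begin{cases} 1 & \text{if } \vec{w} \cdot \vec{x} + b \ge 1 \\ -1 & \text{if } \vec{w} \cdot \vec{x} + b \le -1 \end{cases}$
$\text{Margin} = \frac{2}{\lVert \vec{w} \rVert}$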
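
The decision function and the margin width can be sketched for a given hyperplane. The margin is the distance between the two parallel hyperplanes w·x + b = 1 and w·x + b = -1, which is 2 / ||w||. The weight vector and offset below are hypothetical values, not a trained model, and classifying points inside the margin by the sign of w·x + b is an assumption made here.

```python
from math import sqrt


def decision_value(w, b, x):
    """Value of w . x + b for a point x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """+1 on or above the decision boundary, -1 below it (sign rule, an assumption)."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def margin(w):
    """Distance between the hyperplanes w . x + b = 1 and w . x + b = -1."""
    return 2 / sqrt(sum(wi * wi for wi in w))


# Hypothetical hyperplane parameters, chosen only for illustration.
w, b = (3.0, 4.0), -5.0
print(margin(w))
print(classify(w, b, (3.0, 4.0)), classify(w, b, (0.0, 0.0)))
```

Maximizing the margin therefore amounts to minimizing ||w|| subject to all training points lying on the correct side of their margin boundary.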

65 Summary Data Mining
- Frequent Itemset and Association Rule Mining: Apriori Principle and Algorithm
- Clustering: K-means, Hierarchical clustering, DBSCAN (density-based clustering)
- Classification: Decision trees, Naïve Bayes Classifier, Support Vector Machines (SVMs)