Data Mining: Classification
Jingpeng Li

What is Classification?
Assigning an object to a certain class based on its similarity to previous examples of other objects. It can be done with reference to the original data or based on a model of that data.
E.g. Me: "It's round, green, and edible." You: "It's an apple!"
Usual Examples
Classifying transactions as genuine or fraudulent, e.g. credit card usage, insurance claims, cell phone calls. Classifying prospects as good or bad customers. Classifying engine faults by their symptoms.

Certainty
As with most data mining solutions, a classification usually comes with a degree of certainty. It might be the probability of the object belonging to the class, or it might be some other measure of how closely the object resembles other examples from that class.
Techniques
Non-parametric, e.g. K nearest neighbour. Mathematical models, e.g. neural networks. Rule-based models, e.g. decision trees.

Predictive / Definitive
Classification may indicate a propensity to act in a certain way, e.g. a prospect is likely to become a customer. This is predictive. Classification may also indicate similarity to objects that are definitely members of a given class, e.g. small, round, green = apple.
Simple Worked Example
Risk of making a claim on a motor insurance policy. This is a predictive classification: they haven't made the claim yet, but do they look like other people who have? To keep it simple, let's look at just age and gender.

The Data
Age  Gender  Claim?
30   Female  No
31   Male    No
27   Male    No
20   Male    Yes
29   Female  No
32   Male    No
46   Male    No
45   Male    No
33   Male    No
25   Female  No
38   Female  No
21   Female  No
38   Female  No
42   Male    No
29   Male    No
37   Male    No
40   Female  No
(Plot: Age vs Gender, marking Claim / No claim, with a reference line at Age 30)
K-Nearest Neighbour
Performed on the raw data. Count the number of other examples that are close to the new case; the winner is the most common class among them.
(Plot: Age vs Gender, with a new person to classify)

Rule Based
If Gender = Male and Age < 30 then Claim
If Gender = Male and Age > 30 then No Claim
etc.
(Plot: Age vs Gender, with a new person to classify)
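The counting step above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's implementation: gender is encoded numerically (0 = Female, 1 = Male, my own assumption) and plain Euclidean distance is used.

```python
from collections import Counter

def knn_classify(new_point, examples, k=3):
    """Classify new_point by majority vote among its k nearest examples.

    new_point: (age, gender) with gender encoded 0 = Female, 1 = Male
    examples:  list of ((age, gender), label) pairs
    """
    # Euclidean distance on the encoded attributes.
    dist = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
    nearest = sorted(examples, key=lambda ex: dist(new_point, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# The motor-insurance data from the slides (gender: 0 = Female, 1 = Male).
data = [((30, 0), "No"), ((31, 1), "No"), ((27, 1), "No"), ((20, 1), "Yes"),
        ((29, 0), "No"), ((32, 1), "No"), ((46, 1), "No"), ((45, 1), "No"),
        ((33, 1), "No"), ((25, 0), "No"), ((38, 0), "No"), ((21, 0), "No"),
        ((38, 0), "No"), ((42, 1), "No"), ((29, 1), "No"), ((37, 1), "No"),
        ((40, 0), "No")]

print(knn_classify((22, 1), data))  # classify a young male prospect
```

Note that with mixed units (years vs a 0/1 flag) the age term dominates the distance; real k-NN implementations usually rescale attributes first.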
Decision Trees
A good automatic rule discovery technique is the decision tree. It produces a set of branching decisions that end in a classification. Works best on nominal attributes; numeric ones need to be split into bins.

A Decision Tree
(Tree: root node Legs with branches 4, 0 and 2. Legs = 4 leads to Size: Med = Cat, Small = Mouse. Legs = 0 leads to Swims?: Y = Fish, N = Snake. Legs = 2 = Bird.)
Note: not all attributes are used in all decisions.
Making a Classification
Each node represents a single variable; each branch represents a value that variable can take. To classify a single example, start at the top of the tree and see which variable it represents. Follow the branch that corresponds to the value that variable takes in your example. Keep going until you reach a leaf, where your object is classified!
University of Stirling 2016

Tree Structure
There are lots of ways to arrange a decision tree. Does it matter which variables go where? Yes: you need to optimise the number of correct classifications, and you want to make the classification process as fast as possible.
A Tree Building Algorithm
Divide and conquer: choose the variable that is at the top of the tree. Create a branch for each possible value. For each branch, repeat the process until there are no more branches to make (i.e. stop when all the instances at the current branch are in the same class). But how do you choose which variable to split on?

The ID3 Algorithm
Split on the variable that gives the greatest information gain. Information can be thought of as a measure of uncertainty. Information is a measure based on the probability of something happening.
Information Example
If I pick a random card from a deck and you have to guess what it is, which would you rather be told: it is red (which has a probability of 0.5), or it is a picture card (which has a probability of 4/13 = 0.31)?

Calculating Information
The information associated with a single event:
I(e) = -log2(p_e)
where p_e is the probability of event e occurring.
I(Red) = -log2(0.5) = 1
I(Picture card) = -log2(0.31) = 1.7
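The formula can be checked directly. A minimal Python sketch, using base-2 logs so information is measured in bits:

```python
import math

def information(p):
    """Self-information -log2(p) of an event with probability p, in bits."""
    return -math.log2(p)

print(round(information(0.5), 2))     # "it is red" -> 1.0 bit
print(round(information(4 / 13), 2))  # "it is a picture card" -> 1.7 bits
```

The rarer event carries more information, which matches the intuition that it removes more uncertainty about the card.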
Average Information
The weighted average information across all possible values of a variable is called entropy. It is calculated as the sum, over all possible events, of the probability of each event times its information value:
H(X) = sum_i P(x_i) I(x_i) = -sum_i P(x_i) log2(P(x_i))
where log2 is the base-2 log.

Entropy of IsPicture?
I(Picture) = -log2(4/13) = 1.7
I(Not Picture) = -log2(9/13) = 0.53
H = 4/13 * 1.7 + 9/13 * 0.53 = 0.89
Entropy H(X) is a measure of uncertainty in variable X. The more even the distribution of X becomes, the higher the entropy gets.
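The entropy formula above reproduces the slide's number directly. A minimal sketch (probabilities are passed in as a list; this helper is illustrative, not from the lecture):

```python
import math

def entropy(probs):
    """H(X) = -sum p * log2(p), skipping zero-probability events."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Entropy of IsPicture? for a random card: P(picture) = 4/13, P(not) = 9/13.
print(round(entropy([4 / 13, 9 / 13]), 2))  # 0.89, as on the slide
# A fair coin is the most uncertain two-outcome variable:
print(entropy([0.5, 0.5]))  # 1.0
```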
Unfair Coin Entropy
(Plot: entropy of a biased coin against P(heads))
The more even the distribution of X becomes, the higher the entropy gets: for a coin with P(heads) = p, entropy peaks at 1 bit when p = 0.5 and falls to 0 as p approaches 0 or 1.

Conditional Entropy
We now introduce conditional entropy: H(Outcome | Known), the uncertainty about the outcome, given that we know Known.
Information Gain
If we know H(Outcome), and we know H(Outcome | Input), we can calculate how much Input tells us about Outcome simply as:
H(Outcome) - H(Outcome | Input)
This is the information gain of Input.

Picking the Top Node
ID3 picks the top node of the tree by calculating the information gain of the output class for each input variable, and picks the one that removes the most uncertainty. It creates a branch for each value the chosen variable can take.
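The gain calculation can be sketched from paired lists of observations: H(Outcome) minus the weighted entropy of the outcomes within each input group. Function names here are my own, not the lecture's:

```python
import math

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(inputs, outcomes):
    """H(Outcome) - H(Outcome | Input) for paired observation lists."""
    groups = {}
    for x, y in zip(inputs, outcomes):
        groups.setdefault(x, []).append(y)
    conditional = sum(len(g) / len(outcomes) * entropy(g)
                      for g in groups.values())
    return entropy(outcomes) - conditional

# An input that determines the outcome removes all 1 bit of uncertainty:
print(information_gain(["a", "a", "b", "b"], ["yes", "yes", "no", "no"]))  # 1.0
# An irrelevant input removes none:
print(information_gain(["a", "b", "a", "b"], ["yes", "yes", "no", "no"]))  # 0.0
```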
Adding Branches
Branches are added by making the same information gain calculation on the data defined by the current branch's position in the tree. If all objects at the current leaf are in the same class, no more branching is needed. The algorithm also stops when all the data has been accounted for.

Person  Hair Length  Weight  Age  Class
Homer    0           250     36   M
Marge   10           150     34   F
Bart     2            90     10   M
Lisa     6            78      8   F
Maggie   4            20      1   F
Abe      1           170     70   M
Selma    8           160     41   F
Otto    10           180     38   M
Krusty   6           200     45   M
Comic    8           290     38   ?
Entropy(S) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))
where p and n are the numbers of positive and negative examples in S.

Entropy(4F, 5M) = -(4/9) log2(4/9) - (5/9) log2(5/9) = 0.9911

Gain(A) = E(current set) - weighted sum of E(all child sets)

Let us try splitting on Hair Length (yes/no branches on Hair Length <= 5):
Gain(Hair Length <= 5) = 0.9911 - (4/9 * 0.8113 + 5/9 * 0.9710) = 0.0911

Let us try splitting on Weight (yes/no branches on Weight <= 160):
Gain(Weight <= 160) = 0.9911 - (5/9 * 0.7219 + 4/9 * 0) = 0.5900
Let us try splitting on Age (yes/no branches on Age <= 40):
Gain(Age <= 40) = 0.9911 - (6/9 * 1 + 3/9 * 0.9183) = 0.0183

Of the 3 features we had, Weight was best. But while people who weigh over 160 are perfectly classified (as males), the people under 160 are not perfectly classified, so we simply recurse on that branch. This time we find that we can split on Hair Length <= 2, and we are done!
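The three gains above can be reproduced in a few lines of Python. This is a sketch under my own conventions (rows hold hair length, weight, age and class in that order; helper names are illustrative):

```python
import math

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def split_gain(rows, attr_index, threshold):
    """Information gain of a yes/no split on rows[attr_index] <= threshold."""
    classes = [r[-1] for r in rows]
    yes = [r[-1] for r in rows if r[attr_index] <= threshold]
    no = [r[-1] for r in rows if r[attr_index] > threshold]
    weighted = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(rows)
    return entropy(classes) - weighted

# (hair length, weight, age, class) for the nine labelled people on the slide.
rows = [(0, 250, 36, "M"), (10, 150, 34, "F"), (2, 90, 10, "M"),
        (6, 78, 8, "F"), (4, 20, 1, "F"), (1, 170, 70, "M"),
        (8, 160, 41, "F"), (10, 180, 38, "M"), (6, 200, 45, "M")]

print(round(split_gain(rows, 0, 5), 4))    # Hair Length <= 5 -> 0.0911
print(round(split_gain(rows, 1, 160), 4))  # Weight <= 160    -> 0.59
print(round(split_gain(rows, 2, 40), 4))   # Age <= 40        -> 0.0183
```

A full ID3 would pick the highest-gain split (Weight here) and recurse on each branch until every leaf is pure.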
We don't need to keep the data around, just the test conditions:

Weight <= 160?
  yes -> Hair Length <= 2?
           yes -> Male
           no  -> Female
  no  -> Male

How would these people be classified?

It is trivial to convert decision trees to rules.

Rules to Classify Males/Females
If Weight greater than 160, classify as Male
Else if Hair Length less than or equal to 2, classify as Male
Else classify as Female
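The rule form translates directly into code. A minimal sketch of the slide's rules (the function name is my own):

```python
def classify(weight, hair_length):
    """The final decision tree from the slides, expressed as rules."""
    if weight > 160:
        return "Male"
    elif hair_length <= 2:
        return "Male"
    else:
        return "Female"

# Comic (hair length 8, weight 290) falls down the weight > 160 branch:
print(classify(290, 8))  # Male
```

Note the rules only ever test Weight and Hair Length; Age never appears, because it was never the best split.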
Other Classification Methods
You will meet a certain type of neural network in a later lecture; these too are good at classification. There are many, many, many other methods for building classification systems.