SVM and Decision Tree
Le Song
Machine Learning I, CSE 6740, Fall 2013
Which decision boundary is better?
- Suppose the training samples are linearly separable: we can find a decision boundary that gives zero training error.
- But there are many such decision boundaries. Which one is better?
(Figure: two linearly separable classes, Class 1 and Class 2, with several candidate boundaries.)
Compare two decision boundaries
- Suppose we perturb the data: which boundary is more susceptible to error?
Constraints on data points
- For all $x$ in class 2, $y = 1$ and $w^\top x + b \ge c$.
- For all $x$ in class 1, $y = -1$ and $w^\top x + b \le -c$.
- Or more compactly: $y(w^\top x + b) \ge c$.
(Figure: decision boundary $w^\top x + b = 0$, with the two classes on either side of the margin lines at $\pm c$.)
Classifier margin
- Pick two data points $x_1$ and $x_2$ that lie on the two dashed margin lines, one on each side.
- The margin is
$$\gamma = \frac{1}{\|w\|} \, w^\top (x_1 - x_2) = \frac{2c}{\|w\|}$$
(Figure: boundary $w^\top x + b = 0$ with $x_1$ on the Class 2 margin line and $x_2$ on the Class 1 margin line, at offsets $\pm c$.)
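Spelling out the step the slide compresses (this is just the standard projection argument, nothing beyond what the figure encodes):

```latex
% x_1 and x_2 lie on the two margin lines:
%   w^T x_1 + b =  c
%   w^T x_2 + b = -c
% Subtracting the two equations gives  w^T (x_1 - x_2) = 2c.
% Projecting the difference x_1 - x_2 onto the unit normal w / ||w|| gives the margin:
\gamma \;=\; \frac{w^\top (x_1 - x_2)}{\|w\|} \;=\; \frac{2c}{\|w\|}.
```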
Maximum margin classifier
- Find the decision boundary $w$ as far from the data points as possible:
$$\max_{w,b} \ \frac{2c}{\|w\|} \quad \text{s.t.} \quad y_i (w^\top x_i + b) \ge c, \ \forall i$$
(Figure: the same margin picture as on the previous slide.)
Support vector machines with hard margin
- Fixing $c = 1$ (which only rescales $w$ and $b$), maximizing the margin is equivalent to
$$\min_{w,b} \ \|w\|^2 \quad \text{s.t.} \quad y_i (w^\top x_i + b) \ge 1, \ \forall i$$
- Convert to standard form:
$$\min_{w,b} \ \tfrac{1}{2} w^\top w \quad \text{s.t.} \quad 1 - y_i (w^\top x_i + b) \le 0, \ \forall i$$
- The Lagrangian function:
$$L(w, b, \alpha) = \tfrac{1}{2} w^\top w + \sum_{i=1}^m \alpha_i \left(1 - y_i (w^\top x_i + b)\right)$$
Deriving the dual problem
$$L(w, b, \alpha) = \tfrac{1}{2} w^\top w + \sum_{i=1}^m \alpha_i \left(1 - y_i (w^\top x_i + b)\right)$$
- Taking derivatives and setting them to zero:
$$\frac{\partial L}{\partial w} = w - \sum_{i=1}^m \alpha_i y_i x_i = 0, \qquad \frac{\partial L}{\partial b} = -\sum_{i=1}^m \alpha_i y_i = 0$$
Plug back the relations for w and b
$$L(w, b, \alpha) = \tfrac{1}{2} \left(\sum_{i=1}^m \alpha_i y_i x_i\right)^{\!\top} \left(\sum_{j=1}^m \alpha_j y_j x_j\right) + \sum_{i=1}^m \alpha_i \left(1 - y_i \left(\sum_{j=1}^m \alpha_j y_j \, x_j^\top x_i + b\right)\right)$$
- After simplification:
$$L(w, b, \alpha) = \sum_{i=1}^m \alpha_i - \tfrac{1}{2} \sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j \, x_i^\top x_j$$
The dual problem
$$\max_{\alpha} \ \sum_{i=1}^m \alpha_i - \tfrac{1}{2} \sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j \, x_i^\top x_j \quad \text{s.t.} \quad \alpha_i \ge 0, \ i = 1, \dots, m, \quad \sum_{i=1}^m \alpha_i y_i = 0$$
- This is a constrained quadratic program: nice and convex, so the global maximum can be found.
- $w$ can then be recovered as $w = \sum_{i=1}^m \alpha_i y_i x_i$. How about $b$?
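Because the dual is a standard quadratic program, it can be handed to any off-the-shelf QP solver. The sketch below uses cvxopt, which is an assumption (the lecture does not prescribe a solver); X is an m-by-d data matrix, y is a vector of +/-1 labels, and the 1e-6 support-vector threshold is also illustrative.

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual_hard_margin(X, y):
    """Solve the hard-margin SVM dual with a generic QP solver.
    Dual:  max  sum_i a_i - 1/2 sum_ij a_i a_j y_i y_j <x_i, x_j>
           s.t. a_i >= 0,  sum_i a_i y_i = 0
    cvxopt solves: min 1/2 a'Pa + q'a  s.t.  Ga <= h,  Aa = b."""
    y = np.asarray(y, dtype=float)
    m = X.shape[0]
    K = X @ X.T                                   # Gram matrix of inner products
    P = matrix(np.outer(y, y) * K)                # P_ij = y_i y_j <x_i, x_j>
    q = matrix(-np.ones(m))                       # maximizing sum(a) = minimizing -sum(a)
    G = matrix(-np.eye(m))                        # -a_i <= 0, i.e. a_i >= 0
    h = matrix(np.zeros(m))
    A = matrix(y.reshape(1, -1))                  # sum_i a_i y_i = 0
    b = matrix(0.0)
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])
    w = (alpha * y) @ X                           # w = sum_i a_i y_i x_i
    sv = alpha > 1e-6                             # support vectors: a_i effectively nonzero
    b_val = np.mean(y[sv] - X[sv] @ w)            # from y_i (w'x_i + b) = 1 on the margin
    return w, b_val, alpha
```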
Support vectors
- Note the KKT (complementary slackness) condition: $\alpha_i \left(1 - y_i (w^\top x_i + b)\right) = 0$.
- For data points with $1 - y_i (w^\top x_i + b) < 0$: $\alpha_i = 0$.
- For data points with $1 - y_i (w^\top x_i + b) = 0$: $\alpha_i > 0$ is possible.
- Call the training data points whose $\alpha_i$'s are nonzero the support vectors (SV).
(Figure: most points have $\alpha_i = 0$; only a few points on the margin, e.g. $\alpha_1 = 0.8$, $\alpha_6 = 1.4$, $\alpha_8 = 0.6$, are support vectors.)
Computing b and obtaining the classifier
- Pick any data point with $\alpha_i > 0$ and solve for $b$ from $1 - y_i (w^\top x_i + b) = 0$.
- For a new test point $z$, compute
$$w^\top z + b = \sum_{i \in \text{support vectors}} \alpha_i y_i \, x_i^\top z + b$$
- Classify $z$ as class 1 if the result is positive, and class 2 otherwise.
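A tiny illustrative helper for this decision rule. The arguments (sv_alpha, sv_x, sv_y, b) are assumed to come from a dual solver such as the sketch above; the names are not from the lecture.

```python
import numpy as np

def svm_predict(z, sv_alpha, sv_x, sv_y, b):
    """Classify a new point z using only the support vectors.
    Returns +1 (class 1) or -1 (class 2)."""
    score = np.sum(sv_alpha * sv_y * (sv_x @ z)) + b   # w'z + b via inner products with SVs
    return 1 if score > 0 else -1
```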
Interpretation of support vector machines
- The optimal $w$ is a linear combination of a small number of data points. This sparse representation can be viewed as data compression.
- To compute the weights $\alpha_i$, and to use the support vector machine, we only need to specify the inner products (or kernel) between the examples, $x_i^\top x_j$.
- We make decisions by comparing each new example $z$ with only the support vectors:
$$y = \operatorname{sign}\left(\sum_{i \in \text{support vectors}} \alpha_i y_i \, x_i^\top z + b\right)$$
Soft margin constraints
- What if the data is not linearly separable?
- We allow points to violate the hard margin constraint: $y(w^\top x + b) \ge 1 - \xi$.
(Figure: boundary $w^\top x + b = 0$ with slack variables $\xi_1, \xi_2, \xi_3$ for points on the wrong side of their margin line.)
Soft margin SVM
$$\min_{w,b,\xi} \ \|w\|^2 + C \sum_{i=1}^m \xi_i \quad \text{s.t.} \quad y_i (w^\top x_i + b) \ge 1 - \xi_i, \ \xi_i \ge 0, \ \forall i$$
- Convert to standard form:
$$\min_{w,b,\xi} \ \tfrac{1}{2} w^\top w + C \sum_{i=1}^m \xi_i \quad \text{s.t.} \quad 1 - y_i (w^\top x_i + b) - \xi_i \le 0, \ -\xi_i \le 0, \ \forall i$$
- The Lagrangian function:
$$L(w, b, \xi, \alpha, \beta) = \tfrac{1}{2} w^\top w + \sum_{i=1}^m \left[ C \xi_i + \alpha_i \left(1 - y_i (w^\top x_i + b) - \xi_i\right) - \beta_i \xi_i \right]$$
Deriving the dual problem
$$L(w, b, \xi, \alpha, \beta) = \tfrac{1}{2} w^\top w + \sum_{i=1}^m \left[ C \xi_i + \alpha_i \left(1 - y_i (w^\top x_i + b) - \xi_i\right) - \beta_i \xi_i \right]$$
- Taking derivatives and setting them to zero:
$$\frac{\partial L}{\partial w} = w - \sum_{i=1}^m \alpha_i y_i x_i = 0, \qquad \frac{\partial L}{\partial b} = -\sum_{i=1}^m \alpha_i y_i = 0, \qquad \frac{\partial L}{\partial \xi_i} = C - \alpha_i - \beta_i = 0$$
Plug back the relations for w, b and ξ
$$L = \tfrac{1}{2} \left(\sum_{i=1}^m \alpha_i y_i x_i\right)^{\!\top} \left(\sum_{j=1}^m \alpha_j y_j x_j\right) + \sum_{i=1}^m \alpha_i \left(1 - y_i \left(\sum_{j=1}^m \alpha_j y_j \, x_j^\top x_i + b\right)\right)$$
- After simplification (the $\xi_i$ terms cancel because $C - \alpha_i - \beta_i = 0$):
$$L = \sum_{i=1}^m \alpha_i - \tfrac{1}{2} \sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j \, x_i^\top x_j$$
The dual problem
$$\max_{\alpha} \ \sum_{i=1}^m \alpha_i - \tfrac{1}{2} \sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j \, x_i^\top x_j \quad \text{s.t.} \quad C - \alpha_i - \beta_i = 0, \ \alpha_i \ge 0, \ \beta_i \ge 0, \ i = 1, \dots, m, \quad \sum_{i=1}^m \alpha_i y_i = 0$$
- The constraints $C - \alpha_i - \beta_i = 0$, $\alpha_i \ge 0$, $\beta_i \ge 0$ can be simplified to the box constraint $C \ge \alpha_i \ge 0$.
- This is again a constrained quadratic program: nice and convex, so the global maximum can be found.
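In practice the soft-margin dual is usually solved by a library. A minimal sketch with scikit-learn's SVC, which is an assumption (the lecture does not name a package); the make_blobs toy data is illustrative only. The parameter C is exactly the box-constraint upper bound on each alpha_i above.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy two-class data standing in for a real training set.
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

# Linear soft-margin SVM: C bounds each dual variable, 0 <= alpha_i <= C.
clf = SVC(kernel='linear', C=1.0)
clf.fit(X, y)

print(clf.support_)               # indices of the support vectors (alpha_i > 0)
print(clf.dual_coef_)             # alpha_i * y_i for the support vectors
print(clf.coef_, clf.intercept_)  # w and b recovered from the dual solution
```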
Learning nonlinear decision boundaries
- Some problems are linearly separable; others are not.
- Examples of nonlinearly separable problems: the XOR gate, speech recognition.
(Figure: linearly separable vs. nonlinearly separable data.)
A decision tree for Tax Fraud
- Input: a vector of attributes X = [Refund, MarSt, TaxInc]
- Output: Y = Cheating or Not
- The hypothesis H as a procedure:
  - Each internal node: tests one attribute $X_i$
  - Each branch from a node: selects one value for $X_i$
  - Each leaf node: predicts Y
(Tree: Refund? Yes → NO; No → MarSt? Married → NO; Single, Divorced → TaxInc? < 80K → NO; > 80K → YES.)
Apply model to test data (I-V)
- Query record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
- Start from the root of the tree: Refund = No, so follow the "No" branch down to the MarSt node.
- MarSt = Married, so follow the "Married" branch, which reaches a leaf.
- Assign Cheat = No to the query record.
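For concreteness, the tree from these slides can be written as a small procedure. This is an illustrative sketch: the record is passed as a Python dict with the attribute names used above, and whether income exactly at 80K goes left or right is not specified on the slide (it does not matter for this query).

```python
def tax_fraud_tree(record):
    """Decision tree from the slides: Refund -> MarSt -> TaxInc."""
    if record["Refund"] == "Yes":
        return "No"                      # leaf: not cheating
    if record["MarSt"] == "Married":
        return "No"                      # leaf: not cheating
    # Single or Divorced: split on taxable income (threshold 80K from the slide)
    return "Yes" if record["TaxInc"] > 80 else "No"

# The query record from the walkthrough above:
print(tax_fraud_tree({"Refund": "No", "MarSt": "Married", "TaxInc": 80}))  # -> "No"
```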
Expressiveness of decision trees
- Decision trees can express any function of the input attributes.
- E.g., for Boolean functions: each truth-table row corresponds to one root-to-leaf path.
- Trivially, there is a consistent decision tree for any training set, with one path to a leaf for each example.
- But we prefer to find more compact decision trees.
Hypothesis spaces (model space)
- How many distinct decision trees with n Boolean attributes?
  = number of Boolean functions
  = number of distinct truth tables with $2^n$ rows
  = $2^{2^n}$
- E.g., with 6 Boolean attributes, there are 18,446,744,073,709,551,616 trees.
- How many purely conjunctive hypotheses (e.g., Hungry ∧ ¬Rain)?
  Each attribute can be in (positive), in (negative), or out, giving $3^n$ distinct conjunctive hypotheses.
- A more expressive hypothesis space:
  - increases the chance that the target function can be expressed,
  - but also increases the number of hypotheses consistent with the training set, so we may get worse predictions.
Decision tree learning
- Induction: run a tree-induction algorithm on the training set to learn a model (a decision tree).
- Deduction: apply the model to the test set to predict the unknown class labels.

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Example of a decision tree

Training Data:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model (decision tree), with Refund as the splitting attribute at the root:
Refund? Yes → NO; No → MarSt? Married → NO; Single, Divorced → TaxInc? < 80K → NO; > 80K → YES.
Another example of a decision tree
Model, with MarSt at the root (same training data as the previous slide):
MarSt? Married → NO; Single, Divorced → Refund? Yes → NO; No → TaxInc? < 80K → NO; > 80K → YES.
- There could be more than one tree that fits the same data!
Top-Down Induction of Decision Trees
Main loop:
1. A ← the best decision attribute for the next node
2. Assign A as the decision attribute for the node
3. For each value of A, create a new descendant of the node
4. Sort the training examples to the leaf nodes
5. If the training examples are perfectly classified, then STOP; else iterate over the new leaf nodes
(A code sketch of this loop follows.)
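A minimal sketch of this greedy loop in the style of ID3, choosing the attribute that minimizes the weighted child entropy (equivalently, maximizes information gain, defined a few slides below). Function and variable names are illustrative, not from the lecture; examples are dicts of attribute values.

```python
import math
from collections import Counter

def entropy(labels):
    """H = -sum_i p_i log2 p_i over the class proportions in `labels`."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def id3(examples, labels, attributes):
    """Recursive top-down induction of a decision tree."""
    if len(set(labels)) == 1:                      # perfectly classified: STOP
        return labels[0]
    if not attributes:                             # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]

    def weighted_child_entropy(a):                 # average impurity after splitting on a
        values = set(ex[a] for ex in examples)
        return sum(
            (len(sub) / len(labels)) * entropy(sub)
            for v in values
            for sub in [[l for ex, l in zip(examples, labels) if ex[a] == v]]
        )

    best = min(attributes, key=weighted_child_entropy)
    tree = {best: {}}
    for v in set(ex[best] for ex in examples):     # one descendant per value of the best attribute
        idx = [i for i, ex in enumerate(examples) if ex[best] == v]
        tree[best][v] = id3([examples[i] for i in idx],
                            [labels[i] for i in idx],
                            [a for a in attributes if a != best])
    return tree
```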
Tree Induction
- Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
- Issues:
  - How to split the records: how to specify the attribute test condition, and how to determine the best split?
  - When to stop splitting?
Splitting Based on Nominal Attributes
- Multi-way split: use as many partitions as distinct values, e.g. CarType → {Family}, {Sports}, {Luxury}.
- Binary split: divide the values into two subsets; need to find the optimal partitioning, e.g. CarType → {Sports, Luxury} vs. {Family}, or {Family, Luxury} vs. {Sports}.
Splitting Based on Ordinal Attributes
- Multi-way split: use as many partitions as distinct values, e.g. Size → {Small}, {Medium}, {Large}.
- Binary split: divide the values into two subsets; need to find the optimal partitioning, e.g. Size → {Small, Medium} vs. {Large}, or {Medium, Large} vs. {Small}.
Splitting Based on Continuous Attributes
- Different ways of handling:
  - Discretization to form an ordinal categorical attribute.
    - Static: discretize once at the beginning.
    - Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering.
  - Binary decision: (A < t) or (A ≥ t). Consider all possible splits and find the best cut; this can be more compute-intensive (a sketch of the threshold scan follows below).
(Examples: (i) binary split: Taxable Income > 80K? Yes/No; (ii) multi-way split: Taxable Income in < 10K, [10K, 25K), [25K, 50K), [50K, 80K), ≥ 80K.)
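A minimal sketch of the binary-decision threshold scan for one continuous attribute, picking the cut with the smallest weighted child entropy (equivalently, the largest information gain). The names and the midpoint candidate set are illustrative choices, not prescribed by the lecture; the income/cheat data below are the values from the example table.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_cut(values, labels):
    """Scan candidate thresholds for a continuous attribute.
    Returns (threshold, weighted child entropy) for the best binary split (A < t) vs (A >= t)."""
    order = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(order, order[1:])]   # midpoints between distinct values
    best = None
    for t in candidates:
        left = [l for v, l in zip(values, labels) if v < t]
        right = [l for v, l in zip(values, labels) if v >= t]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if best is None or score < best[1]:
            best = (t, score)
    return best

# Taxable income (in K) and Cheat labels from the example table:
income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_cut(income, cheat))
```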
How to determine the Best Split
- Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative".
  - Homogeneous subsets: low degree of impurity.
  - Non-homogeneous subsets: high degree of impurity.
- Greedy approach: nodes with a homogeneous class distribution are preferred.
- We need a measure of node impurity.
How to compare attributes? Entropy
- Entropy H(X) of a random variable X: the expected number of bits needed to encode a randomly drawn value of X (under the most efficient code).
- Information theory: the most efficient code assigns $-\log_2 P(X = i)$ bits to encode the message $X = i$. So the expected number of bits to code one random X is
$$H(X) = -\sum_i P(X = i) \log_2 P(X = i)$$
Sample Entropy
- S is a sample of training examples.
- $p_+$ is the proportion of positive examples in S; $p_-$ is the proportion of negative examples in S.
- Entropy measures the impurity of S:
$$H(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-$$
Examples for computing Entropy
- C1: 0, C2: 6.  P(C1) = 0/6 = 0, P(C2) = 6/6 = 1.  Entropy = $-0 \log_2 0 - 1 \log_2 1 = 0$.
- C1: 1, C2: 5.  P(C1) = 1/6, P(C2) = 5/6.  Entropy = $-(1/6) \log_2 (1/6) - (5/6) \log_2 (5/6) \approx 0.65$.
- C1: 2, C2: 4.  P(C1) = 2/6, P(C2) = 4/6.  Entropy = $-(2/6) \log_2 (2/6) - (4/6) \log_2 (4/6) \approx 0.92$.
(By convention, $0 \log_2 0 = 0$.)
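A quick numerical check of these three cases (illustrative snippet; the class counts are the ones on the slide).

```python
import math

def entropy(counts):
    """Entropy of a class-count vector, with the convention 0 * log2(0) = 0."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

print(entropy([0, 6]))  # 0.0
print(entropy([1, 5]))  # ~0.650
print(entropy([2, 4]))  # ~0.918
```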
How to compare attributes? Conditional Entropy
- Conditional entropy of variable X given variable Y:
  - Given a specific value Y = v, the entropy of X is $H(X \mid Y = v) = -\sum_i P(X = i \mid Y = v) \log_2 P(X = i \mid Y = v)$.
  - The conditional entropy $H(X \mid Y)$ of X is the average of $H(X \mid Y = v)$: $H(X \mid Y) = \sum_v P(Y = v) \, H(X \mid Y = v)$.
- Mutual information (aka information gain) of X given Y: $I(X; Y) = H(X) - H(X \mid Y)$.
Information Gain
- Information gain (after splitting a node):
$$\mathrm{GAIN}_{\mathrm{split}} = \mathrm{Entropy}(p) - \sum_{i=1}^k \frac{n_i}{n} \, \mathrm{Entropy}(i)$$
  where the $n$ samples in parent node $p$ are split into $k$ partitions, and $n_i$ is the number of records in partition $i$.
- Measures the reduction in entropy achieved by the split. Choose the split that achieves the most reduction (maximizes GAIN). A code sketch follows.
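A small sketch of this formula; the example partitions are the Cheat labels of the tax-fraud table split by Refund (Yes vs. No). The function names are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent_labels, partitions):
    """GAIN_split = Entropy(parent) - sum_i (n_i / n) * Entropy(partition_i)."""
    n = len(parent_labels)
    children = sum((len(p) / n) * entropy(p) for p in partitions)
    return entropy(parent_labels) - children

# Splitting the Cheat labels by Refund in the example table:
refund_yes = ["No", "No", "No"]                              # Tid 1, 4, 7
refund_no  = ["No", "No", "Yes", "No", "Yes", "No", "Yes"]   # the remaining records
print(information_gain(refund_yes + refund_no, [refund_yes, refund_no]))
```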
Problem of splitting using information gain
- Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.
- Gain ratio:
$$\mathrm{GainRATIO}_{\mathrm{split}} = \frac{\mathrm{GAIN}_{\mathrm{split}}}{\mathrm{SplitINFO}}, \qquad \mathrm{SplitINFO} = -\sum_{i=1}^k \frac{n_i}{n} \log_2 \frac{n_i}{n}$$
- Adjusts the information gain by the entropy of the partitioning (SplitINFO). Higher-entropy partitionings (a large number of small partitions) are penalized.
- Used in C4.5; designed to overcome the disadvantage of information gain.
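To see the penalization at work, here is a self-contained snippet comparing SplitINFO for a 2-way and a 10-way split of the same 10 records (the numbers are illustrative, not from the slide).

```python
import math

def split_info(partition_sizes):
    """SplitINFO = -sum_i (n_i/n) * log2(n_i/n): the entropy of the partitioning itself."""
    n = sum(partition_sizes)
    return -sum((ni / n) * math.log2(ni / n) for ni in partition_sizes if ni > 0)

print(split_info([3, 7]))       # ~0.88 bits for a 2-way split
print(split_info([1] * 10))     # ~3.32 bits for a 10-way split of the same records
# GainRATIO = GAIN / SplitINFO, so the many-small-partitions split needs a much
# larger raw gain before it is preferred.
```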
Stopping Criteria for Tree Induction
- Stop expanding a node when all of its records belong to the same class.
- Stop expanding a node when all of its records have similar attribute values.
- Early termination (to be discussed later).
Decision Tree Based Classification
- Advantages:
  - Inexpensive to construct.
  - Extremely fast at classifying unknown records.
  - Easy to interpret for small-sized trees.
  - Accuracy is comparable to other classification techniques for many simple data sets.
- Example: C4.5
  - Simple depth-first construction; uses information gain.
  - Sorts continuous attributes at each node.
  - Needs the entire data set to fit in memory, so it is unsuitable for large datasets (would need out-of-core sorting).
  - You can download the software from: http://www.cse.unsw.edu.au/~quinlan/c4.5r8.tar.gz