
SVM and Decision Tree. Le Song, Machine Learning I, CSE 6740, Fall 2013

Which decision boundary is better? Suppose the training samples are linearly separable. We can find a decision boundary which gives zero training error, but there are many such decision boundaries. Which one is better? (Figure: two classes, Class 1 and Class 2, with several candidate zero-error boundaries.)

Compare two decision boundaries: suppose we perturb the data, which boundary is more susceptible to error?

Constraints on data points. For all x in class 1, y = +1 and wᵀx + b ≥ c; for all x in class 2, y = -1 and wᵀx + b ≤ -c. Or more compactly, y(wᵀx + b) ≥ c. (Figure: the decision boundary wᵀx + b = 0, with the two classes lying outside the band of width 2c around it.)

Classifier margin. Pick two data points x_1 and x_2 which lie on each dashed line respectively. The margin is γ = (1/‖w‖) wᵀ(x_1 - x_2) = 2c/‖w‖. (Figure: x_1 and x_2 on the two margin lines wᵀx + b = ±c, on either side of the boundary wᵀx + b = 0.)

Maximum margin classifier. Find the decision boundary w that is as far from the data points as possible: max_{w,b} 2c/‖w‖ s.t. y_i(wᵀx_i + b) ≥ c, ∀i.

Support vector machines with hard margin: min_{w,b} ‖w‖² s.t. y_i(wᵀx_i + b) ≥ 1, ∀i. Convert to standard form: min_{w,b} (1/2) wᵀw s.t. 1 - y_i(wᵀx_i + b) ≤ 0, ∀i. The Lagrangian function: L(w, b, α) = (1/2) wᵀw + Σ_{i=1}^m α_i [1 - y_i(wᵀx_i + b)].

Deriving the dual problem. L(w, b, α) = (1/2) wᵀw + Σ_{i=1}^m α_i [1 - y_i(wᵀx_i + b)]. Taking derivatives and setting them to zero: ∂L/∂w = w - Σ_{i=1}^m α_i y_i x_i = 0, and ∂L/∂b = -Σ_{i=1}^m α_i y_i = 0.

Plug back the relation for w and b: L(w, b, α) = (1/2)(Σ_i α_i y_i x_i)ᵀ(Σ_j α_j y_j x_j) + Σ_i α_i [1 - y_i((Σ_j α_j y_j x_j)ᵀ x_i + b)]. After simplification: L(α) = Σ_{i=1}^m α_i - (1/2) Σ_{i,j} α_i α_j y_i y_j x_iᵀx_j.

The dual problem: max_α Σ_{i=1}^m α_i - (1/2) Σ_{i,j} α_i α_j y_i y_j x_iᵀx_j s.t. α_i ≥ 0, i = 1, ..., m, and Σ_{i=1}^m α_i y_i = 0. This is a constrained quadratic program: nice and convex, so the global maximum can be found. w can then be recovered as w = Σ_{i=1}^m α_i y_i x_i. How about b?
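Since this dual is a small constrained QP, any general-purpose constrained optimizer can solve it numerically. Below is a minimal sketch (not part of the lecture) that solves the hard-margin dual on a toy 2-D dataset with scipy's SLSQP solver; the toy data, variable names, and solver choice are illustrative assumptions, and a dedicated QP or SMO solver would be used in practice.

```python
# Sketch: solve the hard-margin SVM dual on a toy 2-D dataset (assumed data).
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data; class 1 labeled +1, class 2 labeled -1 (as on the slides)
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],    # class 1
              [0.0, 0.5], [0.5, 0.0], [1.0, 0.5]])   # class 2
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
m = len(y)

# K[i, j] = y_i y_j x_i . x_j
K = (y[:, None] * y[None, :]) * (X @ X.T)

def neg_dual(alpha):
    # negative dual objective: -(sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j x_i.x_j)
    return -(alpha.sum() - 0.5 * alpha @ K @ alpha)

constraints = [{"type": "eq", "fun": lambda a: a @ y}]   # sum_i alpha_i y_i = 0
bounds = [(0.0, None)] * m                               # alpha_i >= 0

res = minimize(neg_dual, x0=np.zeros(m), method="SLSQP",
               bounds=bounds, constraints=constraints)
alpha = res.x
w = (alpha * y) @ X                                      # w = sum_i alpha_i y_i x_i
print("alpha =", np.round(alpha, 3))
print("w =", np.round(w, 3))
```

Most α_i come out (numerically) zero; only the points on the margin carry weight, which is exactly the support-vector picture on the next slide.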

Support vectors. Note the KKT condition α_i [1 - y_i(wᵀx_i + b)] = 0. For data points with 1 - y_i(wᵀx_i + b) < 0, α_i = 0; for data points with 1 - y_i(wᵀx_i + b) = 0, α_i can be nonzero. Call the training data points whose α_i's are nonzero the support vectors (SV). (Figure: a separable two-class dataset in which most points have α_i = 0 and only the points on the margin have α_i > 0, e.g. α_6 = 1.4, α_1 = 0.8, α_8 = 0.6.)

Computing b and obtaining the classifier. Pick any data point with α_i > 0 and solve for b using 1 - y_i(wᵀx_i + b) = 0. For a new test point z, compute wᵀz + b = Σ_{i ∈ support vectors} α_i y_i x_iᵀz + b. Classify z as class 1 if the result is positive, and class 2 otherwise.
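A short helper along these lines (names are illustrative; alpha, X, y are assumed to come from a dual solve such as the sketch above):

```python
# Sketch: recover b from a support vector and classify a new point z.
import numpy as np

def svm_predict(alpha, X, y, z, tol=1e-6):
    """Classify a new point z given a solved hard-margin dual (alpha, X, y)."""
    sv = np.where(alpha > tol)[0]          # support vectors: alpha_i > 0
    w = (alpha * y) @ X                    # w = sum_i alpha_i y_i x_i
    i = sv[0]                              # any support vector works
    b = y[i] - w @ X[i]                    # from y_i (w.x_i + b) = 1 and y_i in {+1, -1}
    score = sum(alpha[j] * y[j] * (X[j] @ z) for j in sv) + b
    return "class 1" if score > 0 else "class 2"
```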

Interpretation of support vector machines. The optimal w is a linear combination of a small number of data points; this sparse representation can be viewed as data compression. To compute the weights α_i, and to use support vector machines, we need to specify only the inner products (or kernel) between the examples, x_iᵀx_j. We make decisions by comparing each new example z with only the support vectors: y = sign(Σ_{i ∈ support vectors} α_i y_i x_iᵀz + b).
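Because only inner products x_iᵀz appear in the decision rule, each of them can be replaced by a kernel evaluation k(x_i, z). A small sketch of this idea (the RBF kernel, the gamma value, and the names are assumptions, not from the lecture):

```python
# Sketch: kernelized decision rule; only support vectors enter the sum.
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    # k(x, z) = exp(-gamma * ||x - z||^2), one common drop-in for the inner product
    return np.exp(-gamma * np.sum((x - z) ** 2))

def kernel_predict(alpha, X, y, b, z, kernel=rbf_kernel, tol=1e-6):
    sv = np.where(alpha > tol)[0]
    score = sum(alpha[i] * y[i] * kernel(X[i], z) for i in sv) + b
    return "class 1" if score > 0 else "class 2"
```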

Soft margin constraints. What if the data is not linearly separable? We will allow points to violate the hard margin constraint: y(wᵀx + b) ≥ 1 - ξ, with slack ξ ≥ 0. (Figure: the boundary wᵀx + b = 0 with slack variables ξ_1, ξ_2, ξ_3 measuring how far three violating points fall inside the margin.)

Soft margin SVM: min_{w,b,ξ} ‖w‖² + C Σ_{i=1}^m ξ_i s.t. y_i(wᵀx_i + b) ≥ 1 - ξ_i, ξ_i ≥ 0, ∀i. Convert to standard form: min_{w,b,ξ} (1/2) wᵀw + C Σ_i ξ_i s.t. 1 - y_i(wᵀx_i + b) - ξ_i ≤ 0, -ξ_i ≤ 0, ∀i. The Lagrangian function: L(w, b, ξ, α, β) = (1/2) wᵀw + C Σ_i ξ_i + Σ_i α_i [1 - y_i(wᵀx_i + b) - ξ_i] - Σ_i β_i ξ_i.

Deriving the dual problem. L(w, b, ξ, α, β) = (1/2) wᵀw + C Σ_i ξ_i + Σ_i α_i [1 - y_i(wᵀx_i + b) - ξ_i] - Σ_i β_i ξ_i. Taking derivatives and setting them to zero: ∂L/∂w = w - Σ_{i=1}^m α_i y_i x_i = 0, ∂L/∂b = -Σ_{i=1}^m α_i y_i = 0, and ∂L/∂ξ_i = C - α_i - β_i = 0.

Plug back the relations for w, b and ξ: L = (1/2)(Σ_i α_i y_i x_i)ᵀ(Σ_j α_j y_j x_j) + Σ_i α_i [1 - y_i((Σ_j α_j y_j x_j)ᵀ x_i + b)]. After simplification: L(α) = Σ_{i=1}^m α_i - (1/2) Σ_{i,j} α_i α_j y_i y_j x_iᵀx_j.

The dual problem: max_α Σ_{i=1}^m α_i - (1/2) Σ_{i,j} α_i α_j y_i y_j x_iᵀx_j s.t. C - α_i - β_i = 0, α_i ≥ 0, β_i ≥ 0, i = 1, ..., m, and Σ_{i=1}^m α_i y_i = 0. The constraints C - α_i - β_i = 0, α_i ≥ 0, β_i ≥ 0 can be simplified to 0 ≤ α_i ≤ C. This is a constrained quadratic program: nice and convex, so the global maximum can be found.
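In practice this boxed QP is rarely solved by hand. As one illustration (an assumption, not part of the lecture), scikit-learn's SVC fits a soft-margin SVM whose C parameter plays exactly this box-constraint role; the toy data below is made up:

```python
# Sketch: soft-margin linear SVM via scikit-learn (solves the boxed dual 0 <= alpha_i <= C).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[2.0, 2.0], size=(20, 2)),    # class +1 (some overlap)
               rng.normal(loc=[0.0, 0.0], size=(20, 2))])   # class -1
y = np.array([1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)   # smaller C -> more slack allowed, wider margin
print("number of support vectors:", len(clf.support_))
print("w =", clf.coef_[0], "b =", clf.intercept_[0])
```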

Learning nonlinear decision boundaries. Some problems are linearly separable; others, such as the XOR gate or speech recognition, are not. (Figures: a linearly separable dataset next to nonlinearly separable ones.)

A decision tree for Tax Fraud. Input: a vector of attributes X = [Refund, MarSt, TaxInc]. Output: Y = Cheating or Not. The hypothesis H as a procedure: each internal node tests one attribute X_i, each branch from a node selects one value for X_i, and each leaf node predicts Y. The tree:

Refund?
  Yes -> NO
  No  -> MarSt?
           Married -> NO
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES
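The "hypothesis as a procedure" view can be made literal. A sketch with the slide's tree hard-coded (the record format and key names are assumptions):

```python
# Sketch: the tax-fraud decision tree as a plain procedure.
def predict_cheat(record):
    """record: dict with keys 'Refund', 'MarSt', 'TaxInc' (income in thousands)."""
    if record["Refund"] == "Yes":
        return "No"
    if record["MarSt"] == "Married":
        return "No"
    # Single or Divorced: test taxable income
    return "Yes" if record["TaxInc"] > 80 else "No"

print(predict_cheat({"Refund": "No", "MarSt": "Married", "TaxInc": 80}))  # -> "No"
```

This reproduces the walk-through on the next slides.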

Apply model to test data (slides I to V). Query record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ? Start from the root of the tree: Refund is No, so follow the No branch to MarSt; MarSt is Married, so follow the Married branch to a leaf; assign Cheat to No.

Expressiveness of decision trees. Decision trees can express any function of the input attributes. E.g., for Boolean functions, each truth table row corresponds to a path to a leaf. Trivially, there is a consistent decision tree for any training set, with one path to a leaf for each example. We prefer to find more compact decision trees.

Hypothesis spaces (model space). How many distinct decision trees with n Boolean attributes? This equals the number of Boolean functions, i.e., the number of distinct truth tables with 2^n rows, which is 2^(2^n). E.g., with 6 Boolean attributes, there are 18,446,744,073,709,551,616 trees. How many purely conjunctive hypotheses (e.g., Hungry ∧ ¬Rain)? Each attribute can be in (positive), in (negative), or out, giving 3^n distinct conjunctive hypotheses. A more expressive hypothesis space increases the chance that the target function can be expressed, but also increases the number of hypotheses consistent with the training set, so we may get worse predictions.
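The counts above are just the formulas 2^(2^n) and 3^n evaluated; a quick check:

```python
# Check the counts of Boolean functions and conjunctive hypotheses over n attributes.
for n in (2, 6):
    print(n, "attributes:", 2 ** (2 ** n), "Boolean functions,", 3 ** n, "conjunctions")
# 6 attributes -> 18446744073709551616 Boolean functions, 729 conjunctions
```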

Decision tree learning. A tree induction algorithm learns a model (a decision tree) from the training set (Induction / Learn Model); the model is then applied to the test set (Deduction / Apply Model).

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Example of a decision tree. Training Data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: a decision tree whose splitting attributes are Refund, MarSt, TaxInc: Refund = Yes -> NO; Refund = No -> MarSt; MarSt = Married -> NO; MarSt = Single, Divorced -> TaxInc; TaxInc < 80K -> NO; TaxInc > 80K -> YES.

Another example of a decision tree. Using the same training data, a different tree fits equally well: MarSt = Married -> NO; MarSt = Single, Divorced -> Refund; Refund = Yes -> NO; Refund = No -> TaxInc; TaxInc < 80K -> NO; TaxInc > 80K -> YES. There could be more than one tree that fits the same data!

Top-Down Induction of Decision Trees. Main loop: A <- the best decision attribute for the next node; assign A as the decision attribute for the node; for each value of A, create a new descendant of the node; sort the training examples to the leaf nodes; if the training examples are perfectly classified, then STOP, else iterate over the new leaf nodes.
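A compact recursive sketch of this main loop (the data representation, a list of attribute-value dicts plus a parallel label list, and the pluggable choose_best_attribute are assumptions; the impurity measures needed to implement that choice are developed on the Entropy and Information Gain slides below):

```python
# Sketch of top-down decision tree induction (greedy main loop).
from collections import Counter

def build_tree(examples, labels, attributes, choose_best_attribute):
    """examples: list of {attribute: value} dicts; labels: parallel list of class labels."""
    # Stop if the node is pure or no attributes remain; predict the majority label
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    A = choose_best_attribute(examples, labels, attributes)   # greedy choice for this node
    node = {"attribute": A, "branches": {}}
    for v in set(ex[A] for ex in examples):                   # one descendant per value of A
        idx = [i for i, ex in enumerate(examples) if ex[A] == v]
        node["branches"][v] = build_tree([examples[i] for i in idx],
                                         [labels[i] for i in idx],
                                         [a for a in attributes if a != A],
                                         choose_best_attribute)
    return node
```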

Tree Induction. Greedy strategy: split the records based on an attribute test that optimizes a certain criterion. Issues: determining how to split the records (how to specify the attribute test condition? how to determine the best split?) and determining when to stop splitting.

Splitting based on nominal attributes. Multi-way split: use as many partitions as there are distinct values (e.g., CarType into Family, Sports, Luxury). Binary split: divide the values into two subsets; need to find the optimal partitioning (e.g., {Sports, Luxury} vs. {Family}, or {Family, Luxury} vs. {Sports}).

Splitting based on ordinal attributes. Multi-way split: use as many partitions as there are distinct values (e.g., Size into Small, Medium, Large). Binary split: divide the values into two subsets; need to find the optimal partitioning (e.g., {Small, Medium} vs. {Large}, or {Medium, Large} vs. {Small}).

Splitting based on continuous attributes. Different ways of handling: (1) Discretization to form an ordinal categorical attribute, either static (discretize once at the beginning) or dynamic (ranges found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering). (2) Binary decision: (A < t) or (A ≥ t); consider all possible splits and find the best cut, which can be more compute-intensive. Examples: a binary split such as Taxable Income > 80K? (Yes/No), or a multi-way split such as Taxable Income in < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K.
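A small sketch of the binary-decision option: candidate cut points are taken as midpoints between consecutive sorted values, and every (A < t) split is scored. The scoring function is passed in, since impurity measures are only defined on the following slides; the names are illustrative assumptions.

```python
# Sketch: exhaustive search for the best binary cut (A < t) on a continuous attribute.
def best_threshold(values, labels, score_split):
    """score_split(left_labels, right_labels) -> lower is better (e.g. weighted impurity)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    xs = [values[i] for i in order]
    ys = [labels[i] for i in order]
    best_t, best_score = None, float("inf")
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue                      # no cut between equal values
        t = (xs[i] + xs[i - 1]) / 2       # midpoint candidate threshold
        score = score_split(ys[:i], ys[i:])
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score
```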

How to determine the best split. Idea: a good attribute splits the examples into subsets that are (ideally) all positive or all negative. Homogeneous subsets have a low degree of impurity; non-homogeneous subsets have a high degree of impurity. Greedy approach: nodes with a homogeneous class distribution are preferred, so we need a measure of node impurity.

How to compare attributes? Entropy. The entropy H(X) of a random variable X is the expected number of bits needed to encode a randomly drawn value of X (under the most efficient code). Information theory: the most efficient code assigns -log₂ P(X = i) bits to encode the message X = i, so the expected number of bits to code one random X is H(X) = -Σ_i P(X = i) log₂ P(X = i).

Sample entropy. S is a sample of training examples; p+ is the proportion of positive examples in S and p- is the proportion of negative examples in S. Entropy measures the impurity of S: Entropy(S) = -p+ log₂ p+ - p- log₂ p-.

Examples for computing entropy. (1) C1: 0, C2: 6. P(C1) = 0/6 = 0, P(C2) = 6/6 = 1. Entropy = -0 log₂ 0 - 1 log₂ 1 = 0 - 0 = 0. (2) C1: 1, C2: 5. P(C1) = 1/6, P(C2) = 5/6. Entropy = -(1/6) log₂(1/6) - (5/6) log₂(5/6) = 0.65. (3) C1: 2, C2: 4. P(C1) = 2/6, P(C2) = 4/6. Entropy = -(2/6) log₂(2/6) - (4/6) log₂(4/6) = 0.92.
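These three examples can be reproduced in a few lines (the function name and count representation are illustrative):

```python
# Reproduce the entropy examples for class counts (0, 6), (1, 5), (2, 4).
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)  # 0 log 0 treated as 0

for counts in [(0, 6), (1, 5), (2, 4)]:
    print(counts, round(entropy(counts), 2))   # -> 0.0, 0.65, 0.92
```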

How to compare attributes? Conditional entropy of variable X given variable Y. Given a specific value Y = v, the entropy of X is H(X | Y = v) = -Σ_i P(X = i | Y = v) log₂ P(X = i | Y = v). The conditional entropy H(X | Y) of X is the average of H(X | Y = v) over the values of Y: H(X | Y) = Σ_v P(Y = v) H(X | Y = v). The mutual information (aka information gain) of X given Y is I(X; Y) = H(X) - H(X | Y).

Information Gain. Information gain (after splitting a node): GAIN_split = Entropy(p) - Σ_{i=1}^k (n_i / n) Entropy(i), where the parent node p with n records is split into k partitions and n_i is the number of records in partition i. It measures the reduction in entropy achieved because of the split; choose the split that achieves the most reduction (maximizes GAIN).
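A direct translation of the formula (the label-list representation and the names are assumptions); the example split reuses the Refund attribute from the tax-fraud training data above:

```python
# Sketch: information gain of splitting a node's labels into k partitions.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent_labels, partitions):
    """GAIN_split = Entropy(parent) - sum_i (n_i / n) * Entropy(partition_i)."""
    n = len(parent_labels)
    weighted = sum(len(p) / n * entropy(p) for p in partitions)
    return entropy(parent_labels) - weighted

# Splitting the 10 tax-fraud records (3 'Yes', 7 'No') on Refund:
parent = ["Yes"] * 3 + ["No"] * 7
refund_yes = ["No"] * 3                     # Refund = Yes: 3 records, all 'No'
refund_no = ["Yes"] * 3 + ["No"] * 4        # Refund = No: 3 'Yes', 4 'No'
print(round(information_gain(parent, [refund_yes, refund_no]), 3))   # ~0.192
```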

Problem with splitting using information gain. Disadvantage: it tends to prefer splits that result in a large number of partitions, each being small but pure. Gain ratio: GainRATIO_split = GAIN_split / SplitINFO, where SplitINFO = -Σ_{i=1}^k (n_i / n) log(n_i / n). This adjusts information gain by the entropy of the partitioning (SplitINFO); higher-entropy partitioning (a large number of small partitions) is penalized. Used in C4.5; designed to overcome the disadvantage of information gain.
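SplitINFO and the ratio in the same style (again a sketch with assumed names, not lecture code):

```python
# Sketch: gain ratio = information gain / split information.
from math import log2

def split_info(partition_sizes):
    n = sum(partition_sizes)
    return -sum((ni / n) * log2(ni / n) for ni in partition_sizes if ni > 0)

def gain_ratio(gain, partition_sizes):
    si = split_info(partition_sizes)
    return gain / si if si > 0 else 0.0   # guard: a single partition has SplitINFO = 0

# A many-way split is penalized: for the same gain, higher SplitINFO means a lower ratio.
print(split_info([5, 5]), split_info([1] * 10))   # 1.0 vs ~3.32
```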

Stopping criteria for tree induction: stop expanding a node when all the records belong to the same class; stop expanding a node when all the records have similar attribute values; early termination (to be discussed later).

Decision-tree-based classification. Advantages: inexpensive to construct, extremely fast at classifying unknown records, easy to interpret for small-sized trees, and accuracy comparable to other classification techniques on many simple data sets. Example: C4.5 uses a simple depth-first construction and information gain, and sorts continuous attributes at each node. It needs the entire dataset to fit in memory, so it is unsuitable for large datasets (which would require out-of-core sorting). You can download the software from: http://www.cse.unsw.edu.au/~quinlan/c4.5r8.tar.gz
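For completeness, a hedged sketch of fitting an entropy-criterion tree on the tax-fraud table with scikit-learn (this is not C4.5 itself; scikit-learn implements an optimized CART variant, and the hand-made one-hot encoding below is an assumption):

```python
# Sketch: entropy-criterion decision tree on the tax-fraud table (categoricals one-hot encoded by hand).
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Columns: Refund(1=Yes), Single, Divorced, Married, TaxableIncome (in K)
X = np.array([[1,1,0,0,125],[0,0,0,1,100],[0,1,0,0,70],[1,0,0,1,120],[0,0,1,0,95],
              [0,0,0,1,60],[1,0,1,0,220],[0,1,0,0,85],[0,0,0,1,75],[0,1,0,0,90]])
y = ["No","No","No","No","Yes","No","No","Yes","No","Yes"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=["Refund","Single","Divorced","Married","TaxInc"]))
```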