Supervised Learning (contd) Decision Trees. Mausam (based on slides by UW-AI faculty)



Decision Trees: to play or not to play? [Image: http://www.sfgate.com/blogs/images/sfgate/sgreen/2007/09/05/2240773250x321.jpg]

Example data for learning the concept "Good day for tennis":

Day  Outlook  Humid  Wind  PlayTennis?
d1   s        h      w     n
d2   s        h      s     n
d3   o        h      w     y
d4   r        h      w     y
d5   r        n      w     y
d6   r        n      s     y
d7   o        n      s     y
d8   s        h      w     n
d9   s        n      w     y
d10  r        n      w     y
d11  s        n      s     y
d12  o        h      s     y
d13  o        n      w     y
d14  r        h      s     n

Outlook = sunny (s), overcast (o), rain (r); Humidity = high (h), normal (n); Wind = weak (w), strong (s)

A Decision Tree for the Same Data
Decision tree for PlayTennis? (leaves = classification output; arcs = choice of value for the parent attribute):

Outlook
- Sunny -> Humidity
  - High -> No
  - Normal -> Yes
- Overcast -> Yes
- Rain -> Wind
  - Strong -> No
  - Weak -> Yes

The decision tree is equivalent to logic in disjunctive normal form:
PlayTennis ⇔ (Sunny ∧ Normal) ∨ Overcast ∨ (Rain ∧ Weak)
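To make the tree's structure concrete, here is a minimal sketch in Python (my own illustration, not from the slides; the nested-dict encoding and the classify helper are assumptions) that encodes this tree and runs one row of the data table through it:

```python
# The PlayTennis tree from the slide, as nested dicts:
# internal nodes map an attribute to {value: subtree}, leaves are "Yes"/"No".
TREE = {"Outlook": {
    "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}},
}}

def classify(tree, example):
    """Walk from the root to a leaf by following the example's attribute values."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches[example[attribute]]
    return tree

# d1 from the data table: sunny, high humidity, weak wind -> PlayTennis? = No
print(classify(TREE, {"Outlook": "Sunny", "Humidity": "High", "Wind": "Weak"}))  # "No"
```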

Example: Decision Tree for Continuous-Valued Features and Discrete Output
Input: real-valued attributes (x1, x2); classification output: 0 or 1. How do we branch on the attribute values x1 and x2 to partition the space correctly? [Figure: labeled points in the (x1, x2) plane.]

Example: Classification of Continuous-Valued Inputs
[Figure: an axis-parallel partition of the (x1, x2) plane and the corresponding decision tree.]

Expressiveness of Decision Trees
Decision trees can express any function of the input attributes. E.g., for Boolean functions, each truth-table row corresponds to a path from the root to a leaf. Trivially, there is a consistent decision tree for any training set, with one path to a leaf per example, but such a tree will most likely not generalize to new examples. We therefore prefer to find more compact decision trees.

Learning Decision Trees
Example: When should I wait for a table at a restaurant?
Attributes (features) relevant to the Wait? decision:
1. Alternate: is there an alternative restaurant nearby?
2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)

Example Decision Tree: a decision tree for Wait? based on personal "rules of thumb". [Figure: hand-built decision tree for the restaurant domain.]

Input Data for Learning: past examples of when I did / did not wait for a table. The classification of each example is positive (T) or negative (F).

Decision Tree Learning
Aim: find a small tree consistent with the training examples.
Idea: (recursively) choose the "most significant" attribute as the root of the (sub)tree.
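A compact sketch of that recursive idea (illustrative code, not the lecture's; choose_attribute is passed in as a parameter and would typically rank attributes by the information gain defined on the following slides):

```python
from collections import Counter

def dtl(examples, attributes, default, choose_attribute):
    """Recursive decision-tree learning.

    examples: list of (feature_dict, label) pairs
    attributes: attribute names still available for splitting
    default: label to return if there are no examples
    choose_attribute: function(attributes, examples) -> "most significant" attribute
    """
    if not examples:
        return default                                   # no data left: parent majority
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:
        return labels[0]                                 # pure node: make a leaf
    if not attributes:
        return Counter(labels).most_common(1)[0][0]      # out of attributes: majority leaf
    best = choose_attribute(attributes, examples)
    majority = Counter(labels).most_common(1)[0][0]
    subtree = {}
    for value in sorted({feats[best] for feats, _ in examples}):
        subset = [(f, l) for f, l in examples if f[best] == value]
        remaining = [a for a in attributes if a != best]
        subtree[value] = dtl(subset, remaining, majority, choose_attribute)
    return {best: subtree}
```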

Choosing an attribute to split on
Idea: a good attribute should reduce uncertainty, e.g., it splits the examples into subsets that are (ideally) "all positive" or "all negative". Patrons? is a better choice than Type?: after splitting on Type?, to wait or not to wait is still at 50% in every branch.

How do we quantify uncertainty? http://a.espncdn.com/media/ten/2006/0306/photo/g_mcenroe_195.jpg

Using information theory to quantify uncertainty
Entropy measures the amount of uncertainty in a probability distribution. The entropy (or information content) of an answer to a question with possible answers v_1, ..., v_n is

I(P(v_1), \ldots, P(v_n)) = \sum_{i=1}^{n} -P(v_i) \log_2 P(v_i)

Using information theory
Imagine we have p examples with Wait = true (positive) and n examples with Wait = false (negative). Our best estimates of the probabilities of Wait = true or false are

P(\text{true}) \approx \frac{p}{p+n}, \qquad P(\text{false}) \approx \frac{n}{p+n}

Hence the entropy of Wait is

I\!\left(\frac{p}{p+n}, \frac{n}{p+n}\right) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}
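These two formulas translate directly into a small Python helper (a sketch; the function name and the example calls are mine):

```python
from math import log2

def binary_entropy(p, n):
    """Entropy I(p/(p+n), n/(p+n)) of a node with p positive and n negative examples."""
    total = p + n
    entropy = 0.0
    for count in (p, n):
        if count:                       # 0 * log2(0) is taken to be 0
            prob = count / total
            entropy -= prob * log2(prob)
    return entropy

print(binary_entropy(6, 6))   # 1.0 bit: the 12 restaurant examples before any split
print(binary_entropy(2, 4))   # ~0.918 bits: the 'Full' branch of Patrons (later slide)
```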

Entropy
Entropy is highest when uncertainty is greatest. [Figure: binary entropy I plotted against P(Wait = T) from 0 to 1; it is 0 at the endpoints and peaks at 1.0 when P(Wait = T) = 0.5.]
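A short script that reproduces this curve, assuming numpy and matplotlib are available (the slide itself only shows the figure):

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.001, 0.999, 500)                     # P(Wait = T), avoiding log2(0)
entropy = -p * np.log2(p) - (1 - p) * np.log2(1 - p)   # binary entropy in bits

plt.plot(p, entropy)
plt.xlabel("P(Wait = T)")
plt.ylabel("Entropy I (bits)")
plt.title("Entropy peaks at 1 bit when P(Wait = T) = 0.5")
plt.show()
```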

Choosing an attribute to split on
Idea: a good attribute should reduce uncertainty and result in a gain in information. How much information do we gain if we disclose the value of some attribute? Answer: (uncertainty before) − (uncertainty after).

Back at the Restaurant
Before choosing an attribute (6 positive and 6 negative examples):
Entropy = −(6/12) log2(6/12) − (6/12) log2(6/12) = −log2(1/2) = log2 2 = 1 bit
There is 1 bit of information to be discovered.

Back at the Restaurant
If we choose Type: along the branch "French" we have entropy = 1 bit, and similarly for the other branches, so information gain = 1 − 1 = 0 along every branch.
If we choose Patrons: in the branches None and Some, entropy = 0; for Full, entropy = −(2/6) log2(2/6) − (4/6) log2(4/6) ≈ 0.92. Info gain = (1 − 0) or (1 − 0.92) bits > 0 in both cases.
So choosing Patrons gains more information!

Entropy across branches
How do we combine the entropy of different branches? Answer: compute the average entropy, weighting each branch's entropy by the probability of reaching that branch. 2/12 of the time we enter None, so the weight for None is 1/6; Some has weight 4/12 = 1/3; Full has weight 6/12 = 1/2.

AvgEntropy(A) = \sum_{i=1}^{n} \frac{p_i + n_i}{p + n} \, I\!\left(\frac{p_i}{p_i + n_i}, \frac{n_i}{p_i + n_i}\right)

where (p_i + n_i)/(p + n) is the weight of branch i and the I(...) term is the entropy of branch i.

Information gain
Information Gain (IG), or reduction in entropy, from using attribute A:
IG(A) = Entropy before − AvgEntropy after choosing A
Choose the attribute with the largest IG.

Information gain in our example

IG(Patrons) = 1 - \left[\tfrac{2}{12} I(0,1) + \tfrac{4}{12} I(1,0) + \tfrac{6}{12} I\!\left(\tfrac{2}{6}, \tfrac{4}{6}\right)\right] \approx 0.541 bits

IG(Type) = 1 - \left[\tfrac{2}{12} I\!\left(\tfrac{1}{2}, \tfrac{1}{2}\right) + \tfrac{2}{12} I\!\left(\tfrac{1}{2}, \tfrac{1}{2}\right) + \tfrac{4}{12} I\!\left(\tfrac{2}{4}, \tfrac{2}{4}\right) + \tfrac{4}{12} I\!\left(\tfrac{2}{4}, \tfrac{2}{4}\right)\right] = 0 bits

Patrons has the highest IG of all attributes, so the DTL algorithm chooses Patrons as the root.
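The same arithmetic can be checked in a few lines of Python (a sketch; the per-branch positive/negative counts are read off the 12 restaurant examples from the earlier slides):

```python
from math import log2

def entropy(p, n):
    """Entropy (in bits) of a branch with p positive and n negative examples."""
    total = p + n
    return sum(-c / total * log2(c / total) for c in (p, n) if c)

def info_gain(branches, p, n):
    """IG(A) = entropy before the split minus the weighted entropy of A's branches.
    branches: list of (positives, negatives) counts, one pair per attribute value."""
    before = entropy(p, n)
    after = sum((pi + ni) / (p + n) * entropy(pi, ni) for pi, ni in branches)
    return before - after

# Counts from the 12 restaurant examples (6 positive, 6 negative overall):
print(info_gain([(0, 2), (4, 0), (2, 4)], 6, 6))          # Patrons: None, Some, Full -> ~0.541
print(info_gain([(1, 1), (1, 1), (2, 2), (2, 2)], 6, 6))  # Type: French, Italian, Thai, Burger -> 0.0
```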

Should I stay or should I go? Learned Decision Tree
The decision tree learned from the 12 examples is substantially simpler than the rules-of-thumb tree: a more complex hypothesis is not justified by such a small amount of data.

Performance Evaluation
How do we know that the learned tree h approximates the target function f? Answer: try h on a new test set of examples. A learning curve plots the percentage correct on the test set as a function of training-set size.
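One way to produce such a learning curve, sketched with scikit-learn (the slide does not prescribe any library; the dataset below is only a stand-in so the snippet runs end to end):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer          # stand-in dataset for illustration
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Percentage correct on held-out data as a function of training-set size
sizes, train_scores, test_scores = learning_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5, scoring="accuracy")

for size, acc in zip(sizes, test_scores.mean(axis=1)):
    print(f"{size:4d} training examples -> test accuracy {acc:.3f}")
```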

Overfitting
[Figure: accuracy (roughly 0.6-0.9) on training data and on test data, plotted against the number of nodes in the decision tree.]


Rule #2 of Machine Learning
The best hypothesis almost never achieves 100% accuracy on the training data.
(Rule #1 was: you can't learn anything without inductive bias.)

Avoiding overfitting
- Stop growing the tree when a data split is not statistically significant, or
- Grow the full tree and then prune it.
How to select the best tree?
- Measure performance over the training data
- Measure performance over a separate validation set
- Add a complexity penalty to the performance measure

Early Stopping
"Remember this tree and use it as the final classifier."
[Figure: accuracy (roughly 0.6-0.9) on training data, validation data, and test data, plotted against the number of nodes in the decision tree.]

Reduced Error Pruning
[Figure: data split into several Tune folds and a Test fold.]
- Split the data into a training set and a validation set
- Repeat until further pruning is harmful:
  - Remove each subtree in turn, replace it with the majority class, and evaluate on the validation set
  - Permanently remove the subtree that leads to the largest gain in validation accuracy

Reduced Error Pruning Example
Outlook
- Sunny -> Humidity (High -> Don't play; Low -> Play)
- Overcast -> Play
- Rain -> Wind (Strong -> Don't play; Weak -> Play)
Validation set accuracy = 0.75

Reduced Error Pruning Example (Humidity subtree replaced by its majority class)
Outlook
- Sunny -> Don't play
- Overcast -> Play
- Rain -> Wind (Strong -> Don't play; Weak -> Play)
Validation set accuracy = 0.80

Reduced Error Pruning Example (Wind subtree replaced by its majority class instead)
Outlook
- Sunny -> Humidity (High -> Don't play; Low -> Play)
- Overcast -> Play
- Rain -> Play
Validation set accuracy = 0.70

Reduced Error Pruning Example
Outlook
- Sunny -> Don't play
- Overcast -> Play
- Rain -> Wind (Strong -> Don't play; Weak -> Play)
Pruning the Humidity subtree gave the largest gain in validation accuracy, so use this as the final tree.
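A simplified sketch of reduced-error pruning for the nested-dict tree representation used earlier (illustrative code; it prunes bottom-up whenever replacing a subtree with the majority class does not hurt validation accuracy, whereas the slides repeatedly remove the single subtree with the largest validation gain):

```python
from collections import Counter

def classify(tree, example):
    """Follow attribute values down to a leaf (tree is nested dicts, leaves are labels).
    Assumes every attribute value seen at prediction time also appears in the tree."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches[example[attribute]]
    return tree

def accuracy(tree, examples):
    """Fraction of (features, label) pairs classified correctly."""
    return sum(classify(tree, f) == l for f, l in examples) / len(examples)

def reduced_error_prune(tree, train, validation):
    """Replace a subtree with the majority class of the training examples reaching it
    whenever that does not hurt accuracy on the validation examples reaching it."""
    if not isinstance(tree, dict) or not train or not validation:
        return tree                                            # leaf, or nothing to decide on
    attribute, branches = next(iter(tree.items()))
    pruned = {value: reduced_error_prune(
                  subtree,
                  [(f, l) for f, l in train if f[attribute] == value],
                  [(f, l) for f, l in validation if f[attribute] == value])
              for value, subtree in branches.items()}
    pruned_tree = {attribute: pruned}
    leaf = Counter(l for _, l in train).most_common(1)[0][0]   # majority-class replacement
    if accuracy(leaf, validation) >= accuracy(pruned_tree, validation):
        return leaf
    return pruned_tree
```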

Post Rule Pruning
- Split the data into a training set and a validation set
- Prune each rule independently:
  - Remove each pre-condition in turn and evaluate accuracy
  - Drop the pre-condition whose removal leads to the largest improvement in accuracy
Note: there are also ways to do this using only the training data and statistical tests.

Conversion to Rules
Tree:
Outlook
- Sunny -> Humidity (High -> Don't play; Low -> Play)
- Overcast -> Play
- Rain -> Wind (Strong -> Don't play; Weak -> Play)
Rules:
- Outlook = Sunny ∧ Humidity = High -> Don't play
- Outlook = Sunny ∧ Humidity = Low -> Play
- Outlook = Overcast -> Play
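A small sketch of the conversion step (illustrative; it walks the nested-dict tree representation from the earlier snippets and emits one rule per root-to-leaf path):

```python
def tree_to_rules(tree, conditions=()):
    """Return one (conditions, label) rule per root-to-leaf path of a nested-dict tree."""
    if not isinstance(tree, dict):
        return [(list(conditions), tree)]
    attribute, branches = next(iter(tree.items()))
    rules = []
    for value, subtree in branches.items():
        rules.extend(tree_to_rules(subtree, conditions + ((attribute, value),)))
    return rules

# The tree from this slide, in nested-dict form:
play_tree = {"Outlook": {
    "Sunny":    {"Humidity": {"High": "Don't play", "Low": "Play"}},
    "Overcast": "Play",
    "Rain":     {"Wind": {"Strong": "Don't play", "Weak": "Play"}},
}}

for conds, label in tree_to_rules(play_tree):
    print(" AND ".join(f"{a} = {v}" for a, v in conds), "->", label)
# e.g. "Outlook = Sunny AND Humidity = High -> Don't play"
```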

Scaling Up
- ID3 and C4.5 assume the data fits in main memory (OK for hundreds of thousands of examples)
- SPRINT, SLIQ: multiple sequential scans of the data (OK for millions of examples)
- VFDT: at most one sequential scan (OK for billions of examples)

Decision Trees: Strengths
- Very popular technique
- Fast
- Useful when:
  - the target function is discrete
  - concepts are likely to be disjunctions
  - attributes may be noisy

Decision Trees: Weaknesses
- Less useful for continuous outputs
- Can have difficulty with continuous input features as well
  - E.g., what if the target concept is a circle in the (x1, x2) plane? That is hard to represent with decision trees, but very simple with the instance-based methods we'll discuss later.