Introduction to Learning & Decision Trees



Artificial Intelligence: Representation and Problem Solving, 15-381, April 10, 2007. Introduction to Learning & Decision Trees.

What is learning?
- more than just memorizing facts
- learning the underlying structure of the problem or data
- a.k.a. regression, pattern recognition, machine learning, data mining

A fundamental aspect of learning is generalization: given a few examples, can you generalize to others?

Learning is ubiquitous:
- medical diagnosis: identify new disorders from observations
- loan applications: predict risk of default
- prediction (climate, stocks, etc.): predict the future from current and past data
- speech/object recognition: from examples, generalize to others

Representation

How do we model or represent the world? All learning requires some form of representation. Learning: adjust the model parameters {θ_1, ..., θ_n} to match the world (or data).

The complexity of learning

Fundamental trade-off in learning: the complexity of the model vs. the amount of data required to learn its parameters. The more complex the model, the more it can describe, but the more data it requires to constrain the parameters.

Consider a hypothesis space of N models:
- How many bits would it take to identify which of the N models is correct?
- log2(N) in the worst case

We want simple models that explain the examples and generalize to others - Ockham's (some say Occam's) razor.
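The log2(N) count can be checked directly: each yes/no question can at best halve the set of surviving hypotheses, so identifying one of N models takes ceil(log2(N)) questions in the worst case. A minimal sketch (the helper name is ours, not from the slides):

```python
import math

def questions_needed(n_models: int) -> int:
    """Worst-case number of yes/no questions to single out one of n models.

    Each binary question can at best halve the set of surviving hypotheses,
    so ceil(log2(n)) questions are necessary and sufficient.
    """
    return math.ceil(math.log2(n_models))

# 8 models take 3 questions; 1000 models take 10
```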

Complex learning example: curve fitting

Data: targets generated as t = sin(2πx) + noise, observed at points x_n with targets t_n. How do we model the data?

Polynomial curve fitting: fit a degree-M polynomial

y(x, w) = w_0 + w_1 x + w_2 x^2 + ... + w_M x^M = sum_{j=0}^{M} w_j x^j

by minimizing the sum-of-squares error

E(w) = (1/2) sum_{n=1}^{N} [y(x_n, w) - t_n]^2

(example from Bishop (2006), Pattern Recognition and Machine Learning)
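The fit above can be sketched with plain least squares via the normal equations. This is our illustration, not Bishop's code; the data are generated as on the slide, with an assumed noise level of 0.2:

```python
import math
import random

def fit_polynomial(xs, ts, M):
    """Least-squares fit of y(x, w) = sum_j w_j x^j for j = 0..M,
    by solving the normal equations (X^T X) w = X^T t directly."""
    n = M + 1
    X = [[x ** j for j in range(n)] for x in xs]
    A = [[sum(row[i] * row[j] for row in X) for j in range(n)] for i in range(n)]
    b = [sum(row[i] * t for row, t in zip(X, ts)) for i in range(n)]
    # Gaussian elimination with partial pivoting
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (b[r] - sum(A[r][c] * w[c] for c in range(r + 1, n))) / A[r][r]
    return w

def sse(xs, ts, w):
    """E(w) = (1/2) sum_n [y(x_n, w) - t_n]^2"""
    y = lambda x: sum(wj * x ** j for j, wj in enumerate(w))
    return 0.5 * sum((y(x) - t) ** 2 for x, t in zip(xs, ts))

random.seed(0)
xs = [i / 9 for i in range(10)]
ts = [math.sin(2 * math.pi * x) + random.gauss(0, 0.2) for x in xs]
w0 = fit_polynomial(xs, ts, 0)   # a constant: far too simple
w1 = fit_polynomial(xs, ts, 1)   # a line: still too simple
w3 = fit_polynomial(xs, ts, 3)   # a reasonable fit to the sine
w9 = fit_polynomial(xs, ts, 9)   # enough freedom to chase the noise: overfitting
```

Training error can only decrease as M grows, which is exactly why training error alone cannot detect overfitting.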

With too flexible a model, the fit follows the noise rather than the underlying curve - this is overfitting. More data are needed to learn the correct model. (example from Bishop (2006), Pattern Recognition and Machine Learning)

Types of learning
- supervised: the model {θ_1, ..., θ_n} is fit to the world (or data) using the desired outputs {y_1, ..., y_n}
- unsupervised: the model is fit to the world (or data) with no desired outputs given
- reinforcement: the model is fit using a reinforcement signal that scores the model's output

Decision Trees

Decision trees: classifying from a set of attributes.

Predicting credit risk: a table of ten loan applicants with attributes "<2 years at current job?" and "missed payments?", and the outcome "defaulted?" (3 bad, 7 good). An example tree:

    [bad: 3, good: 7]
    missed payments?
      N -> [bad: 1, good: 6]
             <2 years at current job?
               N -> [bad: 0, good: 3]
               Y -> [bad: 1, good: 3]
      Y -> [bad: 2, good: 1]

Each level splits the data according to a different attribute. Goal: achieve perfect classification with a minimal number of decisions - not always possible due to noise or inconsistencies in the data.

Observations

Any boolean function can be represented by a decision tree. Trees are not good for all functions, however, e.g.:
- parity function: return 1 iff an even number of inputs are 1
- majority function: return 1 if more than half of the inputs are 1

Trees work best when a small number of attributes provide a lot of information. Note: finding the optimal tree for arbitrary data is NP-hard.

Decision trees with continuous values

Predicting credit risk with continuous attributes:

    years at current job | # missed payments | defaulted?
    7    | 0 | N
    0.75 | 0 | Y
    3    | 0 | N
    9    | 0 | N
    4    | 2 | Y
    0.25 | 0 | N
    5    |   | N
    8    | 4 | Y
    1.0  | 0 | N
    1.75 | 0 | N

(scatter plot of the data, with splits such as "# missed payments > 1.5" drawn as axis-aligned boundaries)

Now the tree corresponds to the order and placement of the boundaries. General case:
- arbitrary number of attributes: binary, multi-valued, or continuous
- output: binary, multi-valued (decision or axis-aligned classification trees), or continuous (regression trees)
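The parity claim can be made concrete: for parity, no single input carries any information about the output, so the greedy splitting introduced below has nothing to work with, while for majority every input is individually informative. A minimal sketch (function names are ours, not from the slides):

```python
import math
from itertools import product
from collections import Counter

def entropy(labels):
    """H(Y) in bits, estimated from a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """H(Y) minus the weighted entropy after splitting on attribute attr."""
    n = len(rows)
    gain = entropy(labels)
    for v in set(r[attr] for r in rows):
        subset = [y for r, y in zip(rows, labels) if r[attr] == v]
        gain -= len(subset) / n * entropy(subset)
    return gain

bits = list(product([0, 1], repeat=3))
parity = [int(sum(b) % 2 == 0) for b in bits]    # 1 iff an even number of 1s
majority = [int(sum(b) > 1) for b in bits]       # 1 iff more than half are 1

parity_gains = [info_gain(bits, parity, i) for i in range(3)]      # all zero
majority_gains = [info_gain(bits, majority, i) for i in range(3)]  # all positive
```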

Examples: loan applications, medical diagnosis, movie preferences (the Netflix contest), spam filters, security screening - many real-world systems, and many AI successes.

In each case, we want:
- accurate classification, i.e. minimize error
- efficient decision making, i.e. the fewest number of decisions/tests

The decision sequence could be further complicated:
- we may want to minimize false negatives in medical diagnosis, or minimize the cost of the test sequence
- we don't want to miss important email

Decision Trees are a simple example of inductive learning:
1. learn a decision tree from training examples
2. predict classes for novel testing examples

Generalization is how well we do on the testing examples. It only works if we can learn the underlying structure of the data.

Choosing the attributes

How do we find a decision tree that agrees with the training data? We could just choose a tree that has one path to a leaf for each example:
- but this just memorizes the observations (assuming the data are consistent)
- we want it to generalize to new examples

Ideally, the best attribute would partition the data into all-positive and all-negative examples. Greedy strategy: choose the attribute that gives the best partition first. We want correct classification with the fewest number of tests.

Problems (illustrated with the credit-risk example from before)
- How do we decide which attribute or value to split on?
- When should we stop splitting?
- What do we do when we can't achieve perfect classification?
- What if the tree is too large? Can we approximate it with a smaller tree?

Basic algorithm for learning decision trees
1. start with the whole training data
2. select the attribute or value along a dimension that gives the best split
3. create child nodes based on the split
4. recurse on each child using the child's data, until a stopping criterion is reached:
   - all examples have the same class
   - the amount of data is too small
   - the tree is too large

Central problem: How do we choose the best attribute?

Measuring information

A convenient measure to use is based on information theory. How much information does an attribute give us about the class?
- attributes that perfectly partition the data should give maximal information
- unrelated attributes should give no information

Information of a symbol w: I(w) = -log2 P(w)

P(w) = 1/2: I(w) = -log2(1/2) = 1 bit
P(w) = 1/4: I(w) = -log2(1/4) = 2 bits
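The four-step procedure above can be sketched as a short recursive learner. This is a minimal Python sketch, not the lecture's own code: it uses information gain (defined on the following slides) as the split criterion, stops on pure nodes, exhausted attributes, or too little data, and returns the majority class at a leaf. The attribute names and the min_size parameter are illustrative assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_attribute(rows, labels, attrs):
    """Greedy step: pick the attribute with the largest information gain."""
    def gain(a):
        g = entropy(labels)
        for v in set(r[a] for r in rows):
            sub = [y for r, y in zip(rows, labels) if r[a] == v]
            g -= len(sub) / len(rows) * entropy(sub)
        return g
    return max(attrs, key=gain)

def grow_tree(rows, labels, attrs, min_size=1):
    # stopping criteria: one class left, no attributes left, or too little data
    if len(set(labels)) == 1 or not attrs or len(rows) <= min_size:
        return Counter(labels).most_common(1)[0][0]   # leaf: majority class
    a = best_attribute(rows, labels, attrs)
    children = {}
    for v in set(r[a] for r in rows):
        sub = [(r, y) for r, y in zip(rows, labels) if r[a] == v]
        srows, slabels = zip(*sub)
        children[v] = grow_tree(list(srows), list(slabels),
                                [b for b in attrs if b != a], min_size)
    return (a, children)   # internal node: (attribute, value -> subtree)

def predict(tree, row):
    while isinstance(tree, tuple):
        a, children = tree
        tree = children[row[a]]
    return tree
```

On a toy version of the credit data, grow_tree returns a tree whose root tests the most informative attribute, exactly the greedy strategy above.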

Information and Entropy

I(w) = -log2 P(w). For a random variable X with probability P(x), the entropy is the average (or expected) amount of information obtained by observing x:

H(X) = sum_x P(x) I(x) = -sum_x P(x) log2 P(x)

Note: H(X) depends only on the probabilities, not on the values. H(X) quantifies the uncertainty in the data in terms of bits, and gives a lower bound on the cost (in bits) of coding (or describing) X.

P(heads) = 1/2: H = -(1/2) log2(1/2) - (1/2) log2(1/2) = 1 bit
P(heads) = 1/3: H = -(1/3) log2(1/3) - (2/3) log2(2/3) = 0.9183 bits

Entropy of a binary random variable

Example: a single random variable X with X = 1 with probability p and X = 0 with probability 1-p. Entropy is maximal at p = 0.5 and zero at p = 0 or p = 1. Note that H(p) is 1 bit when p = 1/2.
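The coin examples can be checked numerically. A minimal sketch (the function name is ours, not from the slides):

```python
import math

def entropy(probs):
    """H(X) = -sum_x P(x) log2 P(x), in bits; 0 * log 0 is treated as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair = entropy([0.5, 0.5])        # 1.0 bit: maximal uncertainty
biased = entropy([1/3, 2/3])      # ~0.9183 bits
certain = entropy([1.0, 0.0])     # 0 bits: no uncertainty at all
```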

Example: strings of English characters (symbols A-Z and space)

H_0 = 4.76 bits/char
H_1 = 4.03 bits/char
H_4 = 2.8 bits/char (a fourth-order approximation)

Note that as the order of the model is increased, the entropy decreases: higher-order conditional models, i.e. P(c_i | c_{i-1}, c_{i-2}, ..., c_{i-k}), capture more specific structure, and generated samples look more like real English. This is an example of the relationship between efficient coding and the representation of signal structure. (Michael S. Lewicki, CMU)

Credit risk revisited

How many bits does it take to specify the attribute "defaulted?"?
- P(defaulted = Y) = 3/10
- P(defaulted = N) = 7/10

H(Y) = -sum_{i=Y,N} P(Y = y_i) log2 P(Y = y_i) = -0.3 log2 0.3 - 0.7 log2 0.7 = 0.8813

How much can we reduce the entropy (or uncertainty) of "defaulted" by knowing the other attributes? Ideally, we could reduce it to zero, in which case we would classify perfectly.
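The zeroth-order figure is just log2 of the alphabet size, and a first-order estimate comes from single-character frequencies. A minimal sketch (the sample string is ours, so its H_1 will not match the 4.03 bits/char estimate from large English corpora):

```python
import math
from collections import Counter

def unigram_entropy(text):
    """First-order estimate H_1: entropy of single-character frequencies."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Zeroth order: 26 letters + space, assumed equiprobable
H0 = math.log2(27)            # ~4.755 bits/char, the 4.76 figure above

sample = "the quick brown fox jumps over the lazy dog and the cat sat on the mat"
H1 = unigram_entropy(sample)  # below H0, since characters are not equiprobable
```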

Conditional entropy

H(Y|X) is the remaining entropy of Y given X, i.e. the expected (or average) entropy of P(y|x):

H(Y|X) = -sum_x P(x) sum_y P(y|x) log2 P(y|x)
       = -sum_x P(x) sum_y P(Y=y|X=x) log2 P(Y=y|X=x)
       = sum_x P(x) H(Y|X=x)

H(Y|X=x) is the specific conditional entropy, i.e. the entropy of Y knowing the value of a specific attribute x.

Back to the credit risk example (using the ten-applicant table from before):

H(defaulted | <2years = N) = -(2/6) log2(2/6) - (4/6) log2(4/6) = 0.9183
H(defaulted | <2years = Y) = -(1/4) log2(1/4) - (3/4) log2(3/4) = 0.8113
H(defaulted | <2years) = (6/10) * 0.9183 + (4/10) * 0.8113 = 0.8755

H(defaulted | missed = N) = -(1/7) log2(1/7) - (6/7) log2(6/7) = 0.5917
H(defaulted | missed = Y) = -(2/3) log2(2/3) - (1/3) log2(1/3) = 0.9183
H(defaulted | missed) = (7/10) * 0.5917 + (3/10) * 0.9183 = 0.6897
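The same numbers fall out of the (bad, good) counts at each branch of the credit-risk example. A minimal sketch (function names are ours):

```python
import math

def entropy_from_counts(counts):
    """Entropy in bits from raw class counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def conditional_entropy(groups):
    """H(Y|X) = sum_x P(x) H(Y|X=x), with groups mapping x -> class counts."""
    total = sum(sum(c) for c in groups.values())
    return sum(sum(c) / total * entropy_from_counts(c) for c in groups.values())

# (bad, good) counts for "defaulted?", read off the example's branches
by_missed = {"N": (1, 6), "Y": (2, 1)}
by_job = {"N": (2, 4), "Y": (1, 3)}

H_given_missed = conditional_entropy(by_missed)   # ~0.6897 bits
H_given_job = conditional_entropy(by_job)         # ~0.8755 bits
```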

Mutual information

We now have the entropy - the minimal number of bits required to specify the target attribute:

H(Y) = -sum_y P(y) log2 P(y)

and the conditional entropy - the remaining entropy of Y knowing X:

H(Y|X) = -sum_x P(x) sum_y P(y|x) log2 P(y|x)

So we can now define the reduction in the entropy of Y after learning X. This is known as the mutual information between Y and X:

I(Y;X) = H(Y) - H(Y|X)

Properties of mutual information
- It is symmetric: I(Y;X) = I(X;Y)
- In terms of probability distributions, it is written as I(X;Y) = sum_{x,y} P(x,y) log2 [ P(x,y) / (P(x) P(y)) ]
- It is zero if Y provides no information about X: I(X;Y) = 0 iff X and Y are independent
- If Y = X, then I(X;X) = H(X) - H(X|X) = H(X)
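The symmetry and independence properties can be verified numerically from the joint-distribution form. A minimal sketch (the two example joints are illustrative assumptions, not from the slides):

```python
import math

def mutual_information(joint):
    """I(X;Y) = sum_{x,y} P(x,y) log2 [ P(x,y) / (P(x) P(y)) ]."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p    # marginal P(x)
        py[y] = py.get(y, 0.0) + p    # marginal P(y)
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# a dependent pair: Y mostly copies X
dep = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
# an independent pair: mutual information should be exactly zero
indep = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
# symmetry: swapping the roles of X and Y leaves I unchanged
swapped = {(y, x): p for (x, y), p in dep.items()}
```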

Information gain

H(defaulted) - H(defaulted | <2 years) = 0.8813 - 0.8755 = 0.0058
H(defaulted) - H(defaulted | missed) = 0.8813 - 0.6897 = 0.1916

Missed payments is the most informative attribute about defaulting, so the learned tree splits first on "missed payments?" (N: bad 1, good 6; Y: bad 2, good 1), and then splits the N branch on "<2 years at current job?" (N: bad 0, good 3; Y: bad 1, good 3).

A small dataset: Miles Per Gallon

Example (from Andrew Moore): predicting miles per gallon. http://www.autonlab.org/tutorials/dtree.html From the UCI repository (thanks to Ross Quinlan). 40 records:

mpg  | cylinders | displacement | horsepower | weight | acceleration | modelyear | maker
good | 4 | low    | low    | low    | high   | 75to78 | asia
bad  | 6 | medium | medium | medium | medium | 70to74 | america
bad  | 4 | medium | medium | medium | low    | 75to78 | europe
bad  | 8 | high   | high   | high   | low    | 70to74 | america
bad  | 6 | medium | medium | medium | medium | 70to74 | america
bad  | 4 | low    | medium | low    | medium | 70to74 | asia
bad  | 4 | low    | medium | low    | low    | 70to74 | asia
bad  | 8 | high   | high   | high   | low    | 75to78 | america
:    | : | :      | :      | :      | :      | :      | :
bad  | 8 | high   | high   | high   | low    | 70to74 | america
good | 8 | high   | medium | high   | high   | 79to83 | america
bad  | 8 | high   | high   | high   | low    | 75to78 | america
good | 4 | low    | low    | low    | low    | 79to83 | america
bad  | 6 | medium | medium | medium | high   | 75to78 | america
good | 4 | medium | low    | low    | low    | 79to83 | america
good | 4 | low    | low    | medium | high   | 79to83 | america
bad  | 8 | high   | high   | high   | low    | 70to74 | america
good | 4 | low    | medium | low    | medium | 75to78 | europe
bad  | 5 | medium | medium | medium | medium | 75to78 | europe
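Before computing gains on the MPG data, the credit-risk gains can be recomputed from the class counts alone; a minimal sketch (counts read off the credit-risk tree; tiny rounding differences in the last digit are possible):

```python
import math

def entropy_from_counts(counts):
    """Entropy in bits from raw class counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def information_gain(prior_counts, groups):
    """Gain = H(Y) - sum_x P(x) H(Y|X=x), all from (bad, good) counts."""
    total = sum(prior_counts)
    h_prior = entropy_from_counts(prior_counts)
    h_cond = sum(sum(c) / total * entropy_from_counts(c)
                 for c in groups.values())
    return h_prior - h_cond

prior = (3, 7)                                                     # 3 bad, 7 good
gain_missed = information_gain(prior, {"N": (1, 6), "Y": (2, 1)})  # ~0.1916
gain_job = information_gain(prior, {"N": (2, 4), "Y": (1, 3)})     # ~0.0058
```

The greedy learner picks "missed payments?" first because gain_missed is the larger of the two.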

First step: calculate information gains

Suppose we want to predict MPG. Compute the information gain for each attribute and look at all the gains. In this case, cylinders provides the most gain, because it nearly partitions the data. (Copyright Andrew W. Moore)

First decision: partition on cylinders - a decision stump. Note the lopsided mpg class distribution.

Recursion step

Take the original dataset and partition it according to the value of the attribute we split on: records in which cylinders = 4, records in which cylinders = 5, records in which cylinders = 6, and records in which cylinders = 8. Then recurse on the child nodes to expand the tree: build a tree from each set of records - exactly the same procedure, but with smaller, conditioned datasets.

Second level of decisions

Second level of the tree: recursively build a tree from the seven records in which there are four cylinders and the maker was based in Asia (similar recursion in the other cases). Why don't we expand these nodes?

The final tree.