Introduction to Machine Learning Lecture 10. Mehryar Mohri Courant Institute and Google Research

mohri@cims.nyu.edu

Decision Trees
Mehryar Mohri - Foundations of Machine Learning

Supervised Learning Problem

Training data: sample drawn i.i.d. from set $X$ according to some distribution $D$,
$S = ((x_1, y_1), \ldots, (x_m, y_m)) \in X \times Y$.
- classification: $\mathrm{Card}(Y) = k$.
- regression: $Y \subseteq \mathbb{R}$.
Problem: find classifier $h \colon X \to Y$ in $H$ with small generalization error.

Advantages

- Interpretation: explain complex data, result easy to analyze and understand.
- Adaptation: easy to update with new data.
- Different types of variables: categorical, numerical.
- Monotone transformation invariance: measuring unit is not a concern.
- Dealing with missing labels.
But: beware of interpretation!

Example - Playing Golf

[Figure: a decision tree predicting play vs. do not play, with splits on humidity <= 40%, wind <= 12 mph, visibility > 10 mi, and barometer <= 18 in. Misclassification rates are indicated at each node, e.g., play (77/423) at the root.]

Decision Trees

[Figure: left, a binary tree with root question X1 < a1, internal questions X1 < a2, X2 < a3, X2 < a4, and YES/NO branches leading to leaves R1, ..., R5; right, the corresponding partition of the (X1, X2) plane into rectangular regions R1, ..., R5 by the thresholds a1, ..., a4.]

Different Types of Questions

Decision trees:
- $X \in \{\text{blue}, \text{white}, \text{red}\}$: categorical questions.
- $X \le a$: continuous variables.
Binary space partition (BSP) trees:
- $\sum_{i=1}^{n} \alpha_i X_i \le a$: partitioning with convex polyhedral regions.
Sphere trees:
- $\|X - a_0\| \le a$: partitioning with pieces of spheres.

Prediction

In each region $R_t$ (tree leaf):
- classification: majority vote, ties broken arbitrarily:
  $\hat{y}_t = \operatorname{argmax}_{y \in Y} |\{x_i \in R_t : i \in [1, m], y_i = y\}|$.
- regression: average value:
  $\hat{y}_t = \frac{1}{|R_t|} \sum_{x_i \in R_t} y_i$.
For confident predictions, need enough points in each region.
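As a concrete sketch, the two leaf-prediction rules above can be written in a few lines of Python (the function names are mine, not from the lecture):

```python
from collections import Counter

def leaf_classification(labels):
    """Majority vote over the labels falling in a region; ties are
    broken arbitrarily (here by Counter.most_common's ordering)."""
    return Counter(labels).most_common(1)[0][0]

def leaf_regression(values):
    """Average target value over the points falling in a region."""
    return sum(values) / len(values)
```

With few points in a region both estimates are noisy, which is exactly why confident predictions need enough points per region.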

Learning

How to build a decision tree from data:
- choose the question, e.g., $(x_3 \le a)$, yielding the best purity.
- partition data into the corresponding subsets.
- reiterate with the resulting subsets.
- stop when regions are approximately pure.

Impurity Criteria - Classification

Binary case: $p$ = fraction of positive instances.
- misclassification: $F(p) = \min(p, 1 - p)$.
- entropy: $F(p) = -p \log_2(p) - (1 - p) \log_2(1 - p)$.
- Gini index: $F(p) = 2p(1 - p)$.

Impurity Criteria - Regression

Mean squared error: $F(R) = \frac{1}{|R|} \sum_{x_i \in R} (y_i - \bar{y})^2$.
Other similar $L_p$-norm criteria.
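A minimal Python sketch of the impurity measures above: the three binary classification criteria as functions of $p$, and the regression MSE over a region's targets (helper names are my own):

```python
from math import log2

def misclassification(p):
    return min(p, 1 - p)

def entropy(p):
    if p in (0.0, 1.0):  # convention: 0 * log2(0) = 0
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def gini(p):
    return 2 * p * (1 - p)

def mse(ys):
    """Mean squared error of a region's targets around their mean."""
    ybar = sum(ys) / len(ys)
    return sum((y - ybar) ** 2 for y in ys) / len(ys)
```

All three classification criteria peak at $p = 1/2$ (maximally impure region) and vanish at $p \in \{0, 1\}$ (pure region).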

Training

Problem: the general problem of determining a partition with minimum empirical error is NP-hard.
Heuristics: greedy algorithm. For all $j \in [1, N]$, $\theta \in \mathbb{R}$:
$R^+(j, \theta) = \{x_i \in R : x_i[j] \ge \theta, i \in [1, m]\}$
$R^-(j, \theta) = \{x_i \in R : x_i[j] < \theta, i \in [1, m]\}$.

Decision-Trees(S = ((x_1, y_1), ..., (x_m, y_m)))
1  P <- {S}  (initial partition)
2  for each region R in P such that Pred(R) do
3      (j, theta) <- argmin_{(j, theta)} error(R^-(j, theta)) + error(R^+(j, theta))
4      P <- (P - {R}) U {R^-(j, theta), R^+(j, theta)}
5  return P
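A runnable Python sketch of line 3 of the pseudocode: exhaustively scoring every (feature $j$, threshold $\theta$) pair by the summed error of the two induced subsets, here using misclassification counts as the error. All names are my own, and ties are broken by scan order:

```python
def misclass_count(points):
    """Number of points not in the majority class (0 for an empty set)."""
    if not points:
        return 0
    labels = [y for _, y in points]
    return len(labels) - max(labels.count(y) for y in set(labels))

def best_split(region):
    """region: list of (x, y) pairs, x a feature vector.
    Returns the (j, theta) minimizing error(R-) + error(R+)."""
    n_features = len(region[0][0])
    best = None
    for j in range(n_features):
        for theta in sorted({x[j] for x, _ in region}):
            r_minus = [(x, y) for x, y in region if x[j] < theta]
            r_plus = [(x, y) for x, y in region if x[j] >= theta]
            score = misclass_count(r_minus) + misclass_count(r_plus)
            if best is None or score < best[0]:
                best = (score, j, theta)
    return best[1], best[2]

# Two clusters separable on feature 0:
data = [((1, 0), "a"), ((2, 9), "a"), ((8, 1), "b"), ((9, 8), "b")]
j, theta = best_split(data)
```

On this toy set the scan finds a perfect split on feature 0; restricting candidate thresholds to observed feature values loses nothing, since the induced partitions are the same.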

Overfitting

Problem: size of the tree?
- tree must be large enough to fit the data.
- tree must be small enough not to overfit.
- minimizing training error or impurity does not help.
Theory: generalization bound
$R(h) \le \widehat{R}(h) + O\left(\sqrt{\frac{\text{complexity measure}}{m}}\right)$;
thus minimize $(\text{impurity} + \alpha\,|\text{tree}|)$.

Controlling Size of Tree

Grow-then-prune strategy (CART):
- create a very large tree.
- prune back according to some criterion.
Pruning criterion: $(\text{impurity} + \alpha\,|\text{tree}|)$, with $\alpha$ determined by cross-validation.
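The pruning criterion can be illustrated on a toy tree encoding. This is a simplified bottom-up collapse using the $(\text{impurity} + \alpha\,|\text{tree}|)$ cost, not CART's exact weakest-link pruning, and the tuple representation is my own:

```python
# A tree is ("leaf", impurity) or ("node", impurity_if_collapsed, left, right),
# where impurity_if_collapsed is the impurity the node's region would have
# as a single leaf.

def n_leaves(t):
    return 1 if t[0] == "leaf" else n_leaves(t[2]) + n_leaves(t[3])

def impurity(t):
    return t[1] if t[0] == "leaf" else impurity(t[2]) + impurity(t[3])

def prune(t, alpha):
    """Collapse a subtree into a leaf whenever doing so does not
    increase impurity + alpha * (number of leaves)."""
    if t[0] == "leaf":
        return t
    t = ("node", t[1], prune(t[2], alpha), prune(t[3], alpha))
    if t[1] + alpha <= impurity(t) + alpha * n_leaves(t):
        return ("leaf", t[1])
    return t

# Splitting reduces impurity from 5 to 2 + 2 = 4 at the cost of one extra leaf:
tree = ("node", 5.0, ("leaf", 2.0), ("leaf", 2.0))
```

With a small $\alpha$ the split is worth its extra leaf and survives; with a large $\alpha$ it is pruned away. Choosing $\alpha$ by cross-validation picks the trade-off that generalizes best.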

Categorical Variables

Problem: with $N$ possible unordered values, e.g., color (blue, white, red), there are $2^{N-1} - 1$ possible partitions.
Solution (when there are only two possible outcomes): sort the values according to the fraction of positives in each, e.g., white .9, red .45, blue .3. Then split as with ordered variables.
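The sorting trick above can be sketched directly for binary labels in {0, 1} (function and variable names are my own):

```python
from collections import defaultdict

def order_categories(points):
    """points: list of (category, label) pairs with labels in {0, 1}.
    Returns the categories sorted by decreasing fraction of positives,
    so binary splits only need to be scanned along this ordering:
    N - 1 candidate splits instead of 2**(N-1) - 1 subsets."""
    pos = defaultdict(int)
    tot = defaultdict(int)
    for c, y in points:
        pos[c] += y
        tot[c] += 1
    return sorted(tot, key=lambda c: pos[c] / tot[c], reverse=True)

data = [("white", 1), ("white", 1), ("red", 1), ("red", 0), ("blue", 0)]
order = order_categories(data)
```

Each candidate split then groups a prefix of the ordering (e.g., {white} vs. {red, blue}) against the rest, exactly as with an ordered variable.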

Missing Values

Problem: points $x$ with missing values, due to:
- the proper measurement not being taken,
- a source causing the absence of labels.
Solution:
- categorical case: create a new category 'missing'.
- use surrogate variables: use only those variables that are available for a split.

Instability

Problem: high variance.
- small changes in the data may lead to very different splits.
- price to pay for the hierarchical nature of decision trees.
- more stable criteria could be used.

Decision Tree Tools

Most commonly used tools for learning decision trees:
- CART (classification and regression tree) (Breiman et al., 1984).
- C4.5 (Quinlan, 1986, 1993) and its commercial successor C5.0 (RuleQuest Research).
Differences: minor between the latest versions.

Summary

- Straightforward to train.
- Easily interpretable (modulo instability).
- Often not the best results in practice: boosting decision trees (next lecture).

References

Leo Breiman, Jerome Friedman, Richard A. Olshen, and Charles J. Stone. Classification and Regression Trees. Chapman & Hall, 1984.
Luc Devroye, Laszlo Gyorfi, and Gabor Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996.
J. R. Quinlan. Induction of Decision Trees. Machine Learning, 1(1):81-106, 1986.
J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.