Classification and Regression Trees

David Rosenberg, New York University
DS-GA 1003, February 28, 2015

Regression Trees: General Tree Structure
(Figure from Criminisi et al., MSR-TR-2011-114, 28 October 2011.)

Regression Trees: Decision Tree
(Figure from Criminisi et al., MSR-TR-2011-114, 28 October 2011.)

Regression Trees: Binary Decision Tree on $\mathbb{R}^2$
Consider a binary decision tree on $\{(X_1, X_2) : X_1, X_2 \in \mathbb{R}\}$.
(Figure from An Introduction to Statistical Learning, with applications in R (Springer, 2013), with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.)

Regression Trees: Binary Regression Tree on $\mathbb{R}^2$
Consider a binary regression tree on $\{(X_1, X_2) : X_1, X_2 \in \mathbb{R}\}$.
(Figure from An Introduction to Statistical Learning, with applications in R (Springer, 2013), with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.)

Regression Trees: Fitting a Regression Tree
The decision tree gives a partition of $\mathcal{X}$ into regions $\{R_1, \ldots, R_M\}$.
Recall that a partition is a disjoint union, that is:
$\mathcal{X} = R_1 \cup R_2 \cup \cdots \cup R_M$ and $R_i \cap R_j = \emptyset$ for all $i \neq j$.

Regression Trees: Fitting a Regression Tree
Given the partition $\{R_1, \ldots, R_M\}$, the final prediction is
$f(x) = \sum_{m=1}^{M} c_m \mathbf{1}(x \in R_m).$
How do we choose $c_1, \ldots, c_M$? For the loss function $\ell(\hat{y}, y) = (\hat{y} - y)^2$, the best choice is $\hat{c}_m = \operatorname{ave}(y_i \mid x_i \in R_m)$.
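To make this concrete, here is a minimal Python sketch (not from the slides), where each region is represented by a hypothetical membership function that returns True when x falls in that region:

```python
import numpy as np

def fit_leaf_values(X, y, regions):
    """For each region R_m, the squared-error-optimal constant c_m is the
    mean of the y_i whose x_i fall in R_m."""
    return [y[np.array([in_region(x) for x in X])].mean() for in_region in regions]

def predict(x, regions, c):
    """f(x) = sum_m c_m * 1(x in R_m); for a partition exactly one term fires."""
    for in_region, c_m in zip(regions, c):
        if in_region(x):
            return c_m

# Illustrative example: two regions of the real line split at 0.
regions = [lambda x: x[0] <= 0, lambda x: x[0] > 0]
X = np.array([[-2.0], [-1.0], [1.0], [3.0]])
y = np.array([1.0, 3.0, 10.0, 12.0])
c = fit_leaf_values(X, y, regions)            # [2.0, 11.0]
print(predict(np.array([0.5]), regions, c))   # 11.0
```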

Regression Trees: Complexity of a Tree
Let $|T| = M$ denote the number of terminal nodes in $T$. We will use $|T|$ to measure the complexity of a tree.
For any given complexity, we want the tree minimizing squared error on the training set.
Finding the optimal binary tree of a given complexity is computationally intractable, so we proceed with a greedy algorithm: build the tree one node at a time, without any planning ahead.

Regression Trees: Root Node, Continuous Variables
Let $x = (x_1, \ldots, x_d) \in \mathbb{R}^d$.
Splitting variable $j \in \{1, \ldots, d\}$; split point $s \in \mathbb{R}$.
Partition based on $j$ and $s$:
$R_1(j, s) = \{x \mid x_j \le s\}$
$R_2(j, s) = \{x \mid x_j > s\}$

Regression Trees: Root Node, Continuous Variables
For each splitting variable $j$ and split point $s$, let
$\hat{c}_1(j, s) = \operatorname{ave}(y_i \mid x_i \in R_1(j, s))$ and $\hat{c}_2(j, s) = \operatorname{ave}(y_i \mid x_i \in R_2(j, s))$.
Find $j, s$ minimizing
$\sum_{i : x_i \in R_1(j, s)} (y_i - \hat{c}_1(j, s))^2 + \sum_{i : x_i \in R_2(j, s)} (y_i - \hat{c}_2(j, s))^2.$
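A brute-force version of this split search, as a Python sketch (written for clarity rather than speed; real implementations sort each feature once):

```python
import numpy as np

def best_split(X, y):
    """Search all splitting variables j and candidate split points s, returning
    the (j, s) minimizing the sum of squared errors around the two region means."""
    best_j, best_s, best_sse = None, None, np.inf
    n, d = X.shape
    for j in range(d):
        for s in np.unique(X[:, j]):
            left = X[:, j] <= s
            right = ~left
            if not left.any() or not right.any():
                continue  # both regions must be non-empty
            sse = ((y[left] - y[left].mean()) ** 2).sum() \
                + ((y[right] - y[right].mean()) ** 2).sum()
            if sse < best_sse:
                best_j, best_s, best_sse = j, s, sse
    return best_j, best_s, best_sse
```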

Regression Trees: Then Proceed Recursively
1. We have determined $R_1$ and $R_2$.
2. Find the best split for the points in $R_1$.
3. Find the best split for the points in $R_2$.
4. Continue...
When do we stop?

Regression Trees: Complexity Control Strategy
If the tree is too big, we may overfit. If it is too small, we may miss patterns in the data (underfit).
Typical approach:
1. Build a really big tree (e.g., until all regions have at most 5 points).
2. Prune the tree.

Regression Trees: Tree Terminology
Each internal node has a splitting variable and a split point; it corresponds to a binary partition of the space.
A terminal node (or leaf node) corresponds to a region and to a particular prediction.
A subtree $T \subset T_0$ is any tree obtained by pruning $T_0$, which means collapsing any number of its internal nodes.

Regression Trees: Tree Pruning (Full Tree $T_0$)
(Figure from An Introduction to Statistical Learning, with applications in R (Springer, 2013), with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.)

Regression Trees: Tree Pruning (Subtree $T \subset T_0$)
(Figure from An Introduction to Statistical Learning, with applications in R (Springer, 2013), with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.)

Regression Trees: Empirical Risk and Tree Complexity
Suppose we want to prune a big tree $T_0$.
Let $\hat{R}(T)$ be the empirical risk of $T$ (i.e., the squared error on the training set).
Clearly, for any subtree $T \subset T_0$, $\hat{R}(T) \ge \hat{R}(T_0)$.
Let $|T|$ be the number of terminal nodes in $T$; $|T|$ is our measure of complexity for a tree.

Regression Trees: Cost Complexity (or Weakest Link) Pruning
Definition: the cost complexity criterion with parameter $\alpha$ is
$C_\alpha(T) = \hat{R}(T) + \alpha |T|.$
It trades off between the empirical risk and the complexity of the tree.
Cost complexity pruning: for each $\alpha$, find the subtree $T \subset T_0$ minimizing $C_\alpha(T)$. Use cross-validation to find the right choice of $\alpha$.
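In practice this is usually done with an off-the-shelf implementation. The following sketch assumes scikit-learn (the slide names no particular library), whose cost_complexity_pruning_path and ccp_alpha parameters implement cost-complexity pruning:

```python
# Sketch of cost-complexity pruning with alpha chosen by cross-validation.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

def prune_by_cross_validation(X, y):
    # Candidate alphas come from the pruning path of the fully grown tree T_0.
    path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
    cv_scores = [
        cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=0),
                        X, y, cv=5, scoring="neg_mean_squared_error").mean()
        for a in path.ccp_alphas
    ]
    best_alpha = path.ccp_alphas[int(np.argmax(cv_scores))]
    return DecisionTreeRegressor(ccp_alpha=best_alpha, random_state=0).fit(X, y)
```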

Regression Trees: Greedy Pruning is Sufficient
Find the subtree $T_1 \subset T_0$ that minimizes $\hat{R}(T_1) - \hat{R}(T_0)$, i.e., collapse the internal node whose removal increases the empirical risk the least.
Then find $T_2 \subset T_1$ in the same way. Repeat until we have just a single node.
If $N$ is the number of nodes of $T_0$ (terminal and internal), then we end up with a nested set of trees:
$\mathcal{T} = \{T_0 \supset T_1 \supset T_2 \supset \cdots \supset T_N\}.$
Breiman et al. (1984) proved that this is all you need. That is:
$\{\arg\min_{T \subset T_0} C_\alpha(T) : \alpha \ge 0\} \subset \mathcal{T}.$

Regression Trees: Regularization Path for Trees
(Figure.)

Classification Trees
Consider the classification case: $\mathcal{Y} = \{1, 2, \ldots, K\}$.
We need to modify:
- the criteria for splitting nodes
- the method for pruning the tree

Classification Trees
Let node $m$ represent region $R_m$, with $N_m$ observations.
Denote the proportion of observations in $R_m$ with class $k$ by
$\hat{p}_{mk} = \frac{1}{N_m} \sum_{i : x_i \in R_m} \mathbf{1}(y_i = k).$
The predicted classification for node $m$ is $k(m) = \arg\max_k \hat{p}_{mk}$.
The predicted class probability distribution is $(\hat{p}_{m1}, \ldots, \hat{p}_{mK})$.
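As a small illustrative sketch (function and variable names are mine, not the slide's), these node-level quantities are just label proportions:

```python
import numpy as np

def node_summary(y_node, classes):
    """Return (p_hat_m, k(m)): the class-proportion vector for the N_m labels
    in node m, and the majority-vote prediction arg max_k p_hat_mk."""
    p_hat = np.array([np.mean(y_node == k) for k in classes])
    return p_hat, classes[int(np.argmax(p_hat))]

p_hat, k_m = node_summary(np.array([1, 1, 2, 3, 1]), classes=np.array([1, 2, 3]))
# p_hat = [0.6, 0.2, 0.2], k_m = 1
```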

Classification Trees: Misclassification Error
Consider node $m$ representing region $R_m$, with $N_m$ observations.
Suppose we predict $k(m) = \arg\max_k \hat{p}_{mk}$ as the class for all inputs in region $R_m$.
What is the misclassification rate on the training data? It is just $1 - \hat{p}_{m k(m)}$.

Classification Trees: Node Impurity Measures
Consider node $m$ representing region $R_m$, with $N_m$ observations.
How can we generalize from squared error to classification? We will introduce several measures of node impurity.
We want pure leaf nodes (i.e., as close to a single class as possible).
We will find splitting variables and split points minimizing node impurity.

Classification Trees: Two-Class Node Impurity Measures
Consider binary classification. Let $p$ be the relative frequency of class 1.
Three node impurity measures as a function of $p$ (HTF, Figure 9.3).

Classification Trees: Node Impurity Measures
Consider leaf node $m$ representing region $R_m$, with $N_m$ observations.
Three measures $Q_m(T)$ of node impurity for leaf node $m$:
Misclassification error: $1 - \hat{p}_{m k(m)}$.
Gini index: $\sum_{k=1}^{K} \hat{p}_{mk} (1 - \hat{p}_{mk})$.
Entropy or deviance: $-\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}$.
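The three impurity measures as plain Python functions of the class-proportion vector (a sketch; conventions such as the log base vary across references):

```python
import numpy as np

def misclassification_error(p):
    """1 - p_{m k(m)}: one minus the largest class proportion."""
    return 1.0 - np.max(p)

def gini_index(p):
    """sum_k p_k (1 - p_k)."""
    return float(np.sum(p * (1.0 - p)))

def entropy(p):
    """-sum_k p_k log p_k, treating 0 log 0 as 0."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))
```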

Classification Trees: Class Distributions (Pre-split)
(Figure from Criminisi et al., MSR-TR-2011-114, 28 October 2011.)

Classification Trees: Class Distributions (Split Search)
(Maximizing information gain is equivalent to minimizing entropy.)
(Figure from Criminisi et al., MSR-TR-2011-114, 28 October 2011.)

Classification Trees: How Exactly Do We Do This?
Let $R_L$ and $R_R$ be the regions corresponding to a potential node split.
Suppose we have $N_L$ points in $R_L$ and $N_R$ points in $R_R$.
Let $Q(R_L)$ and $Q(R_R)$ be the node impurity measures.
Then we search for a split that minimizes
$N_L Q(R_L) + N_R Q(R_R).$
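A minimal sketch of the resulting split score, which can reuse any of the impurity functions above (helper names are illustrative):

```python
import numpy as np

def split_score(y_left, y_right, impurity, classes):
    """N_L * Q(R_L) + N_R * Q(R_R) for a candidate split; lower is better."""
    def proportions(y):
        return np.array([np.mean(y == k) for k in classes])
    return (len(y_left) * impurity(proportions(y_left))
            + len(y_right) * impurity(proportions(y_right)))
```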

Classification Trees: Node Impurity Measures
For building the tree, the Gini index and entropy are more effective: they push for purer nodes, not just a lower misclassification rate.
For pruning the tree, use misclassification error; it is closer to a risk estimate.

Trees in General: Missing Features (or Predictors)
Features are also called covariates or predictors. What can we do about missing features?
- Throw out inputs with missing features.
- Impute missing values with feature means.
- If a feature is categorical, let "missing" be a new category.
For trees, we can use surrogate splits:
- For every internal node, form a list of surrogate features and split points.
- The goal is to approximate the original split as well as possible.
- Surrogates are ordered by how well they approximate the original split.

Trees in General: Categorical Features
Suppose we have a feature with $q$ possible (unordered) values. We want to find the best split into 2 groups.
There are $2^{q-1} - 1$ possible partitions. Search time?
For binary classification ($K = 2$), there is an efficient algorithm (Breiman 1984). Otherwise, we can use approximations.
Statistical issue? If a feature has a very large number of categories, we can overfit.
Extreme example: a "Row Number" feature could lead to perfect classification with a single split.

Trees in General: Trees vs. Linear Models
Trees have to work much harder to capture linear relations.
(Figure: comparison of a linear model and a tree fit on two features $X_1$ and $X_2$; from An Introduction to Statistical Learning, with applications in R (Springer, 2013), with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.)

Trees in General: Interpretability
Trees are certainly easy to explain: you can show a tree on a slide.
Small trees seem interpretable; for large trees, maybe not so easy.

Trees in General: Trees for Nonlinear Feature Discovery
Suppose tree $T$ gives the partition $R_1, \ldots, R_M$. Predictions are
$f(x) = \sum_{m=1}^{M} c_m \mathbf{1}(x \in R_m).$
If we make a feature $\mathbf{1}(x \in R)$ for every region $R$, we can view this as a linear model.
Trees can be used to discover nonlinear features.
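One way to act on this, sketched with scikit-learn (the slide names no library, and max_leaf_nodes=20 is an arbitrary illustrative choice): fit a tree, one-hot encode its leaf assignments into indicator features $\mathbf{1}(x \in R_m)$, and fit a linear model on those indicators.

```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression

def tree_leaf_features_then_linear(X, y):
    tree = DecisionTreeRegressor(max_leaf_nodes=20).fit(X, y)
    leaf_ids = tree.apply(X).reshape(-1, 1)              # which region R_m each x falls in
    encoder = OneHotEncoder(handle_unknown="ignore").fit(leaf_ids)
    Z = encoder.transform(leaf_ids)                      # indicator features 1(x in R_m)
    linear = LinearRegression().fit(Z, y)                # linear model over region indicators
    return tree, encoder, linear
```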

Trees in General: Instability / High Variance of Trees
Trees are high variance: if we randomly split the data, we may get quite different trees from each part.
By contrast, linear models have low variance (at least when well regularized).
Later we will investigate several ways to reduce this variance.

Trees in General: Comments about Trees
Trees make no use of geometry:
- No inner products or distances (called a "nonmetric" method).
- Feature scale is irrelevant.
Predictions are not continuous:
- Not so bad for classification.
- May not be desirable for regression.