partykit: A Toolkit for Recursive Partytioning

Similar documents

Model-Based Recursive Partitioning for Detecting Interaction Effects in Subgroups

Open-Source Machine Learning: R Meets Weka

Benchmarking Open-Source Tree Learners in R/RWeka

R: A Free Software Project in Statistical Computing

Predictive Modeling of Titanic Survivors: a Learning Competition

Data Mining with R. Decision Trees and Random Forests. Hugh Murrell

Data Science with R Ensemble of Decision Trees

In this presentation, you will be introduced to data mining and the relationship with meaningful use.

Open-Source Machine Learning: R Meets Weka

Gerry Hobbs, Department of Statistics, West Virginia University

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

Data mining on the EPIRARE survey data

Package missforest. February 20, 2015

Chapter 11 Boosting. Xiaogang Su Department of Statistics University of Central Florida - 1 -

Efficient Integration of Data Mining Techniques in Database Management Systems

Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees

Classification of Titanic Passenger Data and Chances of Surviving the Disaster Data Mining with Weka and Kaggle Competition Data

Model-based Recursive Partitioning

Guide to Credit Scoring in R

EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER. Copyr i g ht 2013, SAS Ins titut e Inc. All rights res er ve d.

Keywords Data mining, Classification Algorithm, Decision tree, J48, Random forest, Random tree, LMT, WEKA 3.7. Fig.1. Data mining techniques.

Package hybridensemble

!"!!"#$$%&'()*+$(,%!"#$%$&'()*""%(+,'-*&./#-$&'(-&(0*".$#-$1"(2&."3$'45"

REPORT DOCUMENTATION PAGE

Implementing a Class of Structural Change Tests: An Econometric Computing Approach

DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES

Regression Modeling Strategies

Package bigrf. February 19, 2015

Final Project Report

Data Mining Classification: Decision Trees

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL

A Methodology for Predictive Failure Detection in Semiconductor Fabrication

Big Data Decision Trees with R

Introduction Predictive Analytics Tools: Weka

Université de Montpellier 2 Hugo Alatrista-Salas : hugo.alatrista-salas@teledetection.fr

SAS Software to Fit the Generalized Linear Model

Silvermine House Steenberg Office Park, Tokai 7945 Cape Town, South Africa Telephone:

An Introduction to WEKA. As presented by PACE

Package acrm. R topics documented: February 19, 2015

What is Data Mining? MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling

Comparison of Data Mining Techniques used for Financial Data Analysis

Data Mining Practical Machine Learning Tools and Techniques

Social Media Mining. Data Mining Essentials

ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS

8. Machine Learning Applied Artificial Intelligence

Visualizing Categorical Data in ViSta

Chapter 12 Discovering New Knowledge Data Mining

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat

TACKLING SIMPSON'S PARADOX IN BIG DATA USING CLASSIFICATION & REGRESSION TREES

New Work Item for ISO Predictive Analytics (Initial Notes and Thoughts) Introduction

COC131 Data Mining - Clustering

Consensus Based Ensemble Model for Spam Detection

KnowledgeSEEKER POWERFUL SEGMENTATION, STRATEGY DESIGN AND VISUALIZATION SOFTWARE

Ensembles and PMML in KNIME

M15_BERE8380_12_SE_C15.7.qxd 2/21/11 3:59 PM Page Analytics and Data Mining 1

Fine Particulate Matter Concentration Level Prediction by using Tree-based Ensemble Classification Algorithms

Oracle Data Miner (Extension of SQL Developer 4.0)

EMPIRICAL STUDY ON SELECTION OF TEAM MEMBERS FOR SOFTWARE PROJECTS DATA MINING APPROACH

Data mining techniques: decision trees

Lavastorm Analytic Library Predictive and Statistical Analytics Node Pack FAQs

THE HYBRID CART-LOGIT MODEL IN CLASSIFICATION AND DATA MINING. Dan Steinberg and N. Scott Cardell

M1 in Economics and Economics and Statistics Applied multivariate Analysis - Big data analytics Worksheet 3 - Random Forest

HLM software has been one of the leading statistical packages for hierarchical

Applied Multivariate Analysis - Big data analytics

Fortgeschrittene Computerintensive Methoden

A Hybrid Modeling Platform to meet Basel II Requirements in Banking Jeffery Morrision, SunTrust Bank, Inc.

STATISTICA. Financial Institutions. Case Study: Credit Scoring. and

Testing, Monitoring, and Dating Structural Changes in Exchange Rate Regimes

Statistics Graduate Courses

How To Find Seasonal Differences Between The Two Programs Of The Labor Market

Data Science with R. Introducing Data Mining with Rattle and R.

D-optimal plans in observational studies

Big Data Analytics. Benchmarking SAS, R, and Mahout. Allison J. Ames, Ralph Abbey, Wayne Thompson. SAS Institute Inc., Cary, NC

Multivariate Logistic Regression

Web Document Clustering

R Tools Evaluation. A review by Global BI / Local & Regional Capabilities. Telefónica CCDO May 2015

Monitoring Structural Change in Dynamic Econometric Models

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM

An Overview of Data Mining: Predictive Modeling for IR in the 21 st Century

Modelling and added value

Analysis Tools and Libraries for BigData

Risk pricing for Australian Motor Insurance

KnowledgeSTUDIO HIGH-PERFORMANCE PREDICTIVE ANALYTICS USING ADVANCED MODELING TECHNIQUES

VI. Introduction to Logistic Regression

Package MDM. February 19, 2015

Data Mining Introduction

Agenda. Mathias Lanner Sas Institute. Predictive Modeling Applications. Predictive Modeling Training Data. Beslutsträd och andra prediktiva modeller

Contents WEKA Microsoft SQL Database

STATISTICA Formula Guide: Logistic Regression. Table of Contents

Applied Predictive Modeling in R R. user! 2014

A Regression Approach for Forecasting Vendor Revenue in Telecommunication Industries

Chapter 12 Bagging and Random Forests

How To Understand How Weka Works

Winning the Kaggle Algorithmic Trading Challenge with the Composition of Many Models and Feature Engineering

Lecture 10: Regression Trees

user! 2013 Max Kuhn, Ph.D Pfizer Global R&D Groton, CT

Transcription:

partykit: A Toolkit for Recursive Partytioning Achim Zeileis, Torsten Hothorn http://eeecon.uibk.ac.at/~zeileis/

Overview Status quo: R software for tree models New package: partykit Unified infrastructure for recursive partytioning Classes and methods Interfaces to rpart, J48,... Illustrations Future: Next steps

R software for tree models Status quo: The CRAN task view on Machine Learning at http://cran.r-project.org/view=machinelearning lists numerous packages for tree-based modeling and recursive partitioning, including rpart (CART), tree (CART), mvpart (multivariate CART), RWeka (J4.8, M5, LMT), party (CTree, MOB), and many more... Related: Packages for tree-based ensemble methods such as random forests or boosting, e.g., randomforest, gbm, mboost, etc.

R software for tree models Moreover: Some tree algorithms with R packages that are not on CRAN, e.g., STIMA, and TINT. And further tree algorithms/software without R interface, e.g., QUEST, GUIDE, LOTUS, CRUISE,... Currently: All algorithms/software have to deal with similar problems but provide different solutions without reusing code.

R software for tree models Challenge: For implementing new algorithms in R, code is required not only for fitting the tree model on the learning data but also representing fitted trees, printing trees, plotting trees, computing predictions from trees.

R software for tree models Question: Wouldn t it be nice if there were an R package that provided code for representing fitted trees, printing trees, plotting trees, computing predictions from trees?

R software for tree models Answer: The R package partykit provides unified infrastructure for recursive partytioning, especially representing fitted trees, printing trees, plotting trees, computing predictions from trees!

partykit: Unified infrastructure Design principles: Toolkit for recursive partytitioning. One agnostic base class which can encompass an extremely wide range of different types of trees. Subclasses for important types of trees, e.g., trees with constant fits in each terminal node. Nodes are recursive objects, i.e., a node can contain child nodes. Keep (learning) data out of the recursive node and split structure. Basic printing, plotting, and predicting for raw node structure. Customization via suitable panel or panel-generating functions. Coercion from existing objects (rpart, J48, etc.) to new class. Use simple/fast S3 classes and methods.

partykit: Base classes Class constructors: Generate basic building blocks. partysplit(varid, breaks = NULL, index = NULL,...) where breaks provides the breakpoints wrt variable varid; index determines to which kid node observations are assigned. partynode(id, split = NULL, kids = NULL,...) where split is a partysplit and kids a list of partynode s. party(node, data, fitted = NULL,...) where node is a partynode and data the corresponding (learning) data (optionally without any rows) and fitted the corresponding fitted nodes. Additionally: All three objects have an info slot where optionally arbitrary information can be stored.

partykit: Further classes and methods Further classes: For trees with constant fits in each terminal node, both inheriting from party. constparty : Stores full observed response and fitted terminal nodes in fitted; predictions are computed from empirical distribution of the response. simpleparty : Stores only one predicted response value along with some summary details (such as error and sample size) for each terminal node in the corresponding info. Methods: Display: print(), plot(), predict(). Query: length(), width(), depth(), names(), nodeids(). Extract: [, [[, nodeapply(). Coercion: as.party().

partykit: Illustration Intention: Solution: Illustrate several trees using the same data. Not use the iris data (or something from mlbench). Use Titanic survival data (oh well... ). In case you are not familiar with it: Survival status, gender, age (child/adult), and class (st, 2nd, 3rd, crew) for the 22 persons on the ill-fated maiden voyage of the Titanic. Question: Who survived? Or how does the probability of survival vary across the covariates?

partykit: Interface to rpart CART: Apply rpart() to preprocessed ttnc data (see also below). R> rp <- rpart(survived ~ Gender + Age + Class, data = ttnc) Standard plot: R> plot(rp) R> text(rp) Visualization via partykit: R> plot(as.party(rp))

partykit: Interface to rpart Gender=a Age=b Class=c No No Class=c Yes No Yes

partykit: Interface to rpart Gender Male Female 2 Age 7 Class Adult Child 4 Class 3rd st, 2nd, Crew 3rd st, 2nd Node 3 (n = 667) Node 5 (n = 48) Node 6 (n = 6) Node 8 (n = 96) Node 9 (n = 274)

partykit: Interface to rpart R> rp n= 22 node), split, n, loss, yval, (yprob) * denotes terminal node ) root 22 7 No (76965.32335) 2) Gender=Male 73 367 No (.7879838 262) 4) Age=Adult 667 338 No (.797246 27594) * 5) Age=Child 64 29 No (.546875 5325) ) Class=3rd 48 3 No (.729667 78333) * ) Class=st,2nd 6 Yes (..) * 3) Gender=Female 47 26 Yes (6885.73949) 6) Class=3rd 96 9 No (.54863 59837) * 7) Class=st,2nd,Crew 274 2 Yes (.729927.92773) *

partykit: Interface to rpart R> as.party(rp) Model formula: Survived ~ Gender + Age + Class Fitted party: [] root [2] Gender in Male [3] Age in Adult: No (n = 667, err = 2.3%) [4] Age in Child [5] Class in 3rd: No (n = 48, err = 27.%) [6] Class in st, 2nd: Yes (n = 6, err =.%) [7] Gender in Female [8] Class in 3rd: No (n = 96, err = 45.9%) [9] Class in st, 2nd, Crew: Yes (n = 274, err = 7.3%) Number of inner nodes: 4 Number of terminal nodes: 5

partykit: Interface to rpart Prediction: Compare rpart s C code and partykit s R code for (artificially) large data set. R> nd <- ttnc[rep(:nrow(ttnc), ), ] R> system.time(p <- predict(rp, newdata = nd, type = "class")) user system elapsed 2.86.2 2.835 R> system.time(p2 <- predict(as.party(rp), newdata = nd)) user system elapsed 56. 54 R> table(rpart = p, party = p2) party rpart No Yes No 9 Yes 29

partykit: Interface to J48 J4.8: Open-source implementation of C4.5 in RWeka. R> j48 <- J48(Survived ~ Gender + Age + Class, data = ttnc) Results in a tree with multi-way splits which previously could only be displayed via Weka itself or Graphviz but not in R directly. Now: R> j48p <- as.party(j48) R> plot(j48p) Or just a subtree: R> plot(j48p[])

partykit: Interface to J48 Gender Male Female 2 Class Class st 2nd 3rdCrew 3 Age 6 Age st 2nd 3rd Crew Child Adult Child Adult n = 5 n = 75 n = n = 68 n = 5 n = 862 n = 45 n = 6 n = 96 n = 23

partykit: Interface to J48 Class st 2nd 3rd Crew Node 2 (n = 45) Node 3 (n = 6) Node 4 (n = 96) Node 5 (n = 23)

partykit: Further interfaces PMML: Predictive Model Markup Language. XML-based format exported by various software packages including SAS, SPSS, R/pmml. Here, reimport CART tree. R> pm <- pmmltreemodel("ttnc.xml") evtree: Evolutionary learning of globally optimal trees, directly using partykit. R> set.seed(7) R> ev <- evtree(survived ~ Gender + Age + Class, data = ttnc, + maxdepth = 3) CTree: Conditional inference trees ctree() are reimplemented more efficiently within partykit. CHAID: R package on R-Forge, directly using partykit. (Alternatively, use SPSS and export via PMML.)

partykit: PMML Gender Male Female 2 Age 7 Class Adult Child 3rd st, 2nd, Crew 3 No (n = 667, err = 2.3%) 4 Class 8 No (n = 96, err = 45.9%) 9 Yes (n = 274, err = 7.3%) 3rd st, 2nd 5 No (n = 48, err = 27.%) 6 Yes (n = 6, err =.%)

partykit: evtree Class st, 2nd, Crew 3rd 2 Gender Female Male 4 Age Child Adult Node 3 (n = 274) Node 5 (n = 6) Node 6 (n = 25) Node 7 (n = 76)

Next steps To do: The current code is already fairly well-tested and mostly stable. However, there are some important items on the task list. Extend/smooth package vignette. Add xtrafo/ytrafo to ctree() reimplementation. Switch mob() to new party class. Create new subclass modelparty that facilitates handling of formulas, terms, model frames, etc. for model-based trees.

Next steps Model-based recursive partitioning: Trees with parametric models in each node (e.g., based on least squares or maximum likelihood). Splitting based on parameter instability tests. Illustration: Logistic regression (or logit model), assessing differences in the effect of preferential treatment for women or children. R> library("party") R> mb <- mob(survived ~ Treatment Age + Gender + Class, + data = ttnc, model = glinearmodel, family = binomial(), + control = mob_control(alpha =.)) Odds ratio of survival given treatment differs across subsets (slope), as does the survival probability of male adults (intercept). R> coef(mb) (Intercept) TreatmentPreferential\n(Female Child) 2 -.64937.32696 4-2.397895 4.477337 5 -.5245 4.3823

Next steps Class p <. 3rd {st, 2nd, Crew} 3 Class p <. 2nd {st, Crew} Node 2 (n = 76) Node 4 (n = 285) Node 5 (n = 2) Normal (Male&Adult) Preferential (Female Child) Normal (Male&Adult) Preferential (Female Child) Normal (Male&Adult) Preferential (Female Child)

Computational details Software: All examples have been produced with R 2.4. and packages partykit.-2, rpart 3.-5, RWeka -9, evtree.-, party.9-99995. All packages are freely available under the GPL from http://cran.r-project.org/. Data: The contingency table Titanic is transformed to the data frame ttnc using the following code. R> data("titanic", package = "datasets") R> ttnc <- as.data.frame(titanic) R> ttnc <- ttnc[rep(:nrow(ttnc), ttnc$freq), :4] R> names(ttnc)[2] <- "Gender" R> ttnc <- transform(ttnc, Treatment = factor( + Gender == "Female" Age == "Child", + levels = c(false, TRUE), + labels = c("normal\n(male&adult)", "Preferential\n(Female Child)") + )) The last transformation also adds the treatment variable for the model-based recursive partitioning analysis.

References Hothorn T, Zeileis A (2). partykit: A Toolkit for Recursive Partytioning. R package vignette version.-2. URL http://cran.r-project.org/package=partykit Hothorn T, Hornik K, Zeileis A (26). Unbiased Recursive Partitioning: A Conditional Inference Framework. Journal of Computational and Graphical Statistics, 5(3), 65 674. doi:.98/6866x33933 Grubinger T, Zeileis A, Pfeiffer KP (2). evtree: Evolutionary Learning of Globally Optimal Classification and Regression Trees in R. Working Paper 2-2, Working Papers in Economics and Statistics, Research Platform Empirical and Experimental Economics, Universität Innsbruck. URL http://econpapers.repec.org/repec:inn:wpaper:2-2 Zeileis A, Hothorn T, Hornik K (28). Model-Based Recursive Partitioning. Journal of Computational and Graphical Statistics, 7(2), 492 54. doi:.98/6868x3933