Tree Algorithms in Data Mining: Comparison of rpart and RWeka... and Beyond


Tree Algorithms in Data Mining: Comparison of rpart and RWeka... and Beyond

Achim Zeileis
http://statmath.wu.ac.at/~zeileis/

Motivation

When publishing new tree algorithms, benchmarks against established methods are necessary. When developing the tools in party, we benchmarked against rpart, the open-source implementation of CART. Statistical journals were usually happy with that. The usual comment from machine learners: "You have to benchmark against C4.5, it's much better than CART!" Quinlan provided source code for C4.5, but not under a license that would allow its use. Weka had an open-source Java implementation, but it was hard to access from R. When we developed RWeka, we were finally able to set up benchmarks comparing CART and C4.5 within R.

Tree algorithms

CART/RPart (rpart): Classification and regression trees, Breiman, Friedman, Olshen, Stone (1984). Cross-validation-based cost-complexity pruning: RPart0 selects the tree with the best prediction error; RPart1 selects the highest complexity parameter within 1 standard error.

C4.5/J4.8 (RWeka): C4.5, Quinlan (1993). Tree size is determined by confidence threshold C and minimal leaf size M. J4.8: standard heuristics C = 0.25, M = 2. J4.8cv: cross-validation over C = 0.01, ..., 0.5 and M = 2, ..., 20.

QUEST (LohTools): Quick, unbiased and efficient statistical trees, Loh, Shih (1997). Popularized the concept of unbiased recursive partitioning in statistics. Hand-crafted convenience interface to the original binaries.

CTree (party): Conditional inference trees, Hothorn, Hornik, Zeileis (2006). Unbiased recursive partitioning based on permutation tests.
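As a rough sketch of how these learners are invoked from R (the data set and tuning values here are illustrative; package and function names are the real ones from rpart, RWeka, and party):

```r
## Sketch: fitting three of the compared tree learners on one data set.
library("rpart")    # CART
library("RWeka")    # J4.8, Weka's open-source C4.5 implementation
library("party")    # conditional inference trees
library("mlbench")

data("PimaIndiansDiabetes", package = "mlbench")

## CART via rpart; cost-complexity pruning is applied afterwards
## using the cross-validation results in the fitted object.
rp <- rpart(diabetes ~ ., data = PimaIndiansDiabetes)

## J4.8 with the standard heuristics C = 0.25, M = 2.
j48 <- J48(diabetes ~ ., data = PimaIndiansDiabetes,
           control = Weka_control(C = 0.25, M = 2))

## CTree: unbiased recursive partitioning via permutation tests.
ct <- ctree(diabetes ~ ., data = PimaIndiansDiabetes)
```

The J4.8cv variant would wrap the J48() call in a grid search over C and M, selecting the combination with the best cross-validated error.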

UCI data sets (mlbench)

Data set                # of obs.   # of cat. inputs   # of num. inputs
breast cancer              699            9                  -
chess                     3196           36                  -
circle                    1000            -                  2
credit                     690           24                  -
heart                      303            8                  5
hepatitis                  155           13                  6
house votes 84             435           16                  -
ionosphere                 351            1                 32
liver                      345            -                  6
Pima Indians diabetes      768            -                  8
promotergene               106           57                  -
ringnorm                  1000            -                 20
sonar                      208            -                 60
spirals                   1000            -                  2
threenorm                 1000            -                 20
tictactoe                  958            9                  -
titanic                   2201            3                  -
twonorm                   1000            -                 20
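Most of these data sets ship with the mlbench package; the simulated ones (circle, spirals, ringnorm, threenorm, twonorm) are drawn on demand. A minimal sketch of accessing both kinds:

```r
## Sketch: loading a UCI data set and drawing a simulated one via mlbench.
library("mlbench")

## Real data set: 768 observations, 8 numeric inputs plus the class.
data("PimaIndiansDiabetes", package = "mlbench")
dim(PimaIndiansDiabetes)

## Simulated data set: 1000 observations, 2 numeric inputs.
set.seed(1)
circ <- mlbench.circle(1000, d = 2)
table(circ$classes)
```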

Analysis

6 tree algorithms, 18 data sets, 500 bootstrap samples for each combination. Performance measure: out-of-bag misclassification rate. Complexity measure: number of splits + number of leaves. Individual results: simultaneous pairwise confidence intervals (Tukey all-pair comparisons). Aggregated results: Bradley-Terry model; alternatively, median linear consensus ranking, ...
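The core of the benchmark loop can be sketched as follows (shown for rpart only; the other learners are handled analogously, and the actual study used the benchmark infrastructure from Hothorn et al. 2005 rather than this hand-rolled loop):

```r
## Sketch: 500 bootstrap samples, out-of-bag misclassification rate.
library("rpart")
library("mlbench")
data("PimaIndiansDiabetes", package = "mlbench")
dat <- PimaIndiansDiabetes

B <- 500
err <- numeric(B)
set.seed(1)
for (b in seq_len(B)) {
  idx <- sample(nrow(dat), replace = TRUE)    # bootstrap sample
  oob <- setdiff(seq_len(nrow(dat)), idx)     # out-of-bag observations
  fit <- rpart(diabetes ~ ., data = dat[idx, ])
  pred <- predict(fit, newdata = dat[oob, ], type = "class")
  err[b] <- mean(pred != dat$diabetes[oob])   # OOB misclassification rate
}
mean(err)
```

The per-sample error differences between two learners (computed on the same bootstrap samples) are what feed the pairwise confidence intervals.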

Individual results: Pima Indians diabetes
[Figure: simultaneous pairwise confidence intervals (Tukey all-pair comparisons) for J4.8, J4.8cv, RPart0, RPart1, QUEST, CTree; x-axis: misclassification difference in percent.]

Individual results: Pima Indians diabetes
[Figure: simultaneous pairwise confidence intervals for the same algorithm pairs; x-axis: complexity difference.]

Individual results: Breast cancer
[Figure: simultaneous pairwise confidence intervals for the same algorithm pairs; x-axis: misclassification difference in percent.]

Individual results: Breast cancer
[Figure: simultaneous pairwise confidence intervals for the same algorithm pairs; x-axis: complexity difference.]

Aggregated results: Misclassification
[Figure: Bradley-Terry worth parameters (0.0 to 0.6) for J4.8, J4.8cv, RPart0, RPart1, QUEST, CTree.]

Aggregated results: Complexity
[Figure: Bradley-Terry worth parameters (0.0 to 0.6) for J4.8, J4.8cv, RPart0, RPart1, QUEST, CTree.]

Summary

No clear preference between CART/RPart and C4.5/J4.8. Other tree algorithms perform similarly well. Cross-validated trees perform better than their default-tuned counterparts. The 1-standard-error rule does not seem to be supported. And now for something different: Before: pairwise comparisons of tree algorithms. Now: a tree algorithm for pairwise comparison data.

Model-based recursive partitioning

Generic algorithm:
1. Fit a parametric model for Y.
2. Assess stability of the model parameters over each splitting variable Zj.
3. Split the sample along the Zj with the strongest association; choose the breakpoint with the highest improvement of the model fit.
4. Repeat steps 1-3 recursively in the subsamples until there are no more significant instabilities.

Application: use Bradley-Terry models in step 1.
Implementation: psychotree on R-Forge.

Germany's Next Topmodel

Study at the Department of Psychology, Universität Tübingen. 192 subjects rated the attractiveness of the candidates in the 2nd season of Germany's Next Topmodel. 6 finalists: Barbara Meier, Anni Wendler, Hana Nitsche, Fiona Erdmann, Mandy Graff and Anja Platzer. Pairwise comparisons with forced choice. Subject covariates: gender, age, questions about interest in the show.
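A minimal sketch of fitting a Bradley-Terry tree with psychotree, assuming the paired-comparison data from this study are available as the package's Topmodel2007 example data set (variable names below are taken from that data set):

```r
## Sketch: Bradley-Terry tree over subject covariates via psychotree.
library("psychotree")
data("Topmodel2007", package = "psychotree")

## Step 1 of the generic algorithm fits a Bradley-Terry model to the
## paired comparisons in `preference`; steps 2-4 recursively partition
## the subjects over gender, age, and the interest questions q1-q3.
bt <- bttree(preference ~ gender + age + q1 + q2 + q3,
             data = Topmodel2007)

## Visualize the resulting tree with worth parameters in the leaves.
plot(bt)
```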

Germany's Next Topmodel

Germany's Next Topmodel
[Figure: fitted Bradley-Terry tree. Splits on age (p < 0.001, split at 52), q2 (p = 0.017, yes vs. no), and gender (p = 0.007, male vs. female); four terminal nodes (n = 35, 71, 56, 30), each showing the estimated worth parameters (0 to 0.5) for the six candidates B, Ann, H, F, M, Anj.]

References

Hothorn T, Leisch F, Zeileis A, Hornik K (2005). "The Design and Analysis of Benchmark Experiments." Journal of Computational and Graphical Statistics, 14(3), 675-699. doi:10.1198/106186005X59630

Schauerhuber M, Zeileis A, Meyer D (2008). "Benchmarking Open-Source Tree Learners in R/RWeka." In C Preisach, H Burkhardt, L Schmidt-Thieme, R Decker (eds.), Data Analysis, Machine Learning and Applications (Proceedings of the 31st Annual Conference of the Gesellschaft für Klassifikation e.V., Albert-Ludwigs-Universität Freiburg, March 7-9, 2007), pp. 389-396.

Hornik K, Buchta C, Zeileis A (2009). "Open-Source Machine Learning: R Meets Weka." Computational Statistics, 24(2), 225-232. doi:10.1007/s00180-008-0119-7

Strobl C, Wickelmaier F, Zeileis A (2009). "Accounting for Individual Differences in Bradley-Terry Models by Means of Recursive Partitioning." Technical Report 54, Department of Statistics, Ludwig-Maximilians-Universität München. URL http://epub.ub.uni-muenchen.de/10588/