M1 in Economics and Economics and Statistics Applied multivariate Analysis - Big data analytics Worksheet 3 - Random Forest




Nathalie Villa-Vialaneix, academic year 2014/2015

This worksheet's aim is to learn how to train and use random forests with R. The method is implemented in the package randomForest, which has to be installed and loaded. The method is illustrated on examples taken from the UCI Machine Learning repository [1], which are available in the R package mlbench:

library(randomForest)
randomForest 4.6-10
Type rfNews() to see new features/changes/bug fixes.
library(mlbench)

Exercise 1. Random forests: training, interpretation and tuning

This exercise illustrates the use of random forests to discriminate sonar signals bounced off a metal cylinder from signals bounced off a roughly cylindrical rock. Data are contained in the dataset Sonar:

data(Sonar)

1. Using help(Sonar), read the description of the dataset. How many potential predictors are included in the dataset and what are their types? How many observations are included in the dataset?

Answer: The dataset contains 208 observations on 60 predictors, all of them numerical. The last variable of the dataset, Class, defines the class (M or R) of the observation, which is to be predicted.

2. Randomly split the data into a training and a test set of sizes 100 and 108, respectively.

Answer:
ind.test <- sample(1:208, 108, replace=FALSE)
sonar.train <- Sonar[-ind.test, ]
sonar.test <- Sonar[ind.test, ]

3. Use the function randomForest to train a random forest on the training set with default parameters (use the formula syntax described in help(randomForest) and the options keep.inbag=TRUE and importance=TRUE). What is the OOB error? Make a contingency table between OOB predictions and real classes.

Answer:
rf.sonar <- randomForest(Class~., data=sonar.train, keep.inbag=TRUE, importance=TRUE)
rf.sonar

[1] http://archive.ics.uci.edu/ml
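Note that both the train/test split and the forest construction are random: the exact figures shown in the outputs of this worksheet will vary from one run to the next. To make the results reproducible, a seed can be fixed before the split (a minimal sketch; the seed value is an arbitrary choice, not part of the original worksheet):

```r
library(randomForest)
library(mlbench)
data(Sonar)

# fixing the RNG seed makes the random split (and the forests below) reproducible
set.seed(42)                                 # arbitrary value, for illustration only
ind.test <- sample(1:208, 108, replace=FALSE)
sonar.train <- Sonar[-ind.test, ]
sonar.test <- Sonar[ind.test, ]
```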

Call:
 randomForest(formula = Class ~ ., data = sonar.train, keep.inbag = TRUE, importance = TRUE)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 7

        OOB estimate of error rate: 20%
Confusion matrix:
   M  R class.error
M 41 10   0.1960784
R 10 39   0.2040816

4. Using the help, describe what is in predicted, err.rate, confusion, votes, oob.times, importance, ntree, mtry and inbag in the random forest object trained in the previous question.

Answer:
head(rf.sonar$predicted)
1 3 4 5 6 7
M R R R R R
Levels: M R
contains the OOB predictions of the random forest;

head(rf.sonar$err.rate)
           OOB         M         R
[1,] 0.2777778 0.1666667 0.3888889
[2,] 0.3050847 0.3000000 0.3103448
[3,] 0.3513514 0.2702703 0.4324324
[4,] 0.3536585 0.3095238 0.4000000
[5,] 0.3563218 0.3181818 0.3953488
[6,] 0.3820225 0.3409091 0.4222222
contains the evolution of the misclassification rate during the learning process, as the number of trees in the forest increases. The misclassification rate is given for each of the two classes and as an overall misclassification rate; it is estimated from the OOB predictions;

rf.sonar$confusion
   M  R class.error
M 41 10   0.1960784
R 10 39   0.2040816
contains the confusion matrix between OOB predictions and real classes;

head(rf.sonar$votes)
          M         R
1 0.5191257 0.4808743
3 0.4303030 0.5696970
4 0.4663212 0.5336788
5 0.4754098 0.5245902
6 0.3212435 0.6787565
7 0.4358974 0.5641026
contains, for every observation, the fraction of trees voting for each of the two classes (it is thus a 100 × 2 matrix). The votes are based on the trees for which the observation is out-of-bag;

head(rf.sonar$oob.times)
[1] 183 165 193 183 193 195
contains, for every observation, the number of trees for which the observation is out-of-bag;

head(rf.sonar$importance)
               M             R MeanDecreaseAccuracy MeanDecreaseGini
V1 -0.0002710545  3.415679e-03         0.0015713513        0.6739892
V2 -0.0006232596  9.729683e-04         0.0001857454        0.4562138
V3 -0.0007119497 -4.074436e-04        -0.0006860018        0.4296529
V4  0.0013848425  2.725338e-03         0.0021017999        0.7724776
V5 -0.0001969799  2.529269e-04        -0.0001046821        0.4038784
V6  0.0001138834  3.569351e-05         0.0002106271        0.4415082
contains, for every variable, the importance of the variable, expressed as the mean decrease in accuracy when predicting each of the two classes, as the overall mean decrease in accuracy, and as another importance index, the mean decrease in Gini index (not explained in the lesson);

rf.sonar$ntree; rf.sonar$mtry
[1] 500
[1] 7
respectively contain the number of trees in the forest and the number of variables tried at each split (here equal to 7 ≈ √60, the default for classification being the rounded-down square root of the number of predictors);

head(rf.sonar$inbag[,1:5])
     [,1] [,2] [,3] [,4] [,5]
[1,]    0    0    1    1    0
[2,]    1    1    0    1    1
[3,]    1    1    1    0    1
[4,]    1    1    1    1    1
[5,]    0    1    1    0    0
[6,]    1    0    0    0    0
is a 100 × 500 matrix which contains 1 each time an observation (row) is included in the bag of a tree (column).

5. What is the OOB prediction for the 50th observation in your training sample? What is the true class for this observation? How many times has it been included in a tree and in which trees has it been included? How many trees voted for the prediction R for this observation?

Answer: The OOB prediction for the 50th observation in the training sample is
rf.sonar$predicted[50]
98
 R
Levels: M R
and the true class for this observation is
sonar.train$Class[50]
[1] M
Levels: M R
The observation has been included in
nb.oob.50 <- rf.sonar$oob.times[50]
rf.sonar$ntree - nb.oob.50
[1] 299
trees, which are the trees numbered:
which(rf.sonar$inbag[50, ]==1)

  [1]   3   4   5   8  10  11  12  14  15  18  21  22  23  24  27  28  30
 [18]  31  32  34  38  39  42  43  44  48  50  54  55  56  57  58  59  60
 [35]  61  64  65  66  67  69  73  74  75  76  77  80  81  82  84  85  87
 [52]  88  89  90  91  92  94  95  97  98 100 101 106 107 108 111 112 113
 [69] 114 115 116 117 118 119 123 125 126 128 130 131 134 140 141 142 143
 [86] 148 149 150 151 152 153 154 155 158 160 161 163 164 165 166 167 168
[103] 170 171 172 173 174 176 180 181 185 187 188 189 190 191 194 195 199
[120] 201 203 204 205 206 207 209 213 214 216 223 232 233 235 237 239 240
[137] 245 246 247 251 252 253 254 255 259 260 265 267 268 269 270 271 272
[154] 275 276 278 279 280 282 284 285 287 289 290 293 294 295 296 298 300
[171] 303 304 305 306 309 311 312 314 316 320 321 322 324 325 326 331 332
[188] 334 337 338 339 340 341 342 343 344 347 348 349 351 352 353 354 355
[205] 356 357 359 363 364 365 367 369 370 371 372 373 374 375 378 379 380
[222] 381 382 383 384 386 387 390 391 392 393 395 397 398 399 400 401 402
[239] 403 406 407 408 409 410 411 412 416 418 420 423 425 426 427 428 429
[256] 430 431 432 433 434 435 436 437 439 440 441 443 444 447 448 449 450
[273] 453 454 455 456 457 459 460 462 463 467 468 470 472 474 475 476 477
[290] 479 483 485 486 488 489 491 495 498 499
For this observation, the number of (OOB) trees that have voted R is equal to
rf.sonar$votes[50,"R"] * nb.oob.50
[1] 132
(over the 201 OOB trees for this observation).

6. Using the plot.randomForest function, make a plot that displays the evolution of the OOB misclassification rate when the number of trees in the forest increases. What is represented by the different colors? Add a legend and comment.

Answer:
plot(rf.sonar, main="Evolution of misclassification rates")

[Figure: "Evolution of misclassification rates" — three error curves versus the number of trees (0 to 500).]

The three curves show the evolution of the overall misclassification rate and of the misclassification rates of the "R" and "M" classes.
tail(rf.sonar$err.rate)
       OOB         M         R
[495,] 0.2 0.1960784 0.2040816
[496,] 0.2 0.1960784 0.2040816
[497,] 0.2 0.1960784 0.2040816
[498,] 0.2 0.1960784 0.2040816
[499,] 0.2 0.1960784 0.2040816
[500,] 0.2 0.1960784 0.2040816
shows that the black line corresponds to the overall misclassification rate, the green one to the misclassification rate for "R" and the red one to the misclassification rate for "M". The legend can be added with:
plot(rf.sonar, main="Evolution of misclassification rates")
legend("topright", col=c("black", "red", "green"), lwd=2, lty=1, legend=c("overall", "class M", "class R"))

[Figure: "Evolution of misclassification rates" with the legend (overall, class M, class R).]

The graphic shows that the three final misclassification rates are close, all equal to about 20%.

7. Make a barplot of the variable importances ranked in decreasing order. Comment.

Answer:
ordered.variables <- order(rf.sonar$importance[,3], decreasing=TRUE)
barplot(rf.sonar$importance[ordered.variables,3], col="orange", cex.names=0.7, main="Variable importance (decreasing order)", las=3)

[Figure: barplot "Variable importance (decreasing order)" — V11 and V12 stand out, followed by V49, V9 and V27.]

Two variables have a particularly large importance in the accuracy of the forest (V11 and V12) and three others are also rather important (V49, V9 and V27).

8. With the function predict.randomForest, predict the output of the random forest on the test set and find the test misclassification rate. Comment.

Answer: The outputs of the random forest for the test set are found with:
pred.test <- predict(rf.sonar, sonar.test)
Hence, the misclassification rate for the test set is equal to:
sum(pred.test != sonar.test$Class)/length(pred.test)
[1] 0.1296296
which is smaller than the OOB misclassification rate and seems to show that the model does not overfit.

9. Use the package e1071 and the function tune.randomForest to find, with a 10-fold cross validation performed on the training set only, which pair of parameters is the best among ntree=c(500, 1000, 1500, 2000) and mtry=c(5, 7, 10, 15, 20). Set the option importance to TRUE to obtain the variable importances for the best model. Use the functions summary.tune and plot.tune to interpret the result.
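As a side remark, the 10-fold cross validation mentioned in the question is simply the default resampling scheme of the tune wrappers in e1071; it can be changed through the tunecontrol argument. A hedged sketch (not needed for this worksheet, and the bootstrap settings shown are illustrative only):

```r
library(e1071)
# default assessment is 10-fold CV; e.g. switch to bootstrap resampling instead
ctrl <- tune.control(sampling="bootstrap", nboot=20)
# res.tune.rf <- tune.randomForest(sonar.train[,1:60], sonar.train[,61],
#                                  mtry=c(5, 7), ntree=c(500, 1000),
#                                  tunecontrol=ctrl)
```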

library(e1071)

Answer: The tuning is performed with:
res.tune.rf <- tune.randomForest(sonar.train[,1:60], sonar.train[,61], mtry=c(5, 7, 10, 15, 20), ntree=c(500, 1000, 1500, 2000), importance=TRUE)
which can be analyzed with:
summary(res.tune.rf)

Parameter tuning of 'randomForest':

- sampling method: 10-fold cross validation

- best parameters:
 mtry ntree
   15  2000

- best performance: 0.17

- Detailed performance results:
   mtry ntree error dispersion
1     5   500  0.21  0.1449138
2     7   500  0.21  0.1449138
3    10   500  0.20  0.1247219
4    15   500  0.20  0.1054093
5    20   500  0.21  0.1286684
6     5  1000  0.23  0.1418136
7     7  1000  0.22  0.1398412
8    10  1000  0.20  0.1333333
9    15  1000  0.20  0.1333333
10   20  1000  0.19  0.1370320
11    5  1500  0.21  0.1197219
12    7  1500  0.21  0.1449138
13   10  1500  0.20  0.1333333
14   15  1500  0.18  0.1229273
15   20  1500  0.18  0.1229273
16    5  2000  0.21  0.1449138
17    7  2000  0.22  0.1398412
18   10  2000  0.21  0.1523884
19   15  2000  0.17  0.1159502
20   20  2000  0.21  0.1370320

plot(res.tune.rf)

[Figure: "Performance of 'randomForest'" — CV error as a function of mtry (x-axis) and ntree (y-axis); the lowest error (0.17) is reached at mtry=15, ntree=2000.]

The best parameters obtained by cross validation are a number of trees equal to 2000 and a number of variables tried at every split equal to 15. The CV error is then a bit smaller (about 17%) than the OOB error obtained with the default parameters.

10. From the previous question, extract the best model (in the object obtained with the function tune.randomForest, it is in $best.model). Compare its OOB error and its evolution, its most important variables and its test error with the random forest obtained with default parameters. Comment.

Answer: The best model is obtained with
rf.sonar.2 <- res.tune.rf$best.model
rf.sonar.2

Call:
 best.randomForest(x = sonar.train[, 1:60], y = sonar.train[, 61], mtry = c(5, 7, 10, 15, 20)
               Type of random forest: classification
                     Number of trees: 2000
No. of variables tried at each split: 15

        OOB estimate of error rate: 19%
Confusion matrix:
   M  R class.error
M 42  9   0.1764706

R 10 39   0.2040816

The OOB error is thus equal to 19%, compared to 20% with the default parameters. The evolution of the OOB error is obtained with:
plot(rf.sonar.2)
legend("topright", col=c("black", "red", "green"), lwd=2, lty=1, legend=c("overall", "class M", "class R"))

[Figure: "rf.sonar.2" — evolution of the three misclassification rates over 2000 trees, with the legend (overall, class M, class R).]

which gives results very similar to those of the random forest with the default parameters. The important variables are given with the following barplot:
ordered.variables <- order(rf.sonar.2$importance[,3], decreasing=TRUE)
barplot(rf.sonar.2$importance[ordered.variables,3], col="orange", cex.names=0.7, main="Variable importance (decreasing order)", las=3)

[Figure: barplot "Variable importance (decreasing order)" — V11, V49, V12, V27 and V9 stand out.]

The five most important variables are the same as the ones found with the previous forest (but not in the same order). Finally, the test error rate is obtained with:
pred.test.2 <- predict(rf.sonar.2, sonar.test)
sum(pred.test.2 != sonar.test$Class)/length(sonar.test$Class)
[1] 0.1388889
The test error rate is a bit larger than in the previous case, which is expected: the forest is based on trees that are more flexible (more variables tried at the splits) and thus tends to overfit more than the previous model.

Exercise 2. Comparison between random forest and bagging of CART trees

This exercise's aim is to compare random forests and bagging of CART trees from an accuracy point of view. The notion of importance is also introduced for bagged CART trees. The comparison is illustrated on the dataset Vehicle
data(Vehicle)
(package mlbench), whose purpose is to classify a given silhouette as one of four types of vehicle, using a set of features extracted from the silhouette.

1. Using

help(Vehicle), read the description of the dataset. How many potential predictors are included in the dataset and what are their types? How many observations are included in the dataset?

Answer: The dataset contains 846 observations on 18 predictors, all of them numerical. The last variable of the dataset, Class, defines the class (bus, opel, saab or van) of the observation, which is to be predicted.

2. Split the data set into a training and a test set, having respective sizes 400 and 446.

Answer: The data set is split into training and test sets with:
ind.test <- sample(1:nrow(Vehicle), 446, replace=FALSE)
vehicle.test <- Vehicle[ind.test, ]
vehicle.train <- Vehicle[-ind.test, ]

3. Using the function tune.randomForest, train a random forest with possible numbers of trees 500, 1000, 2000 and 3000 and possible numbers of variables tried at every split 4, 5, 6, 8 and 10. Keep the variable importances and use the options xtest and ytest to obtain the test misclassification error directly. Analyze the results of the tuning process.

Warning: When the options xtest and ytest are used, the forest is not kept unless keep.forest=TRUE is set, which means that it cannot then be used to make predictions on new data. To use these options with the method tune, you must thus set keep.forest=TRUE, otherwise the function returns an error while trying to compute the CV error.
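For reference, the effect of xtest/ytest can already be seen on a direct call to randomForest, which then evaluates the test error alongside the OOB error while the forest grows. A sketch, assuming the split from question 2 and an arbitrary ntree value (not part of the tuning grid):

```r
library(randomForest)
# with xtest/ytest, randomForest also reports a test confusion matrix;
# keep.forest=TRUE keeps the trees so that the forest can be reused afterwards
rf.direct <- randomForest(vehicle.train[,1:18], vehicle.train[,19],
                          xtest=vehicle.test[,1:18], ytest=vehicle.test[,19],
                          ntree=500, keep.forest=TRUE)
rf.direct$test$err.rate[500, "Test"]   # test error after 500 trees
```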
Answer: The forest is tuned with the following command line:
vehicle.tune.rf <- tune.randomForest(vehicle.train[,1:18], vehicle.train[,19], xtest=vehicle.test[,1:18], ytest=vehicle.test[,19], mtry=c(4, 5, 6, 8, 10), ntree=c(500, 1000, 2000, 3000), importance=TRUE, keep.forest=TRUE)
The performances of the different pairs of parameters can be visualized with:
summary(vehicle.tune.rf)

Parameter tuning of 'randomForest':

- sampling method: 10-fold cross validation

- best parameters:
 mtry ntree
    4  1000

- best performance: 0.2675

- Detailed performance results:
   mtry ntree  error dispersion
1     4   500 0.2850 0.08010410
2     5   500 0.2850 0.08266398
3     6   500 0.2850 0.09732534
4     8   500 0.2775 0.08696264
5    10   500 0.2850 0.08595865
6     4  1000 0.2675 0.08664262
7     5  1000 0.2700 0.07799573
8     6  1000 0.2875 0.08680278
9     8  1000 0.2775 0.07768347
10   10  1000 0.2750 0.08079466
11    4  2000 0.2775 0.06917169
12    5  2000 0.2725 0.08118053
13    6  2000 0.2725 0.08032054
14    8  2000 0.2825 0.09505116
15   10  2000 0.2775 0.08775756
16    4  3000 0.2825 0.08502451

17    5  3000 0.2750 0.08333333
18    6  3000 0.2775 0.08696264
19    8  3000 0.2800 0.08959787
20   10  3000 0.2850 0.08990736

plot(vehicle.tune.rf)

[Figure: "Performance of 'randomForest'" — CV error as a function of mtry (x-axis) and ntree (y-axis); the minimum (0.2675) is reached at mtry=4, ntree=1000.]

The CV misclassification errors range from 0.2675 to 0.2875, with a minimum obtained for 1000 trees trained with 4 variables tried at each split.

4. Analyze the best forest obtained with the tuning process in the previous question in terms of out-of-bag error, test error and their evolutions. What are the most important variables in this model?

Answer: The best model obtained by the tuning process is extracted with:
vehicle.rf <- vehicle.tune.rf$best.model
vehicle.rf

Call:
 best.randomForest(x = vehicle.train[, 1:18], y = vehicle.train[, 19], mtry = c(4, 5, 6, 8, 1
               Type of random forest: classification
                     Number of trees: 1000

No. of variables tried at each split: 4

        OOB estimate of error rate: 28.5%
Confusion matrix:
     bus opel saab van class.error
bus  105    0    0   2 0.018691589
opel   6   48   43   5 0.529411765
saab   4   42   31  11 0.647727273
van    1    0    0 102 0.009708738

                Test set error rate: 25.56%
Confusion matrix:
     bus opel saab van class.error
bus  111    0    0   0   0.0000000
opel   4   67   32   7   0.3909091
saab   8   49   64   8   0.5038760
van    0    3    3  90   0.0625000

which shows that the OOB misclassification rate is about 28.5%, mainly due to a difficulty in discriminating between the classes opel and saab. The test misclassification error is slightly smaller (about 25.56%), with the same difficulty in discriminating opel and saab observations. The evolution of the misclassification errors is displayed using:
plot(vehicle.rf, main="Error evolutions for the dataset 'Vehicle'")
legend(700, 0.45, col=1:5, lty=1, lwd=2, cex=0.8, legend=c("overall", levels(Vehicle$Class)))

[Figure: "Error evolutions for the dataset 'Vehicle'" — evolution of the overall and per-class (bus, opel, saab, van) misclassification rates over 1000 trees.]

This figure shows that the misclassification rate stabilizes (or even slightly increases) at the end of the training process (when the number of trees in the forest exceeds approximately 500). The two classes opel and saab exhibit the worst misclassification rates, as already explained. Finally, the most important variables are obtained with:
ordered.variables <- order(vehicle.rf$importance[,5], decreasing=TRUE)
barplot(vehicle.rf$importance[ordered.variables,5], col="orange", cex.names=0.7, main="Variable importance (decreasing order)", las=3)

[Figure: barplot "Variable importance (decreasing order)" — Max.L.Ra stands out, followed by Sc.Var.maxis, Elong, Sc.Var.Maxis, D.Circ and Scat.Ra.]

One variable (Max.L.Ra, max. length aspect ratio) is more important than the others, with another set of 5 variables (Sc.Var.maxis, Elong, Sc.Var.Maxis, D.Circ and Scat.Ra) having a large importance compared to the rest.

5. This question aims at building a simple bagging of trees to compare with the previous random forest. The packages boot and rpart
library(boot)
library(rpart)
will be used. To do so, answer the following questions:

(a) Define a function boot.vehicle <- function(x, d) {... } that returns 3 values: the OOB misclassification rate for the classification tree trained with the subsample x[d, ]; the test misclassification rate for the same classification tree; the misclassification rates obtained with the same classification tree on OOB observations when one of the variables has been permuted (it is thus a vector of size 18, one value for each variable).
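To see why the signature boot.vehicle(x, d) gives access to both samples: boot calls the statistic as statistic(data, d), where d is a vector of bootstrap indices (drawn with replacement), so x[d, ] is the in-bag sample and x[-d, ] contains the observations never drawn, i.e. the OOB sample. A minimal sketch of this mechanism, on a toy data frame (the function name frac.oob is illustrative only):

```r
library(boot)
# the statistic receives the data and a vector of bootstrap indices d
frac.oob <- function(x, d) nrow(x[-d, , drop=FALSE]) / nrow(x)  # fraction of OOB rows
res <- boot(data.frame(v=1:100), frac.oob, R=50)
mean(res$t)  # close to exp(-1), i.e. roughly 0.37, the expected OOB fraction
```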

Answer:
boot.vehicle <- function(x, d) {
  inbag <- x[d, ]
  oobag <- x[-d, ]
  cur.tree <- rpart(Class~., data=inbag, method="class")
  # OOB error
  cur.oob <- predict(cur.tree, oobag)
  cur.oob <- sum(oobag$Class != colnames(cur.oob)[apply(cur.oob, 1, which.max)]) / nrow(cur.oob)
  # test error
  cur.test <- predict(cur.tree, vehicle.test)
  cur.test <- sum(vehicle.test$Class != colnames(cur.test)[apply(cur.test, 1, which.max)]) / nrow(cur.test)
  # OOB error with one shuffled variable
  shuffled.oob <- sapply(1:(ncol(oobag)-1), function(ind) {
    new.oob <- oobag
    new.oob[,ind] <- oobag[sample(1:nrow(oobag), nrow(oobag), replace=FALSE), ind]
    cur.oob <- predict(cur.tree, new.oob)
    cur.oob <- sum(oobag$Class != colnames(cur.oob)[apply(cur.oob, 1, which.max)]) / nrow(cur.oob)
    return(cur.oob)
  })
  return(c(cur.oob, cur.test, shuffled.oob))
}

(b) Use the function defined in the previous question with boot to obtain bootstrap estimates of the OOB misclassification rate (computed only from the OOB observations in the training data set) and of the test misclassification rate for a bagging of R trees, with R being the optimal parameter found in question 3.

Answer: Results of the bagging approach are obtained from:
res.bagging <- boot(vehicle.train, boot.vehicle, R=vehicle.tune.rf$best.parameters$ntree)
which gives the following OOB misclassification rate:
mean(res.bagging$t[,1])
[1] 0.3419846
and the following test misclassification rate:
mean(res.bagging$t[,2])
[1] 0.3603901

(c) Use the result of the previous function to compute and represent the variable importances.

Answer: A variable's importance is the mean, over the bootstrap replicates, of the difference between the OOB misclassification rate obtained with that variable's values permuted and the original OOB misclassification rate.
difference.acc <- sweep(res.bagging$t[,3:20], 1, res.bagging$t[,1], "-")
cart.importance <- apply(difference.acc, 2, mean)
names(cart.importance) <- colnames(vehicle.test)[1:18]
ordered.variables <- order(cart.importance, decreasing=TRUE)
barplot(cart.importance[ordered.variables], col="orange", cex.names=0.7, main="Variable importance (decreasing order)", las=3)

[Figure: barplot "Variable importance (decreasing order)" — Max.L.Ra stands out, followed by Sc.Var.Maxis, Elong and Sc.Var.maxis.]

One variable (Max.L.Ra, max. length aspect ratio) is more important than the others, with another set of 3 variables (Sc.Var.Maxis, Elong and Sc.Var.maxis) having a large importance compared to the rest.

6. Analyze the results (OOB and test errors, important variables) of the bagging of classification trees by comparing them with the ones obtained from a random forest.

Answer: Misclassification rates (for both OOB and test sets) are smaller with random forests than with bagging of trees. Bagging of trees even shows a tendency to overfit the data (with a larger test error than OOB error). The variable importances are consistent between the two methods, with a larger set of important variables for the random forest than for the bagging of CART trees.
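Finally, note that a bagging of trees is the special case of a random forest in which all predictors are candidates at every split, so it can also be emulated directly with randomForest by setting mtry to the number of predictors. A sketch, assuming the training set from question 2; the resulting trees differ from the rpart trees used above, since randomForest grows unpruned trees:

```r
library(randomForest)
# bagging = random forest with mtry equal to the number of predictors (18 here)
bag.vehicle <- randomForest(Class~., data=vehicle.train, mtry=18, ntree=1000)
bag.vehicle$err.rate[1000, "OOB"]   # OOB error of the bagged (unpruned) trees
```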