Nathalie Villa-Vialaneix, Year 2014/2015
M1 in Economics and Economics and Statistics
Applied Multivariate Analysis - Big data analytics

Worksheet 3 - Random Forest

This worksheet's aim is to learn how to train and use random forests with R. The method is implemented in the package randomForest, which has to be installed and loaded. The method is illustrated on examples taken from the UCI Machine Learning repository, which are available in the R package mlbench:

library(randomForest)
randomForest
Type rfnews() to see new features/changes/bug fixes.
library(mlbench)

Exercise 1 - Random forest: training, interpretation and tuning

This exercise illustrates the use of random forests to discriminate sonar signals bounced off a metal cylinder from those bounced off a roughly cylindrical rock. Data are contained in the dataset Sonar:

data(Sonar)

1. Using help(Sonar), read the description of the dataset. How many potential predictors are included in the dataset and what are their types? How many observations are included in the dataset?

Answer: The dataset contains 208 observations on 60 predictors, all of them numerical. The last variable of the dataset, Class, defines the class (M or R) of the observation, which is to be predicted.

2. Randomly split the data into a training and a test set of sizes 100 and 108, respectively.

Answer:
ind.test <- sample(1:208, 108, replace=FALSE)
sonar.train <- Sonar[-ind.test, ]
sonar.test <- Sonar[ind.test, ]

3. Use the function randomForest to train a random forest on the training set with default parameters (use the formula syntax described in help(randomForest) and the options keep.inbag=TRUE and importance=TRUE). What is the OOB error? Make a contingency table between OOB predictions and real classes.

Answer:
rf.sonar <- randomForest(Class~., data=sonar.train, keep.inbag=TRUE, importance=TRUE)
rf.sonar
Call:
 randomForest(formula = Class ~ ., data = sonar.train, keep.inbag = TRUE, importance = TRUE)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 7

        OOB estimate of error rate: 20%
Confusion matrix:
   M   R   class.error
M
R

4. Using the help, describe what is in predicted, err.rate, confusion, votes, oob.times, importance, ntree, mtry and inbag in the random forest object trained in the previous question.

Answer:

head(rf.sonar$predicted)
 M  R  R  R  R  R
Levels: M R

contains the OOB predictions of the random forest;

head(rf.sonar$err.rate)
      OOB   M   R
[1,]
[2,]
[3,]
[4,]
[5,]
[6,]

contains the evolution of the misclassification rate during the learning process, while the number of trees in the forest is increasing. The misclassification rate is given for each of the two classes and as an overall misclassification rate. It is estimated from OOB predictions;

rf.sonar$confusion
   M   R   class.error
M
R

contains the confusion matrix between OOB predictions and real classes;

head(rf.sonar$votes)
      M   R

contains the ratio of trees voting for each of the two classes for every observation (it is thus a matrix). The votes are based on the trees for which the observation is out of the bag;

head(rf.sonar$oob.times)
[1]

contains, for every observation, the number of trees for which the observation is out-of-bag;
head(rf.sonar$importance)
      M   R   MeanDecreaseAccuracy   MeanDecreaseGini
V1
V2
V3
V4
V5
V6

contains, for every variable, the importance of the variable, expressed as the mean decrease in accuracy in predicting each of the two classes, the overall mean decrease in accuracy, or another importance index, the mean decrease in the Gini index (not explained in the lesson);

rf.sonar$ntree; rf.sonar$mtry
[1] 500
[1] 7

respectively contain the number of trees in the forest and the number of variables used for each split (here equal to 7, the default floor(sqrt(60)) for classification);

head(rf.sonar$inbag[,1:5])
      [,1]  [,2]  [,3]  [,4]  [,5]
[1,]
[2,]
[3,]
[4,]
[5,]
[6,]

is a matrix which contains 1 each time an observation (row) is included in the bag of a tree (column).

5. What is the OOB prediction for the 50th observation in your training sample? What is the true class for this observation? How many times has it been included in a tree and in which trees has it been included? How many trees voted for the prediction R for this observation?

Answer: The OOB prediction for the 50th observation in the training sample is

rf.sonar$predicted[50]
 98
  R
Levels: M R

and the true class for this observation is

sonar.train$Class[50]
[1] M
Levels: M R

The observation has been included in

nb.oob.50 <- rf.sonar$oob.times[50]
rf.sonar$ntree - nb.oob.50
[1] 299

trees, which are the trees numbered:

which(rf.sonar$inbag[50, ]==1)
[1] [18] [35] [52] [69] [86] [103] [120] [137] [154] [171] [188] [205] [222] [239] [256] [273] [290]

For this observation, the number of (OOB) trees that have voted R is equal to

rf.sonar$votes[50, "R"] * nb.oob.50
[1] 132

(over 201 OOB trees for this observation).

6. Using the plot.randomForest function, make a plot that displays the evolution of the OOB misclassification rate when the number of trees in the forest increases. What is represented by the different colors? Add a legend and comment.

Answer:
plot(rf.sonar, main="Evolution of misclassification rates")
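For reference, essentially the same picture can be redrawn by hand from the err.rate matrix described above (a minimal sketch, assuming the rf.sonar object from the previous questions; plot.randomForest produces this plot internally, up to cosmetic details):

matplot(rf.sonar$err.rate, type="l", lty=1, col=c("black", "red", "green"),
        xlab="trees", ylab="misclassification rate",
        main="Evolution of misclassification rates")
# the column names of err.rate ("OOB", "M", "R") identify the three curves
legend("topright", col=c("black", "red", "green"), lty=1,
       legend=colnames(rf.sonar$err.rate))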
[Figure: "Evolution of misclassification rates" - error rate versus number of trees, three curves]

The three curves show the evolution of the overall misclassification rate, of the misclassification rate for the "R" class and of the misclassification rate for the "M" class, respectively.

tail(rf.sonar$err.rate)
        OOB   M   R
[495,]
[496,]
[497,]
[498,]
[499,]
[500,]

shows that the black line corresponds to the overall misclassification rate, the green one to the misclassification rate for "R" and the red one to the misclassification rate for "M". The legend can be added with:

plot(rf.sonar, main="Evolution of misclassification rates")
legend("topright", col=c("black", "red", "green"), lwd=2, lty=1,
       legend=c("overall", "class M", "class R"))
[Figure: "Evolution of misclassification rates" with legend: overall (black), class M (red), class R (green)]

The graphic shows that the three final misclassification rates are close, all equal to about 20%.

7. Make a barplot of the variable importances ranked in decreasing order. Comment.

Answer:
ordered.variables <- order(rf.sonar$importance[,3], decreasing=TRUE)
barplot(rf.sonar$importance[ordered.variables,3], col="orange", cex.names=0.7,
        main="Variable importance (decreasing order)", las=3)
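As an alternative to the manual barplot, the randomForest package also provides varImpPlot(), which sorts and displays the importance measures directly (a brief sketch, assuming rf.sonar from above; type=1 selects the mean decrease in accuracy and, by default, only the 30 highest-ranked variables are shown):

# dotchart of the permutation-based importance, sorted in decreasing order
varImpPlot(rf.sonar, type=1, main="Variable importance (mean decrease in accuracy)")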
[Figure: barplot "Variable importance (decreasing order)" over the 60 variables; the first (largest) bars are V11, V12, V49, V9, V27, ...]

Two variables have a particularly large importance in the accuracy of the forest (V11 and V12) and three others are also rather important (V49, V9 and V27).

8. With the function predict.randomForest, predict the output of the random forest on the test set and find the test misclassification rate. Comment.

Answer: The predictions of the random forest for the test set are found with:

pred.test <- predict(rf.sonar, sonar.test)

Hence, the misclassification rate for the test set is equal to:

sum(pred.test != sonar.test$Class)/length(pred.test)
[1]

which is smaller than the OOB misclassification rate and seems to show that the model does not overfit.

9. Use the package e1071 and the function tune.randomForest to find, with a 10-fold cross validation performed on the training set only, which pair of parameters is the best among ntree=c(500, 1000, 1500, 2000) and mtry=c(5, 7, 10, 15, 20). Set the option importance to TRUE to obtain the variable importances for the best model. Use the functions summary.tune and plot.tune to interpret the result.
library(e1071)

Answer: The tuning is performed with:

res.tune.rf <- tune.randomForest(sonar.train[,1:60], sonar.train[,61],
    mtry=c(5, 7, 10, 15, 20), ntree=c(500, 1000, 1500, 2000), importance=TRUE)

which can be analyzed with:

summary(res.tune.rf)

Parameter tuning of 'randomForest':

- sampling method: 10-fold cross validation

- best parameters:
 mtry ntree

- best performance:

- Detailed performance results:
   mtry ntree error dispersion

plot(res.tune.rf)
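The numbers behind summary() and plot() can also be read directly off the tune object (a minimal sketch; best.parameters, best.performance and performances are standard components of e1071 tune objects):

res.tune.rf$best.parameters    # mtry and ntree retained by the 10-fold CV
res.tune.rf$best.performance   # corresponding CV misclassification rate
head(res.tune.rf$performances) # full grid of (mtry, ntree, error, dispersion)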
[Figure: "Performance of randomForest" - CV error over the grid of mtry and ntree values]

The best parameters obtained by cross validation are a number of trees equal to 2000 and a number of variables used for every split equal to 15. The CV error is then a bit smaller (about 17%) than the OOB error obtained with the default parameters.

10. From the previous question, extract the best model (in the object obtained with the function tune.randomForest, it is in $best.model). Compare its OOB error and its evolution, its most important variables and its test error with the random forest obtained with default parameters. Comment.

Answer: The best model is obtained with

rf.sonar.2 <- res.tune.rf$best.model
rf.sonar.2

Call:
 best.randomForest(x = sonar.train[, 1:60], y = sonar.train[, 61], mtry = c(5, 7, 10, 15, 20
               Type of random forest: classification
                     Number of trees: 2000
No. of variables tried at each split: 15

        OOB estimate of error rate: 19%
Confusion matrix:
   M   R   class.error
M
R

The OOB error is thus equal to 19%, compared to 20% with the default parameters. The evolution of the OOB error is obtained with:

plot(rf.sonar.2)
legend("topright", col=c("black", "red", "green"), lwd=2, lty=1,
       legend=c("overall", "class M", "class R"))

[Figure: error evolution for rf.sonar.2, with legend: overall, class M, class R]

which gives results very similar to those of the random forest with default parameters. The important variables are given with the following barplot:

ordered.variables <- order(rf.sonar.2$importance[,3], decreasing=TRUE)
barplot(rf.sonar.2$importance[ordered.variables,3], col="orange", cex.names=0.7,
        main="Variable importance (decreasing order)", las=3)
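Before looking at the barplot, the agreement between the two forests' rankings can be checked textually (a small sketch, assuming rf.sonar and rf.sonar.2 from above; "MeanDecreaseAccuracy" is the importance column created when importance=TRUE):

# top 5 variables of each forest, by mean decrease in accuracy
top5 <- function(rf) head(rownames(rf$importance)[
  order(rf$importance[, "MeanDecreaseAccuracy"], decreasing=TRUE)], 5)
top5(rf.sonar)    # forest with default parameters
top5(rf.sonar.2)  # tuned forest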
[Figure: barplot "Variable importance (decreasing order)" for rf.sonar.2; the first (largest) bars are V11, V49, V12, V27, V9, ...]

The five most important variables are the same as the ones found in the previous forest (but not in the same order). Finally, the test error rate is obtained with:

pred.test.2 <- predict(rf.sonar.2, sonar.test)
sum(pred.test.2 != sonar.test$Class)/length(sonar.test$Class)
[1]

The test error rate is a bit larger than in the previous case, which is expected: the forest is based on trees that are more flexible (more variables used for the splits) and thus tends to overfit more than the previous model.

Exercise 2 - Comparison between random forest and CART bagging

This exercise's aim is to compare random forests and bagging of CART trees from an accuracy point of view. The notion of importance is also introduced for bagged CART. The comparison is illustrated on the dataset Vehicle

data(Vehicle)

(package mlbench), whose purpose is to classify a given silhouette as one of four types of vehicle, using a set of features extracted from the silhouette.

1. Using help(Vehicle), read the description of the dataset. How many potential predictors are included in the dataset and what are their types? How many observations are included in the dataset?

Answer: The dataset contains 846 observations on 18 predictors, all of them numerical. The last variable of the dataset, Class, defines the class (bus, opel, saab or van) of the observation, which is to be predicted.

2. Split the data set into a training and a test set, of respective sizes 400 and 446.

Answer: The data set is split into training and test sets with:

ind.test <- sample(1:nrow(Vehicle), 446, replace=FALSE)
vehicle.test <- Vehicle[ind.test, ]
vehicle.train <- Vehicle[-ind.test, ]

3. Using the function tune.randomForest, train a random forest with possible numbers of trees 500, 1000, 2000 and 3000 and possible numbers of variables selected at every split 4, 5, 6, 8, 10. Keep the variable importances and use the options xtest and ytest to obtain the test misclassification error directly. Analyze the results of the tuning process.

Warning: When using the options xtest and ytest, the forest is not kept unless keep.forest=TRUE is set, which means that it cannot then be used to make predictions on new data. To use it with the method tune, you must therefore set keep.forest=TRUE, otherwise the function returns an error while trying to compute the CV error.

Answer: The forest is tuned with the following command line:

vehicle.tune.rf <- tune.randomForest(vehicle.train[,1:18], vehicle.train[,19],
    xtest=vehicle.test[,1:18], ytest=vehicle.test[,19], mtry=c(4, 5, 6, 8, 10),
    ntree=c(500, 1000, 2000, 3000), importance=TRUE, keep.forest=TRUE)

The performances of the different pairs of parameters can be visualized with:

summary(vehicle.tune.rf)

Parameter tuning of 'randomForest':

- sampling method: 10-fold cross validation

- best parameters:
 mtry ntree

- best performance:

- Detailed performance results:
   mtry ntree error dispersion
plot(vehicle.tune.rf)

[Figure: "Performance of randomForest" - CV error over the grid of mtry and ntree values (ntree up to 3000)]

The CV misclassification errors range from ... to ..., with a minimum obtained for 1000 trees trained with 4 variables at each split.

4. Analyze the best forest obtained with the tuning process in the previous question in terms of out-of-bag error, test error and their evolutions. What are the most important variables in this model?

Answer: The best model obtained by the tuning process is extracted with:

vehicle.rf <- vehicle.tune.rf$best.model
vehicle.rf

Call:
 best.randomForest(x = vehicle.train[, 1:18], y = vehicle.train[, 19], mtry = c(4, 5, 6, 8, 1
               Type of random forest: classification
                     Number of trees: 1000
No. of variables tried at each split: 4

        OOB estimate of error rate: 28.5%
Confusion matrix:
      bus   opel   saab   van   class.error
bus
opel
saab
van

                Test set error rate: 25.56%
Confusion matrix:
      bus   opel   saab   van   class.error
bus
opel
saab
van

which shows that the OOB misclassification rate is about 28.5%, mainly due to a difficulty in discriminating between the classes opel and saab. The test misclassification error is slightly smaller (about 25.56%), with the same difficulty in discriminating opel and saab observations. The evolution of the misclassification errors is displayed using

plot(vehicle.rf, main="Error evolutions for the dataset 'Vehicle'")
legend(700, 0.45, col=1:5, lty=1, lwd=2, cex=0.8,
       legend=c("overall", levels(Vehicle$Class)))
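Because xtest and ytest were supplied during tuning, the test-set results are also stored inside the fitted forest; a small sketch of where to find them (the $test component is documented in help(randomForest), and vehicle.rf is the object defined above):

dim(vehicle.rf$test$err.rate)      # one row per tree: overall test error and per-class errors
tail(vehicle.rf$test$err.rate, 1)  # final test error rates
vehicle.rf$test$confusion          # test-set confusion matrix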
[Figure: "Error evolutions for the dataset 'Vehicle'", with legend: overall, bus, opel, saab, van]

This figure shows that the misclassification rate is stabilized (or even slightly increases) at the end of the training process (when the number of trees in the forest exceeds approximately 500). The two classes opel and saab exhibit the worst misclassification rates, as already explained. Finally, the most important variables are obtained with:

ordered.variables <- order(vehicle.rf$importance[,5], decreasing=TRUE)
barplot(vehicle.rf$importance[ordered.variables,5], col="orange", cex.names=0.7,
        main="Variable importance (decreasing order)", las=3)
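Since the bar labels can be hard to read with 18 variables, the same ranking can also be printed textually (a sketch using the vehicle.rf object above; column 5 of $importance is labelled "MeanDecreaseAccuracy"):

head(sort(vehicle.rf$importance[, "MeanDecreaseAccuracy"], decreasing=TRUE), 6)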
[Figure: barplot "Variable importance (decreasing order)"; variables ranked: Max.L.Ra, Sc.Var.maxis, Elong, Sc.Var.Maxis, D.Circ, Scat.Ra, Max.L.Rect, Holl.Ra, Comp, Pr.Axis.Rect, Skew.Maxis, Kurt.Maxis, Circ, Rad.Ra, Pr.Axis.Ra, Ra.Gyr, Skew.maxis, Kurt.maxis]

One variable (Max.L.Ra, the max.length aspect ratio) is more important than the others, and another set of 5 variables (Sc.Var.maxis, Elong, Sc.Var.Maxis, D.Circ and Scat.Ra) also has a large importance compared to the rest.

5. This question aims at building a simple bagging of trees to compare with the previous random forest. The packages boot and rpart will be used:

library(boot)
library(rpart)

To do so, answer the following questions:

(a) Define a function boot.vehicle <- function(x, d) { ... } that returns 3 values: the OOB misclassification rate for the classification tree trained with the subsample x[d, ]; the test misclassification rate for the same classification tree; the misclassification rates obtained with the same classification tree on OOB observations when one of the variables has been permuted (it is thus a vector of size 18, one value for each variable).
Answer:

boot.vehicle <- function(x, d) {
  inbag <- x[d, ]
  oobag <- x[-d, ]
  cur.tree <- rpart(Class~., data=inbag, method="class")
  # OOB error
  cur.oob <- predict(cur.tree, oobag)
  cur.oob <- sum(oobag$Class != colnames(cur.oob)[apply(cur.oob, 1, which.max)]) /
    nrow(cur.oob)
  # test error
  cur.test <- predict(cur.tree, vehicle.test)
  cur.test <- sum(vehicle.test$Class != colnames(cur.test)[apply(cur.test, 1, which.max)]) /
    nrow(cur.test)
  # OOB shuffled error
  shuffled.oob <- sapply(1:(ncol(oobag)-1), function(ind) {
    new.oob <- oobag
    new.oob[,ind] <- oobag[sample(1:nrow(oobag), nrow(oobag), replace=FALSE), ind]
    cur.oob <- predict(cur.tree, new.oob)
    cur.oob <- sum(oobag$Class != colnames(cur.oob)[apply(cur.oob, 1, which.max)]) /
      nrow(cur.oob)
    return(cur.oob)
  })
  return(c(cur.oob, cur.test, shuffled.oob))
}

(b) Use the function defined in the previous question with boot to obtain bootstrap estimates of the OOB misclassification rate (computed only from the OOB observations in the training data set) and of the test misclassification rate for a bagging of R trees, with R being the optimal number of trees found in question 3.

Answer: Results of the bagging approach are obtained from:

res.bagging <- boot(vehicle.train, boot.vehicle, R=vehicle.tune.rf$best.parameters$ntree)

which gives the following OOB misclassification rate:

mean(res.bagging$t[,1])
[1]

and the following test misclassification rate:

mean(res.bagging$t[,2])
[1]

(c) Use the result of the previous function to compute and represent the variable importances.

Answer: A variable's importance is computed as the mean, over the bootstrap samples, of the difference between the OOB misclassification rate obtained when that variable is permuted and the original OOB misclassification rate (i.e., the mean decrease in accuracy caused by permuting the variable).

difference.acc <- sweep(res.bagging$t[,3:20], 1, res.bagging$t[,1], "-")
cart.importance <- apply(difference.acc, 2, mean)
names(cart.importance) <- colnames(vehicle.test)[1:18]
ordered.variables <- order(cart.importance, decreasing=TRUE)
barplot(cart.importance[ordered.variables], col="orange", cex.names=0.7,
        main="Variable importance (decreasing order)", las=3)
[Figure: barplot "Variable importance (decreasing order)" for the bagging of trees; variables ranked: Max.L.Ra, Sc.Var.Maxis, Elong, Sc.Var.maxis, D.Circ, Comp, Scat.Ra, Holl.Ra, Max.L.Rect, Skew.maxis, Skew.Maxis, Circ, Kurt.maxis, Rad.Ra, Kurt.Maxis, Pr.Axis.Ra, Pr.Axis.Rect, Ra.Gyr]

One variable (Max.L.Ra, the max.length aspect ratio) is more important than the others, and another set of 3 variables (Sc.Var.maxis, Elong and Sc.Var.Maxis) also has a large importance compared to the rest.

6. Analyze the results (OOB and test errors, important variables) of the bagging of classification trees by comparing them with the ones obtained from a random forest.

Answer: Misclassification rates (for both OOB and test sets) are smaller with random forests than with bagging of trees. Bagging of trees even shows a tendency to overfit the data (with a larger test error than OOB error). The variable importances are consistent between the two methods, with a larger set of important variables for the random forest than for the bagging of CART.
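To make this last comparison concrete, the two importance rankings could be placed side by side (a minimal sketch, assuming the vehicle.rf and cart.importance objects built above; "MeanDecreaseAccuracy" is the permutation-based importance column of the forest):

rf.imp <- vehicle.rf$importance[, "MeanDecreaseAccuracy"]
comparison <- data.frame(rf.rank = rank(-rf.imp),
                         bagging.rank = rank(-cart.importance[names(rf.imp)]))
comparison[order(comparison$rf.rank), ]   # variables ordered by their forest rank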