Knowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes

Knowledge Discovery and Data Mining
Lecture 19 - Bagging
Tom Kelsey
School of Computer Science, University of St Andrews

Bootstrap review

- Resample with replacement.
- Mimics the way that the data was selected from the population.
- Not suitable for:
  - small data sets
  - dirty data: outliers add too much variability
  - dependence structures (time series, spatial problems, etc.): the bootstrap assumes independence

Bagging: Important Concepts

1. The general process behind bagging of a predictive model. You should be able to describe the process algorithmically.
2. How the predictions are produced for both quantitative and categorical response type models.
3. The benefits and shortcomings of bagging, in particular where you'd expect a predictive process to benefit from bagging.
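The resampling step from the bootstrap review can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `bootstrap_sample` and the toy data are assumptions made for the example, not part of the lecture.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]

rng = random.Random(0)          # fixed seed so the run is repeatable
data = list(range(10))          # toy "training set" of 10 observations
sample = bootstrap_sample(data, rng)

print(sample)                   # same size as data, usually with repeats
print(len(set(sample)))         # number of distinct observations drawn
```

Because each draw is independent of the others, this sketch carries the same independence assumption noted above, so it would not be appropriate as written for time series or spatial data.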

Tom Kelsey ID5059-19-B & B 3 April 2015

History

- Bagging stands for Bootstrap AGGregatING.
- Proposed by Breiman (1996), one of the creators of CART.
- Essentially bootstrapping, an often-used approach in statistics.
- Benefits from being applied directly to classification/regression trees (plus the weight of the author and a very nice acronym).

Overview

A high-level description of the method is simple:

- create a number of bootstrapped datasets,
- fit the predictive model to each,
- combine the predictions of the multiple models.

Note we have two classes of predictive model in mind: a quantitative response (i.e. regression-type problems) or a categorical response (classification problems).

More specifically

Following the original notation of Breiman (1996):

- Take $L$ as some learning/training set of data consisting of $(y_i, x_i)$, $i = 1, \ldots, n$.
- By standard sampling with replacement, produce $M$ variants of the original data: $L_m$, $m = 1, 2, \ldots, M$.
- Fit a predictor to each of the $L_m$, giving $\hat{f}_m(X_{L_m})$.
- Combine this multitude of models.
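The three steps (resample, fit, combine) can be sketched as follows for a regression problem. This is a minimal sketch, not Breiman's implementation: the base learner `fit_stump` (a one-split regression stump on a single input), the helper names `bag` and `bagged_predict`, and the toy data are all assumptions chosen to keep the example self-contained.

```python
import random
from statistics import mean

def fit_stump(xs, ys):
    """Hypothetical base learner: a one-split regression stump on a 1-D input."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        ml, mr = mean(left), mean(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    if best is None:                      # all x identical: predict the mean
        m = mean(ys)
        return lambda x: m
    _, t, ml, mr = best
    return lambda x: ml if x <= t else mr

def bag(xs, ys, fit, M, rng):
    """Fit one model to each of M bootstrap variants L_m of the training set."""
    n = len(xs)
    models = []
    for _ in range(M):
        idx = [rng.randrange(n) for _ in range(n)]   # sample with replacement
        models.append(fit([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def bagged_predict(models, x):
    """Combine the M predictions by a simple mean (quantitative response)."""
    return mean(m(x) for m in models)

rng = random.Random(1)
xs = [float(i) for i in range(20)]
ys = [(0.0 if x < 10 else 1.0) + rng.gauss(0.0, 0.1) for x in xs]  # noisy step
models = bag(xs, ys, fit_stump, M=25, rng=rng)
low, high = bagged_predict(models, 2.0), bagged_predict(models, 17.0)
print(low, high)
```

Any other fitting routine with the same call shape could be passed as `fit`; the stump is used here only because it needs no external library.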

Combining bootstrapped predictions

- Quantitative response (i.e. a regression problem): for a given input vector $x$, pass this down through all $M$ models, giving $M$ predictions. A simple mean gives the bagged prediction.
- Categorical response (i.e. a classification problem): seek a vote across the collective classifiers. For a given input vector $x$, pass this down each of our $M$ classifiers to give a predicted class $\hat{j}$ in each case. The vote is the class with the greatest frequency of predictions.

Performance

- The resulting aggregate predictions will have lower variability than the individual predictions that might be generated from a single classifier/model.
- Simulations and analyses of real data found that the improvement in performance was marked.
- Across a large number of datasets, reductions in misclassification rates ranged from 6% to 77%, but were typically in the range of 20%-40%.

Algorithm: application to a classification tree

1. Select a bootstrap sample $L_B$ from our learning/training dataset $L$.
2. Create a tree $\phi_B(X)$ using $L_B$, with the dataset $L$ being the validation set used to prune $\phi_B(X)$.
3. Repeat this, say, 50 times to produce $\phi_1(X), \ldots, \phi_{50}(X)$.
4. Take some observation $i$ with covariate vector $x_i$: the estimated class $\hat{j}$ is that with the plurality (i.e. the most common) among $\phi_1(x_i), \ldots, \phi_{50}(x_i)$. In the case of a continuous response, take $\frac{1}{50} \sum_{b=1}^{50} \phi_b(x_i)$.
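The two combination rules used in step 4 (a mean for a continuous response, a plurality vote for a categorical one) can be sketched as below. The function names are hypothetical, and the inputs stand for the per-model predictions for one observation.

```python
from collections import Counter
from statistics import mean

def bagged_regression(predictions):
    """Bagged prediction for a quantitative response: the simple mean."""
    return mean(predictions)

def bagged_classification(predictions):
    """Bagged prediction for a categorical response: the plurality class.

    Ties are broken in favour of the class seen first, which is how
    Counter.most_common orders equal counts; this tie rule is an
    assumption of the sketch, not something the lecture specifies.
    """
    return Counter(predictions).most_common(1)[0][0]

print(bagged_regression([2.0, 3.0, 4.0]))
print(bagged_classification(["a", "b", "a"]))
```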

On this method

- The individual component models in a bagged ensemble are likely to have higher test-set error than a single model fitted to the full training set.
- When combined, the models can (and often do) produce test-set error lower than that of the single model.
- The diversity among bagging models generally compensates for the increase in error rate of any individual model.

On this method (continued)

- The process is very simple to apply, even without dedicated software.
- It requires only: a front-end that performs the bootstrap sampling of the data, a predictive modelling tool, and a back-end that gathers the results together.
- The modelling part can be done in parallel, so it is suitable for modern cloud, grid and HPC frameworks.

Pros and cons

- There is no particular gain for models that do not have some inherent instability (evidenced under perturbation of the training data).
- For example, application to a k-nearest neighbor classifier may show no improvement, whereas a classification tree might show great gains.
- Something like ordinary regression, where an automated model selection process proves quite variable under perturbation, would similarly benefit.
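The remarks on diversity and instability can be made concrete with a standard variance calculation. As a sketch under assumptions not stated in the slides, suppose the $M$ component predictions $\hat{f}_1(x), \ldots, \hat{f}_M(x)$ each have variance $\sigma^2$ and share a common pairwise correlation $\rho$ (both symbols are introduced here for illustration). Then:

```latex
\operatorname{Var}\!\left(\frac{1}{M}\sum_{m=1}^{M}\hat{f}_m(x)\right)
  = \frac{1}{M^2}\left(M\sigma^2 + M(M-1)\rho\sigma^2\right)
  = \rho\sigma^2 + \frac{1-\rho}{M}\,\sigma^2 .
```

The second term shrinks as $M$ grows, while the first does not. Under these assumptions, averaging helps most when the component models have high variance and are only weakly correlated, which is in line with unstable learners such as trees gaining more than stable ones such as k-nearest neighbor.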

Pros and cons (continued)

- Low gains will be expected if the base model is already giving good performance, e.g. is close to the misclassification performance boundary.
- There is a loss of interpretability: we have predictions from multiple models, so the exact nature of the model and the contribution of the variables is complex.

On the method

- The method of bagging is a variant of ensemble models.
- More generally, this falls under the developing areas of model uncertainty and model averaging.

More detail on Bagging Trees

"Ensemble of trees with different initial conditions"

Algorithm:

1. Sample from the data with replacement (on average each sample contains about 63% of the distinct observations in the full data set).
2. Fit a full regression tree (no pruning!).
3. Calculate cross-validated stats on the out-of-bag samples, i.e. the observations not drawn in step 1.

Repeat steps 1-3 for some number of trees, generating cross-validated statistics as you go.
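The fraction of the data that ends up in a bootstrap sample in step 1 can be checked directly. An observation is missed by all $n$ draws with probability $(1 - 1/n)^n$, so the expected fraction of distinct observations drawn is $1 - (1 - 1/n)^n$, which tends to $1 - 1/e \approx 0.632$. The sketch below computes this and splits the indices into drawn and left-out sets; it assumes the cross-validated statistics are computed on the left-out observations, and the function names are hypothetical.

```python
import math
import random

def expected_unique_fraction(n):
    """Expected share of distinct observations in a bootstrap sample of size n."""
    return 1.0 - (1.0 - 1.0 / n) ** n

def out_of_bag_indices(n, rng):
    """Return (in_bag, oob): the drawn indices and the indices never drawn."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    chosen = set(in_bag)
    oob = [i for i in range(n) if i not in chosen]
    return in_bag, oob

rng = random.Random(0)
in_bag, oob = out_of_bag_indices(1000, rng)

print(expected_unique_fraction(1000))   # close to 1 - 1/e
print(1 - math.exp(-1))
print(len(set(in_bag)) / 1000)          # observed fraction for this one draw
```

Each tree can then be scored on its own `oob` indices, which gives a held-out estimate without setting aside a separate validation set.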

More detail on Bagging Trees (continued)

- The CV score as a function of the number of trees can decide how big the forest should be.
- New data: run down all trees and average their collective result (for classification, take the popular vote).
- Bagging is more robust to noise and outliers: the variance of single trees is reduced by their consensus over diverse subsets of the data.

Random Forests Algorithm

Identical to bagging, except:

- Each time a tree is fit, at each node, censor some of the predictor variables (i.e. consider only a random subset of the predictors as split candidates).

In addition to a good model we get information on variable importance and proximity of observations:

- We scramble each predictor relative to the observations and see if it matters.
- We can check how often pairs of observations fall into the same terminal nodes over the forest.
- We can get good estimates for missing data values.
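Two of the random forest ideas above, censoring predictors at a node and scrambling a predictor to see if it matters, can be sketched as follows. This is an illustration under assumptions: the names `candidate_features`, `mtry` and `permutation_importance` are hypothetical, and a fixed stand-in rule replaces the fitted forest so that the example runs without any external library.

```python
import random

def candidate_features(p, mtry, rng):
    """Pick mtry of the p predictor indices at random for one node split."""
    return sorted(rng.sample(range(p), mtry))

def error_rate(predict, X, y):
    """Share of rows where the prediction differs from the label."""
    return sum(predict(row) != label for row, label in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, col, rng):
    """Increase in error after scrambling column col relative to the rows."""
    shuffled = [row[col] for row in X]
    rng.shuffle(shuffled)
    X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
    return error_rate(predict, X_perm, y) - error_rate(predict, X, y)

rng = random.Random(0)
X = [[rng.random(), rng.random()] for _ in range(200)]
predict = lambda row: int(row[0] > 0.5)   # stand-in model: uses predictor 0 only
y = [predict(row) for row in X]           # labels generated by the same rule

imp0 = permutation_importance(predict, X, y, 0, rng)
imp1 = permutation_importance(predict, X, y, 1, rng)
print(imp0, imp1)
print(candidate_features(p=10, mtry=3, rng=rng))
```

Scrambling the predictor that the stand-in model ignores leaves the error unchanged, while scrambling the one it uses raises the error. In a real forest the same comparison would typically be made with the fitted ensemble on held-out observations.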

Random Forest Wins
http://www.sciencedirect.com/science/article/pii/s0925400512012671

Random Forest Results
http://www.biomedcentral.

Random Forest Loses

Summary

- Bagging is easy to understand and implement.
- It works for any type of model, usually GLM or CART.
- Bagging often gives great predictive accuracy, but not always.
- There is no single method that outperforms its competitors most of the time, so comparative studies and competitions are useful.
- We next consider two more modelling approaches: boosting and SVM.
