Bayesian Classification and Regression Tree Analysis (CART)

Teresa, Department of Applied Mathematics and Statistics, Jack Baskin School of Engineering, UC Santa Cruz. March 11, 2010.

What is CART?

The general aim of classification and regression tree analysis: given a set of observations $y_i$ and associated variables $x_{ij}$, $i = 1:n$ and $j = 1:p$, find a way of using $x$ to partition the observations into homogeneously distributed groups, then use the group to predict $y$.

- Use binary trees to recursively split observations with yes/no questions about variables in $x$.
- Assume each end (terminal) node has a homogeneous distribution.
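As a minimal sketch of the partition-then-predict idea, the base R snippet below splits the built-in iris data with a single yes/no question and predicts petal length by the group mean. The threshold 0.8 and the helper name `predict_by_group` are illustrative assumptions, not part of the talk.

```r
# One binary split on Petal.Width, then predict Petal.Length by leaf mean.
# The threshold (0.8) is an arbitrary illustrative choice.
data(iris)
threshold <- 0.8
goes_left <- iris$Petal.Width <= threshold   # the yes/no question

# Mean response within each of the two terminal nodes
leaf_means <- tapply(iris$Petal.Length, goes_left, mean)

# Hypothetical helper: route new observations and return the leaf mean
predict_by_group <- function(petal_width) {
  as.numeric(leaf_means[as.character(petal_width <= threshold)])
}

predict_by_group(c(0.2, 1.8))
```

A full tree repeats this step recursively inside each group, which is what the rest of the talk formalizes.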

How do we do this?

- Seminal work by Breiman et al. [1] was surprisingly Bayesian, involving the elicitation of priors and risk/utility functions on misclassification.
- However, the actual tree generation methods were still very ad hoc.
- After this work was published, a large number of different ad hoc methods appeared, as well as attempts to combine them to produce better inferential strategies.
- These methods are largely deterministic in nature and produce one tree per method.

Going Bayesian: The Problem!

$p = ?$

[Image courtesy of Diesel-stock, Diesel-stock.deviantart.com.]

Notation

Notation follows that of Wu, Tjelmeland and West (WTW) [7].

- Observations $y_i$ and regressors $x_{ij}$, with $i \in I = \{1:n\}$ and $j \in 1:k$.
- We wish to predict $y \in \mathcal{Y}$ based on the associated $x \in \mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_k$.
- Nodes are denoted $u$, with the root node denoted as node 0 and each non-terminal node $u$ having child nodes $2u + 1$ (left) and $2u + 2$ (right).
- Trees are then defined as appropriate subsets of the set $N = \{0, 1, 2, \ldots\}$. Write the number of nodes of a tree $T$ as $m(T)$.
- Splitting: for each non-terminal node $u$, choose a predictor variable index $k_T(u)$ and a splitting threshold $\tau_T(u) \in \mathcal{X}_{k_T(u)}$. We then assign $y$ to the left child of $u$ if $x_{k_T(u)} \le \tau_T(u)$.
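The indexing scheme above can be sketched in base R as follows. This is only an illustration under stated assumptions: the function names, the example tree, and the two splitting rules (thresholds 0.6 and 1.5 on Petal.Width) are hypothetical choices, not taken from WTW.

```r
# Heap-style node indexing: root is 0, children of u are 2u+1 and 2u+2.
left_child  <- function(u) 2 * u + 1
right_child <- function(u) 2 * u + 2
parent      <- function(u) (u - 1) %/% 2        # valid for u >= 1
depth       <- function(u) floor(log2(u + 1))   # root has depth 0

# A tree is a set of node indices; this one has leaves 1, 5 and 6.
tree <- c(0, 1, 2, 5, 6)
is_leaf <- function(u, tree) !(left_child(u) %in% tree)

# Hypothetical split rules, keyed by internal node index:
# k = predictor name, tau = threshold.
rules <- list(`0` = list(k = "Petal.Width", tau = 0.6),
              `2` = list(k = "Petal.Width", tau = 1.5))

# Route one observation x (a named numeric vector) to its leaf.
find_leaf <- function(x, tree, rules) {
  u <- 0
  while (!is_leaf(u, tree)) {
    r <- rules[[as.character(u)]]
    u <- if (x[[r$k]] <= r$tau) left_child(u) else right_child(u)
  }
  u
}

find_leaf(c(Sepal.Length = 5.1, Petal.Width = 0.2), tree, rules)
```

Storing a tree as a set of integers makes parent, child and depth lookups pure arithmetic, which is convenient when proposals later add or remove nodes.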

[Figure: tree from the iris data, height = 4, with its log(p) value; splits on Petal.Width (at 1.5 and 0.6) and Sepal.Length (at 5.9, among others); leaves labelled with observation counts.]

Likelihood

Each terminal node (leaf) $u$ is viewed as a random sample from some distribution with density $\phi(\cdot \mid \theta_u)$, where $\theta_u$ depends only on the leaf. Usually $\phi$ is either multinomial (categorical outcomes) or normal (continuous outcomes).
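Written out, and additionally assuming that observations are conditionally independent given the tree and the leaf parameters, the likelihood factorizes over leaves. The symbols $L(T)$ (the set of terminal nodes of $T$) and $I_u \subseteq I$ (the indices of the observations routed to leaf $u$) are introduced here for illustration only:

```latex
p(y \mid x, T, \theta) \;=\; \prod_{u \in L(T)} \; \prod_{i \in I_u} \phi(y_i \mid \theta_u)
```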

Tree prior

Simplify by using a prior of the form $p(\theta, T) = p(\theta \mid T)\,p(T)$ and specify $p(T)$ implicitly by using a tree-generating process:

1. Begin by setting $T$ to be the trivial one-node tree.
2. Split a node with probability $p_{\text{split}}(u, T)$.
3. If a node splits, assign a splitting rule $\tau_T(u)$ according to some distribution $p(\tau_T(u) \mid u, T)$.

Update $T$ to reflect the new tree, and repeat steps 2 and 3.

Tree prior (cont.)

- Consider $p_{\text{split}}(u, T) = \alpha (1 + d_u)^{-\beta}$, with $\beta \ge 0$ and $0 \le \alpha \le 1$, where $d_u$ is the depth of node $u$.
- Consider finite splitting values. Suggestion: choose $k$ uniformly from the available predictors and then $\tau$ from the set of observed values if $x_k$ is quantitative, or from the available subsets if qualitative.
- For $\theta$, use i.i.d. normal-inverse-gamma priors for $\theta \mid T$ if constructing a regression tree, and Dirichlet priors if constructing a classification tree.
- Chipman, George and McCulloch (CGM) [2] suggest choosing hyperparameters based on fitting a greedy tree model.
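To get a feel for what this prior implies about tree size, the tree-generating process can be simulated directly. The base R sketch below draws only the tree shape (splitting rules are omitted). The function names, the values alpha = 0.95 and beta = 1, and the `max_depth` safety cap are assumptions made for illustration.

```r
# Draw tree sizes from the tree-generating prior with
# p_split(u, T) = alpha * (1 + d_u)^(-beta). Only the shape is simulated.
p_split <- function(d, alpha, beta) alpha * (1 + d)^(-beta)

# Returns the number of terminal nodes of one simulated tree.
# max_depth is a safety cap added for this sketch, not part of the prior.
sim_leaves <- function(alpha, beta, max_depth = 30) {
  frontier <- 0      # depths of nodes still waiting for a split decision
  n_leaves <- 0
  while (length(frontier) > 0) {
    d <- frontier[1]
    frontier <- frontier[-1]
    if (d < max_depth && runif(1) < p_split(d, alpha, beta)) {
      frontier <- c(frontier, d + 1, d + 1)   # two children, one level down
    } else {
      n_leaves <- n_leaves + 1                # node stays terminal
    }
  }
  n_leaves
}

set.seed(1)
sizes <- replicate(2000, sim_leaves(alpha = 0.95, beta = 1))
table(sizes)   # implied prior distribution on the number of leaves
```

Larger values of beta make deep nodes less likely to split, so the simulated size distribution concentrates on smaller trees.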

Fitting procedure

Proceed through MCMC. Interest focuses on the steps for sampling the tree structure. CGM use a Metropolis-Hastings step with a transition kernel choosing randomly among four steps:

- Grow: pick a terminal node and split it into two child nodes.
- Prune: pick a parent of two terminal nodes and collapse it.
- Change: pick an internal node and reassign its splitting rule.
- Swap: pick a parent-child pair and swap their splitting rules, unless the other child of the parent has the same splitting rule, in which case swap the rule of the parent with that of both children.

All steps are reversible, so the Markov chain is reversible.
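The structural part of the grow and prune proposals can be sketched in base R on a tree stored as a vector of node indices (root 0, children $2u + 1$ and $2u + 2$). This is a simplified illustration, not CGM's implementation: splitting rules, leaf parameters and the Metropolis-Hastings acceptance ratio are all omitted, and every function name is hypothetical.

```r
# Structural grow and prune proposals on a tree stored as node indices.
leaves <- function(tree) tree[!((2 * tree + 1) %in% tree)]

# Internal nodes whose two children are both leaves (prune candidates)
prunable <- function(tree) {
  lv <- leaves(tree)
  internal <- setdiff(tree, lv)
  internal[(2 * internal + 1) %in% lv & (2 * internal + 2) %in% lv]
}

# Pick one element uniformly; avoids sample()'s behaviour on length-1 input
pick <- function(v) v[sample.int(length(v), 1)]

grow <- function(tree) {
  u <- pick(leaves(tree))
  sort(c(tree, 2 * u + 1, 2 * u + 2))
}

prune <- function(tree) {
  cand <- prunable(tree)
  if (length(cand) == 0) return(tree)   # the one-node tree cannot be pruned
  u <- pick(cand)
  setdiff(tree, c(2 * u + 1, 2 * u + 2))
}

set.seed(42)
tree <- 0
tree <- grow(tree)    # root splits: nodes 0, 1, 2
tree <- grow(tree)    # one of the two leaves splits
tree <- prune(tree)   # the only prunable node collapses again
tree
```

Grow and prune undo each other, which is the pairing that makes the proposal reversible; change and swap leave the node set untouched and only alter the rules attached to it.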

Limitations

- Relatively slow mixing: tendency to stay in a local area.
- Tendency to get stuck in a local mode: CGM suggest repeated restarting, either from the trivial tree or from trees found by other methods such as bootstrap bumping.
- No single tree output; no good way of picking one good tree from the sample.

WTW propose two significant improvements to CGM's method:

- An improved prior on tree structure: the pinball prior.
- A new M-H move: the tree restructure move.

They also allow for an infinite set of splitting values, via a prior on the space of splitting values. A prior with finite point masses would duplicate that of CGM as a special case.

Pinball prior

Idea: generate some number of terminal nodes $m(T)$, then cascade these nodes down the tree, randomly splitting left/right with some probability until nodes define individual leaves.

- Specify a prior density for tree size, $m(T) \sim \alpha(m(T))$. A natural choice is Poisson: $m(T) = 1 + \mathrm{Pois}(\lambda)$ for some specified $\lambda$.
- Construct a prior density for splitting, $\beta(m_{l(u)}(T) \mid m_u(T))$, where $m_{l(u)}(T)$ is the number sent left from the $m_u(T)$ that have cascaded down to node $u$.
- There are a number of choices for $\beta$, e.g. uniform or binomial.
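A minimal base R sketch of drawing a tree shape from a pinball-style prior follows. It assumes the uniform choice for $\beta$ takes the form of a uniform draw on $1, \ldots, m_u - 1$ (so each child receives at least one leaf) and uses $\lambda = 3$ purely as an illustrative value. The function name `pinball` is hypothetical.

```r
# Pinball-prior sketch: draw the number of leaves, then cascade them down.
# Assumption: beta is uniform on 1..(m_u - 1), so each child receives at
# least one leaf; lambda = 3 is an arbitrary illustrative value.
pinball <- function(m, u = 0) {
  if (m == 1) return(u)               # a single leaf settles at node u
  m_left <- sample.int(m - 1, 1)      # number sent left, uniform on 1..(m-1)
  c(pinball(m_left, 2 * u + 1),       # cascade left
    pinball(m - m_left, 2 * u + 2))   # cascade right
}

set.seed(7)
lambda <- 3
m <- 1 + rpois(1, lambda)             # tree size: m(T) = 1 + Pois(lambda)
leaf_nodes <- pinball(m)
leaf_nodes                            # node indices of the m leaves
```

Unlike the depth-based split probability, this construction fixes the tree size first, so the prior on size and the prior on shape can be chosen separately.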

Tree restructure move

Idea: restructure the tree branches without changing the terminal categories.

- Begin at node 0.
- Recursively identify possible splitting rules that leave the terminal categories unchanged.
- Choose some splitting rule, and repeat until the terminal nodes are fully specified.

This move radically restructures the tree without affecting categorization and eliminates the tendency to get stuck near local maxima: more effective exploration of the posterior, hence better mixing and better posterior inference.

Iris data

We wish to use sepal length and petal width to predict petal length. Divide the data into two sets: 30 of each species for tree creation, 20 for evaluation.

> # Rows 1-50, 51-100 and 101-150 of iris hold the three species
> iris.subsample.index <- c(sample(1:50, 30), sample(51:100, 30),
+                           sample(101:150, 30))
> iris.train <- iris[iris.subsample.index,]
> iris.test <- iris[-iris.subsample.index,]

[Figure: Iris petal length; axes Sepal.Length and Petal.Width; training and testing points distinguished.]

Iris data (cont.)

Using bcart in the tgp package:

> library(tgp)
> # Columns 1 and 4 are Sepal.Length and Petal.Width; column 3 is Petal.Length
> bcart.iris <- bcart(X = iris.train[,c(1,4)], XX = iris.test[,c(1,4)],
+                     Z = iris.train[,3], trace = TRUE, R = 5,
+                     BTE = c(2000, 10000, 2))

[Figure: two panels, "z mean" and "z quantile diff (error)", over Sepal.Length and Petal.Width.]

[Figure: three trees for the iris data, with height = 3, 4 and 5 and a log(p) value for each; splits on Petal.Width (at 1.5 and 0.6) and Sepal.Length (at 5.9, 6.2 and 6.5, among others); leaves labelled with observation counts.]

[Figure: two panels, "Training data" and "Testing data", each plotting predicted petal length against observed petal length, by species (setosa, versicolor, virginica).]

Extensions and Future Work

- Implementation!
- Inference methods: tree averaging
- Beyond the Gaussian
  - Heavy-tailed distributions
  - Skew and count data
- Improved priors
- Improved sampling steps

References

[1] Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. Classification and Regression Trees. Wadsworth Statistics/Probability Series. Wadsworth International Group.
[2] Hugh A. Chipman, Edward I. George, and Robert E. McCulloch. Bayesian CART model search. Journal of the American Statistical Association, 93(443), September.
[3] Hugh A. Chipman, Edward I. George, and Robert E. McCulloch. Hierarchical priors for Bayesian CART shrinkage. Statistics and Computing, 10:17-24.
[4] Hugh A. Chipman, Edward I. George, and Robert E. McCulloch. Bayesian treed models. Machine Learning, 48.
[5] David G. T. Denison, Bani K. Mallick, and Adrian F. M. Smith. A Bayesian CART algorithm. Biometrika, 85(2), June.
[6] Wei-Yin Loh. Classification and regression tree methods. In Ruggeri, Kenett, and Faltin, editors, Encyclopedia of Statistics in Quality and Reliability. Wiley.
[7] Yuhong Wu, Håkon Tjelmeland, and Mike West. Bayesian CART: prior specification and posterior simulation. January 2006.
