Classification/Decision Trees (II)

Size: px

Start display at page:

Download "Classification/Decision Trees (II)"

Patrick Atkins
8 years ago
Views:

1 Classification/Decision Trees (II) Department of Statistics The Pennsylvania State University

2 Right Sized Trees Let the expected misclassification rate of a tree T be R (T ). Recall the resubstitution estimate for R (T ) is R(T ) = r(t)p(t) = R(t). t T t T R(T ) is biased downward. R(t) R(t L ) + R(t R ).

3 Digit recognition example No. Terminal Nodes R(T ) R ts (T ) The estimate R(T ) becomes increasingly less accurate as the trees grow larger. The estimate R ts decreases first when the tree becomes larger, hits minimum at the tree with 10 terminal nodes, and begins to increase when the tree further grows.

91 The estimate R(T ) becomes increasingly less accurate as the trees grow larger.

4 Preliminaries for Pruning Grow a very large tree T max. 1. Until all terminal nodes are pure (contain only one class) or contain only identical measurement vectors. 2. When the number of data in each terminal node is no greater than a certain threshold, say 5, or even As long as the tree is sufficiently large, the size of the initial tree is not critical.

5 1. Descendant: a node t is a descendant of node t if there is a connected path down the tree leading from t to t. 2. Ancestor: t is an ancestor of t if t is its descendant. 3. A branch T t of T with root node t T consists of the node t and all descendants of t in T. 4. Pruning a branch T t from a tree T consists of deleting from T all descendants of t, that is, cutting off all of T t except its root node. The tree pruned this way will be denoted by T T t. 5. If T is gotten from T by successively pruning off branches, then T is called a pruned subtree of T and denoted by T T.

A branch T t of T with root node t T consists of the node t and all descendants of t in T. 4.

6 Subtrees Even for a moderate sized T max, there is an enormously large number of subtrees and an even larger number ways to prune the initial tree to them. A selective pruning procedure is needed. The pruning is optimal in a certain sense. The search for different ways of pruning should be of manageable computational load.

A selective pruning procedure is needed.

7 Minimal Cost-Complexity Pruning Definition for the cost-complexity measure: For any subtree T T max, define its complexity as T, the number of terminal nodes in T. Let α 0 be a real number called the complexity parameter and define the cost-complexity measure R α (T ) as R α (T ) = R(T ) + α T.

8 For each value of α, find the subtree T (α) that minimizes R α (T ), i.e., R α (T (α)) = min R α (T ). T T max If α is small, the penalty for having a large number of terminal nodes is small and T (α) tends to be large. For α sufficiently large, the minimizing subtree T (α) will consist of the root node only.

nodes is small and T (α) tends to be large.

9 Since there are at most a finite number of subtrees of T max, R α (T (α)) yields different values for only finitely many α s. T (α) continues to be the minimizing tree when α increases until a jump point is reached. Two questions: Is there a unique subtree T Tmax which minimizes R α (T )? In the minimizing sequence of trees T 1, T 2,..., is each subtree obtained by pruning upward from the previous subtree, i.e., does the nesting T 1 T 2 {t 1 } hold?

Two questions: Is there a unique subtree T Tmax which minimizes R α (T )?

10 Definition: The smallest minimizing subtree T (α) for complexity parameter α is defined by the conditions: 1. R α (T (α)) = min T Tmax R α (T ) 2. If R α (T ) = R α (T (α)), then T (α) T. If subtree T (α) exists, it must be unique. It can be proved that for every value of α, there exists a smallest minimizing subtree.

If R α (T ) = R α (T (α)), then T (α) T.

11 The starting point for the pruning is not T max, but rather T 1 = T (0), which is the smallest subtree of T max satisfying R(T 1 ) = R(T max ). Let t L and t R be any two terminal nodes in T max descended from the same parent node t. If R(t) = R(t L ) + R(t R ), prune off t L and t R. Continue the process until no more pruning is possible. The resulting tree is T 1.

Let t L and t R be any two terminal nodes in T max descended from the same parent node t.

12 For T t any branch of T 1, define R(T t ) by R(T t ) = R(t ), t T t where T t is the set of terminal nodes of T t. For t any nonterminal node of T 1, R(t) > R(T t ).

13 Weakest-Link Cutting For any node t T 1, set R α ({t}) = R(t) + α. For any branch T t, define R α (T t ) = R(T t ) + α T t. When α = 0, R 0 (T t ) < R 0 ({t}). The inequality holds for sufficiently small α. But at some critical value of α, the two cost-complexities become equal. For α exceeding this threshold, the inequality is reversed. Solve the inequality R α (T t ) < R α ({t}) and get α < R(t) R(T t) T t 1. The right hand side is always positive.

The inequality holds for sufficiently small α.

14 Define a function g 1 (t), t T 1 by { R(t) R(Tt), t / T g 1 (t) = T 1 t 1 +, t T 1 Define the weakest link t 1 in T 1 as the node such that and put α 2 = g 1 ( t 1 ). g 1 ( t 1 ) = min t T 1 g 1 (t).

15 When α increases, t 1 is the first node that becomes more preferable than the branch T t 1 descended from it. α 2 is the first value after α 1 = 0 that yields a strict subtree of T 1 with a smaller cost-complexity at this complexity parameter. That is, for all α 1 α < α 2, the tree with smallest cost-complexity is T 1. Let T 2 = T 1 T t 1.

α 2 is the first value after α 1 = 0 that yields a strict subtree of T 1 with a

16 Repeat the previous steps. Use T 2 instead of T 1, find the weakest link in T 2 and prune off at the weakest link node. g 2 (t) = { R(t) R(T2t ) T, t T 2t 1 2, t / T 2 +, t T 2 g 2 ( t 2 ) = min t T 2 g 2 (t) α 3 = g 2 ( t 2 ) T 3 = T 2 T t 2 If at any stage, there are multiple weakest links, for instance, if g k ( t k ) = g k ( t k ), then define T k+1 = T k T t k T t k. Two branches are either nested or share no node.

g 2 (t) = { R(t) R(T2t ) T, t T 2t 1 2, t / T 2 +, t T 2 g 2 ( t 2 ) = min t T 2 g 2 (t) α 3 = g 2 ( t 2

17 A decreasing sequence of nested subtrees are obtained: T 1 T 2 T 3 {t 1 }. Theorem: The {α k } are an increasing sequence, that is, α k < α k+1, k 1, where α 1 = 0. For k 1, α k α < α k+1, T (α) = T (α k ) = T k.

18 At the initial steps of pruning, the algorithm tends to cut off large subbranches with many leaf nodes. With the tree becoming smaller, it tends to cut off fewer. Digit recognition example: Tree T 1 T 2 T 3 T 4 T 5 T 6 T 7 T 8 T 9 T 10 T 11 T 12 T 13 T k

With the tree becoming smaller, it tends to cut off fewer.

19 Best Pruned Subtree Two approaches to choose the best pruned subtree: Use a test sample set. Cross-validation Use a test set to compute the classification error rate of each minimum cost-complexity subtree. Choose the subtree with the minimum test error rate. Cross validation: tree structures are not stable. When the training data set changes slightly, there may be large structural change in the tree. It is difficult to correspond a subtree trained from the entire data set to a subtree trained from a majority part of it. Focus on choosing the right complexity parameter α.

Choose the subtree with the minimum test error rate. Cross validation: tree structures are not stable.

20 Pruning by Cross-Validation Consider V -fold cross-validation. The original learning sample L is divided by random selection into V subsets, L v, v = 1,..., V. Let the training sample set in each fold be L (v) = L L v. The tree grown on the original set is T max. V accessory trees T (v) max are grown on L (v).

L v, v = 1,..., V. Let the training sample set in each fold be L (v) = L L v.

21 For each value of the complexity parameter α, let T (α), T (v) (α), v = 1,..., V, be the corresponding minimal cost-complexity subtrees of T max, T max. (v) For each maximum tree, we obtain a sequence of jump points of α: α 1 < α 2 < α 3 < α k <. To find the corresponding minimal cost-complexity subtree at α, find α k from the list such that α k α < α k+1. Then the subtree corresponding to α k is the subtree for α.

22 The cross-validation error rate of T (α) is computed by R CV (T (α)) = 1 V V v=1 N (v) miss N (v), where N (v) is the number of samples in the test set L v in fold v; and N (v) miss is the number of misclassified samples in L v using T (v) (α), a pruned tree of T (v) max trained from L (v). Although α is continuous, there are only finite minimum cost-complexity trees grown on L.

23 Let T k = T (α k ). To compute the cross-validation error rate of T k, let α k = α k α k+1. Let R CV (T k ) = R CV (T (α k )). For the root node tree {t 1 }, R CV ({t 1 }) is set to the resubstitution cost R({t 1 }). Choose the subtree T k with minimum cross-validation error rate R CV (T k ).

24 Computation Involved 1. Grow V + 1 maximum trees. 2. For each of the V + 1 trees, find the sequence of subtrees with minimum cost-complexity. 3. Suppose the maximum tree grown on the original data set T max has K subtrees. 4. For each of the (K 1) α k, compute the misclassification rate of each of the V test sample set, average the error rates and set the mean to the cross-validation error rate. 5. Find the subtree of T max with minimum R CV (T k ).

Analysis of Algorithms I: Binary Search Trees

Analysis of Algorithms I: Binary Search Trees Xi Chen Columbia University Hash table: A data structure that maintains a subset of keys from a universe set U = {0, 1,..., p 1} and supports all three dictionary