Journal of Pattern Recognition Research (2009) 1-13. Received Jul 27, 2008. Accepted Feb 6, 2009.

A Genetic Algorithm for Constructing Compact Binary Decision Trees

Sung-Hyuk Cha, Charles Tappert
Computer Science Department, Pace University
861 Bedford Road, Pleasantville, New York, 10570, USA
scha@pace.edu  ctappert@pace.edu

Abstract
Tree-based classifiers are important in pattern recognition and have been well studied. Although the problem of finding an optimal decision tree has received attention, it is a hard optimization problem. Here we propose utilizing a genetic algorithm to improve on the finding of compact, near-optimal decision trees. We present a method to encode and decode a decision tree to and from a chromosome where genetic operators such as mutation and crossover can be applied. Theoretical properties of decision trees, encoded chromosomes, and fitness functions are presented.

Keywords: Binary Decision Tree, Genetic Algorithm.

1. Introduction

Decision trees have been well studied and widely used in knowledge discovery and decision support systems. Here we are concerned with decision trees for classification, where the leaves represent classifications and the branches represent feature-based splits that lead to the classifications. These trees approximate discrete-valued target functions and are a widely used practical method for inductive inference [1]. Decision trees have prospered in knowledge discovery and decision support systems because of their natural and intuitive paradigm of classifying a pattern through a sequence of questions. Algorithms for constructing decision trees, such as ID3 [1-3], often use heuristics that tend to find short trees. Finding the shortest decision tree, however, is a hard optimization problem [4, 5]. Attempts to construct short, near-optimal decision trees have been described (see the extensive survey [6]). Gehrke et al. developed a bootstrapped optimistic algorithm for decision tree construction [7].
For continuous attribute data, Zhao and Shirasaka [8] suggested an evolutionary design, Bennett and Blue [9] proposed an extreme point tabu search algorithm, and Pajunen and Girolami [10] exploited linear independent component analysis (ICA) to construct binary decision trees. Genetic algorithms (GAs) are an optimization technique based on natural evolution [1, 2, 11, 12]. GAs have been used to find near-optimal decision trees in two ways. On the one hand, they were used to select the attributes from which decision trees are constructed, in a hybrid or preprocessing manner [13-15]. On the other hand, they were applied directly to decision trees [16, 17]. A problem that arises with the latter approach is that an attribute may appear more than once in a path of the tree. In this paper, we describe an alternate method of constructing near-optimal binary decision trees, proposed succinctly in [18]. In order to utilize genetic algorithms, decision trees must be represented as chromosomes where genetic operators such as mutation and crossover can be applied. The main contribution of this paper is a new scheme to encode and decode a decision tree to and from a chromosome. The remainder of the paper is organized as follows. Section 2 reviews decision trees and defines a new function, denoted B_d, to describe the complexity. Section 3 presents the encoding/decoding

(c) 2009 JPRR. All rights reserved. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or to republish, requires a fee and/or special permission from JPRR.
of decision trees to/from chromosomes, which stems from the B_d function; genetic operators such as mutation and crossover; and fitness functions and their analysis. Finally, Section 4 concludes this work.

2. Preliminary

Let D be a set of labeled training data: a database of instances represented by attribute-value pairs, where each attribute can have a small number of possible disjoint values. Here we consider only binary attributes. Hence, D has n instances, where each instance x_i consists of d ordered binary attributes and a target value which is one of c states of nature in Ω. A sample database with n = 6, d = 4, c = 2, and Ω = {A, B}, listing the six instances x_1, ..., x_6 with their four binary attribute values and class labels, is used for illustration throughout the rest of this paper.

Algorithms that construct a decision tree take a set of training instances as input and output a learned discrete-valued target function in the form of a tree. A decision tree is a rooted tree T that consists of internal nodes representing attributes, leaf nodes representing labels, and edges representing the attributes' possible values. In binary decision trees, left branches have the value 0 and right branches the value 1, as shown in Fig. 1; for simplicity we omit the value labels in some later figures.

Fig. 1: Decision trees consistent and inconsistent with D: (a) T_1 consistent with D, (b) T_x inconsistent with D.

A decision tree represents a disjunction of conjunctions. In Fig. 1(a), for example, T_1 represents the A and B states as disjunctions of conjunctions of attribute tests, where each conjunction corresponds to a path from the root to a leaf. A decision tree based on a database with c classes defines a c-class classification problem. Decision trees classify instances by traversing from the root node to a leaf node: the classification process starts from the root node of the decision tree, tests the attribute specified at this node, and then moves down the tree branch according to the given attribute value. Fig. 1 shows two decision trees, T_1 and T_x.
The decision tree T_1 is said to be a consistent decision tree because it is consistent
Fig. 2: Two decision trees consistent with D: (a) T_2 with 11 nodes, by ID3, and (b) T_3 with 9 nodes, by GA.

with all instances in D. The decision tree T_x, however, is inconsistent with D because it misclassifies one of the instances in D. There are two important properties of a binary decision tree:

Property 1. The size of a decision tree with l leaves is 2l - 1 nodes.

Property 2. The lower and upper bounds of l for a consistent binary decision tree are c and n: c <= l <= n.

The number of leaves in a consistent decision tree must be at least c in the best case. In the worst case, the number of leaves equals the size of D, with each instance corresponding to a unique leaf, e.g., T_2 in Fig. 2.

2.1 Occam's Razor and ID3

Among the numerous decision trees that are consistent with the training database of instances D, Fig. 2 shows two. All instances X = {x_1, ..., x_6} are classified correctly by both decision trees T_2 and T_3. However, an unknown instance that is not in the training set may be classified differently by the two decision trees: T_2 classifies such an instance as one class whereas T_3 classifies it as the other. This inductive inference is a fundamental problem in machine learning. The minimum description length principle, formalized from Occam's Razor [19], is a very important concept in machine learning theory [1, 2]. Albeit controversial, many decision-tree building algorithms such as ID3 [3] prefer smaller (more compact, shorter depth, fewer nodes) trees, and thus such an instance is preferably classified by T_3 because T_3 is shorter than T_2. In other words, T_3 has a simpler description than T_2. The shorter the tree, the fewer the questions required to classify instances. Based on Occam's Razor, Ross Quinlan proposed a heuristic that tends to find smaller decision trees [3]. The algorithm is called ID3 (Iterative Dichotomizer 3) and it utilizes the entropy, a measure of the homogeneity of examples, as defined in equation (1):

Entropy(D) = - Σ_{ω ∈ {A,B}} P(ω) log₂ P(ω)    (1)
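The entropy of equation (1), and the information gain derived from it, can be sketched in a few lines of Python (the function names and list-based data layout are ours, not the paper's):

```python
from math import log2

def entropy(labels):
    """Entropy of a list of class labels, as in equation (1)."""
    if not labels:
        return 0.0
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def gain(instances, labels, attr):
    """Information gain of binary attribute `attr`, as in equation (2):
    D is split into D_L (attribute value 0) and D_R (attribute value 1)."""
    left = [y for x, y in zip(instances, labels) if x[attr] == 0]
    right = [y for x, y in zip(instances, labels) if x[attr] == 1]
    n = len(labels)
    return entropy(labels) - (len(left) / n * entropy(left)
                              + len(right) / n * entropy(right))
```

For a perfectly separating binary attribute the gain equals the full entropy of the database, and for an uninformative attribute it is zero.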
Fig. 3: Illustration of the ID3 algorithm.

Information gain, or simply gain, is defined in terms of entropy, where X is one of the attributes in D. When all attributes are binary, the gain can be defined as in equation (2):

gain(D, X) = Entropy(D) - ( (|D_L|/|D|) Entropy(D_L) + (|D_R|/|D|) Entropy(D_R) )    (2)

The ID3 algorithm first selects the attribute whose gain is maximal as the root node. For all subtrees of the root node, it iteratively finds the next attribute whose gain is maximal. Fig. 3 illustrates the ID3 algorithm. Starting with the root node, it evaluates all attributes in the database; the attribute with the highest gain is selected as the root node. It then partitions the database into two sub-databases, D_L and D_R, and for each sub-database it calculates the gains again. As a result, the decision tree T_2 is built. However, as is apparent from Fig. 2, the ID3 algorithm does not necessarily find the smallest decision tree.

2.2 Complexity and the B_d Function

To the extent that smaller trees are preferred, it becomes interesting to find a smallest decision tree. Finding a smallest decision tree, though, is an NP-complete problem [4, 5]. To comprehend the complexity of this problem, consider a full binary decision tree T^f (the superscript f denotes full), where each path from the root to a leaf contains all the attributes exactly once, as exemplified in Fig. 4. There are 2^d leaves, where d + 1 is the height of the full decision tree. In order to build a consistent full binary tree, one may choose any attribute as the root node; e.g., there are four choices in Fig. 4. In the second-level nodes, one can choose any attribute that is not the root node. For example, in Fig. 4 there are 4 · 3² · 2⁴ = 576 possible full binary trees. We denote the number of possible full binary trees with d attributes as B_d.
Definition 1. The number of possible full binary trees with d attributes is formally defined as
B_d = ∏_{i=1}^{d} i^{2^{d-i}}.    (3)

Fig. 4: A sample full binary decision tree structure T^f.

Table 1: Comparison of the B_d and d! functions.

d    B_d                  d!
1    1                    1
2    2                    2
3    12                   6
4    576                  24
5    1,658,880            120
6    16,511,297,126,400   720
7    ~1.9084e+27          5,040
8    ~2.9135e+55          40,320
9    ~7.6395e+111         362,880
10   ~5.836e+224          3,628,800

As illustrated in Table 1, as d grows the function B_d grows faster than all polynomial, exponential, and even factorial functions. The factorial function d! is the product of all positive integers less than or equal to d, and many combinatorial optimization problems have search spaces of complexity O(d!). The search space of full binary decision trees is much larger, i.e., B_d = Ω(d!); lower and upper bounds of B_d are Ω(2^d) and o(d^{2^d}). Note that real decision trees, such as T_1 in Fig. 1(a), can be sparse, because internal nodes can become leaves as soon as they are homogeneous. Hence, the search space for finding a shortest binary decision tree can be smaller than B_d.

3. Genetic Algorithm for Binary Decision Tree Construction

Genetic algorithms can provide good solutions to many optimization problems [11, 12]. They are based on the natural processes of evolution and the survival-of-the-fittest concept. In order to use the
Fig. 5: Encoding and decoding schema: (a) the encoded tree T^e for T^f in Fig. 4 and (b) its chromosome attribute-selection scheduling string S^f.

genetic algorithm process, one must define at least the following four steps: encoding, genetic operators such as mutation and crossover, decoding, and the fitness function.

3.1 Encoding

For genetic algorithms to construct decision trees, the decision trees must be encoded so that genetic operators, such as mutation and crossover, can be applied. Let A = {a_1, a_2, ..., a_d} be the ordered attribute list. We illustrate the process by considering the full binary decision tree T^f in Fig. 4, where A contains d = 4 attributes. Graphically speaking, the encoding process converts each attribute name in T^f into the index of that attribute in the ordered attribute list A, recursively, starting from the root, as depicted in Fig. 5. For example, the root attribute is the third element of A, so it is encoded as 3. Recursively, for each sub-tree, A is updated by removing the chosen attribute, leaving a three-element list. The possible integer values at a node in the i-th level of the encoded decision tree T^e are thus from 1 to d - i + 1. Finally, a breadth-first traversal generates the chromosome string S, whose structure stems from the B_d function. For T^f the chromosome string S^f is given in Fig. 5(b). T_2 and T_3 in Fig. 2 are encoded into strings S_2 and S_3, respectively, where a don't-care position (one whose node is already homogeneous) can hold any number within the restricted bounds. We call such a string a chromosome attribute-selection scheduling string S, to which genetic operators can be applied. Properties of S include:

Property 3. The parent position of position i is ⌊i/2⌋, except for i = 1, the root.

Property 4. The left and right child positions of position i are 2i and 2i + 1, respectively, if i <= 2^{d-2} - 1; otherwise, there are no children.

Property 5. The length of S is exponential in d: |S| = 2^{d-1} - 1.

Property 6. The possible integer values at position i are 1 to d - ⌊log₂ i⌋: S_i ∈ {1, ..., d - ⌊log₂ i⌋}.
Property 6 provides the restricted bounds for each position of S.
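Under our reading of Properties 3-6 (1-based positions, |S| = 2^{d-1} - 1), the navigation and bounds can be sketched in Python; multiplying the per-position ranges also reproduces the B_d values of Table 1, as Property 7 states. The function names are ours:

```python
from math import floor, log2

def schedule_length(d):
    """Property 5: |S| = 2^(d-1) - 1 free positions."""
    return 2 ** (d - 1) - 1

def parent(i):
    """Property 3: parent of 1-based position i (None for the root)."""
    return i // 2 if i > 1 else None

def children(i, d):
    """Property 4: (left, right) child positions of i, or None."""
    if 2 * i + 1 <= schedule_length(d):
        return 2 * i, 2 * i + 1
    return None

def value_bound(i, d):
    """Property 6: position i holds a value in 1 .. d - floor(log2 i)."""
    return d - floor(log2(i))

def num_schedules(d):
    """Property 7: the number of distinct schedules, equal to B_d."""
    prod = 1
    for i in range(1, schedule_length(d) + 1):
        prod *= value_bound(i, d)
    return prod
```

For d = 4 this gives a 7-position string with per-position bounds 4, 3, 3, 2, 2, 2, 2, whose product is 576 = B_4.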
Fig. 6: Mutation operator.

Property 7. The number of possible strings S is π(S) = ∏_{i=1}^{|S|} (d - ⌊log₂ i⌋) = B_d. The lower and upper bounds of π(S) are Ω(2^{|S|}) and o(d^{|S|}).

3.2 Genetic Operators

Two of the most common genetic operators are mutation and crossover. The mutation operator changes the value at a certain position of a string to one of the other possible values in that position's range. We illustrate the mutation process on the attribute-selection scheduling string S^f in Fig. 6. If a mutation occurs in the first position and changes the value to 4, which is in the range {1, ..., 4}, T^f_4 is generated. If a mutation happens in the third position and changes the value to another value in the range {1, ..., 3}, then T^f_5 is generated. As long as the changed value is within the allowed range, the resulting new string always generates a valid full binary decision tree. Consider a string S_x for which a full binary decision tree is built according to the schedule and the final decision tree is T_2 in Fig. 2. There are several equivalent schedules that produce T_2, since a don't-care position may hold any value in its range. Therefore, for mutation to be effective, we apply the mutation operator only to positions that are not don't-care positions.

To illustrate the crossover operator, consider the two parent attribute-selection scheduling strings P_1 and P_2 in Fig. 7. After randomly selecting a split point, the first part of P_1 and the last part of P_2 are combined to yield a child string S_6. Reversing the roles produces a second child S_7. The resulting full binary decision trees for these two children are T^f_6 and T^f_7, respectively.

3.3 Decoding

Decoding is the reverse of the encoding process in Fig. 5. Starting from the root node, we place attributes according to the chromosome schedule S, which contains index values into the attribute list A. When an attribute a is selected, D is divided into left and right branches D_L and D_R.
D_L consists of all the x_i having the value 0 for a, and D_R consists of all the x_i having the value 1. For each pair of sub-trees we repeat the process recursively with the new attribute list A = A - {a}. When a node becomes homogeneous, i.e., all class values in its database are the same, we label it as a leaf. Fig. 8 displays the decision trees decoded from S_4, S_5, S_6, and S_7, respectively.
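A minimal sketch of the two operators of Section 3.2 on such scheduling strings (our naming; the don't-care refinement mentioned above is omitted for brevity):

```python
import random
from math import floor, log2

def value_bound(i, d):
    """Property 6: legal values at 1-based position i are 1 .. d - floor(log2 i)."""
    return d - floor(log2(i))

def mutate(s, d, rng=random):
    """Replace one position of schedule s with a value from its legal range,
    so the result still decodes to a valid full binary decision tree."""
    s = list(s)
    i = rng.randrange(len(s))                 # 0-based index; position is i + 1
    s[i] = rng.randint(1, value_bound(i + 1, d))
    return s

def crossover(p1, p2, rng=random):
    """One-point crossover. Both children respect the bounds automatically,
    because position i has the same legal range in every schedule."""
    cut = rng.randrange(1, len(p1))
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
```

Because every position carries its own fixed range, closure under both operators is immediate, which is exactly the validity argument made above for mutation and crossover.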
Fig. 7: Crossover operator.

Fig. 8: Decoded decision trees.

Sometimes a chromosome introduces mutants. For instance, consider a chromosome S_8 that results in the tree T_8 in Fig. 9. The mutant occurs when, at a node, the database has non-homogeneous labels but the selected attribute's column in the database contains either all 0s or all 1s. In other words, that attribute provides no information gain, and one branch receives an empty database. We refer to such a decision tree as a mutant tree for two reasons. First, what value should we put in the empty leaf? A label may be chosen at random, but the empty sub-database provides no clue. Second, if we allow entering a value for the empty leaf, it may violate Property 2: the number of leaves l may exceed n. Indeed, T_8 behaves identically to T_6 with respect to D. Thus, mutant decision trees will not be chosen as the fittest trees according to the fitness functions presented in the next section.

3.4 Fitness Functions

Each attribute-selection scheduling string S must be evaluated according to a fitness function. We consider two cases: in the first case D contains no contradicting instances and d is a small finite number; in the other case d is very large. Contradicting instances are those whose attribute values are
Fig. 9: Mutant binary decision tree.

identical but whose target values are different. The case with small d and no contradicting instances has application to network function representation [20].

Theorem 1. Every attribute-selection scheduling string S produces a consistent full binary decision tree as long as the set of instances contains no contradicting instances.

Proof. Every attribute-selection scheduling string S produces a full binary decision tree, e.g., as in Fig. 4. Since each instance in D corresponds to a certain branch, add the leaves with the target values from the set of instances accordingly. If a target value is not available in the training set, add a random target value. The tree then realizes a complete truth table, as the number of leaves is 2^d, and the value predicted by the decision tree is always consistent with the actual target value in D. If there were two instances with the same attribute values but different classes, no consistent binary decision tree would exist.

Since the databases we consider contain no contradicting instances, one obvious fitness function is the size of the tree: the fitness function f_s is the total number of nodes, e.g., f_s(T_2) = 11 and f_s(T_3) = 9. Here, however, we use a different fitness function f_d: the sum of the depths of the leaf nodes reached by each instance. This is equivalent to the total number of questions asked over all instances, e.g., f_d(T_2) = 18 and f_d(T_3) = f_d(T_6) = 15, as shown in Table 2. This requires the assumption that each instance in D has equal prior probability. Both of these evaluation functions suggest that T_3 and T_6 are better than T_2, the tree produced by the ID3 algorithm. With two decision trees of the same size (2l - 1), where l is the number of leaves, the numbers of questions to be asked can still differ, by Theorem 2.
Theorem 2. There exist T_x and T_y such that f_s(T_x) = f_s(T_y) and f_d(T_x) < f_d(T_y).

Proof. Let T_x be a balanced binary tree and T_y a skewed (linear) binary tree. Then f_d(T_x) = Θ(n log n) whereas f_d(T_y) = Θ(n²) = Θ(∑_{i=1}^{n} i).
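With a tree represented as nested tuples (attr_index, left, right) and a bare label for a leaf (a representation of our choosing, not the paper's), f_s and f_d are short recursions:

```python
def f_s(tree):
    """Fitness f_s: total number of nodes (2l - 1 for l leaves, Property 1)."""
    if not isinstance(tree, tuple):
        return 1                                   # a leaf
    _, left, right = tree
    return 1 + f_s(left) + f_s(right)

def f_d(tree, instances):
    """Fitness f_d: sum over all instances of the depth of the leaf
    reached, i.e. the total number of questions asked."""
    def depth(t, x, k=0):
        if not isinstance(t, tuple):
            return k
        attr, left, right = t
        return depth(left if x[attr] == 0 else right, x, k + 1)
    return sum(depth(tree, x) for x in instances)
```

A balanced and a skewed tree with the same f_s can differ in f_d, which is Theorem 2 in miniature.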
Table 2: Depths of the six training instances x_1-x_6 in the decision trees T_1-T_8, with column sums f_d(T_1), ..., f_d(T_8) = 17, 18, 15, 18, 17, 15, 18, 17.

The f_d fitness function prefers not only shorter decision trees to larger ones, but also balanced decision trees to skewed ones.

Corollary 1. n log₂ c <= f_d(T_x) <= ∑_{i=1}^{n} i, where n is the number of instances.

Proof. By Property 2, in the best case the decision tree has l = c leaves and is completely balanced, with height log₂ c; thus the lower bound for f_d(T_x) is n log₂ c. In the worst case, the number of leaves is n by Property 2 and the decision tree is a skewed (linear) binary tree; thus the upper bound is ∑_{i=1}^{n} i.

If f_d(T_x) < n log₂ c, then T_x classifies only a subset of the classes; regardless of its size, such a tree should not be selected. If f_d(T_x) > ∑_{i=1}^{n} i, there is a mutant node and T_x can be shortened.

When d is large, the size of the chromosome |S| explodes by Property 5. In this case we must limit the height of the decision tree: the chromosome has a finite length and guides the selection of attributes only up to a certain height. Since typically n ≪ 2^d, a good choice for the height of the decision tree is h = ⌈log₂ n⌉, i.e., |S| < n. When a branch reaches the height h, the node is assigned to be a leaf with the class whose prior probability is highest, instead of choosing another attribute. When the height is limited, decision trees may no longer be consistent with D. Suppose that the height is limited to 3, i.e., h = 3 in the case of Fig. 3. Fig. 10 shows the pruned binary decision tree T_9. At the 4th node in breadth-first traversal order, instead of choosing another attribute as in Fig. 3, the node is labeled with the class whose prior probability in D_LL is higher. When the prior probabilities are the same, either class can be chosen. Let f_a be the fitness function that gives the percentage of instances correctly classified by the decision tree.
In Fig., {x,x,x 4,x 5 } are correctly predicted by T 9 and hence f a (T 9 ) = 4/6. We randomly generated a 6 binary attribute training database and limited the tree height to 8. Fig. shos the initial population of binary decision trees generated by random chromosomes in respect to their f d (T x ) and accuracy f a (T x ) on the training examples. The size of decision trees and their accuracies on the training examples have no correlations. 4. iscussion In this paper, e revieed binary decision trees and demonstrated ho to utilize genetic algorithms to find compact binary decision trees. y limiting the tree s height the presented method guarantees finding a better or equal decision tree than the best knon algorithms since such trees can be put in the initial population. Methods that apply Gs directly to decision trees [6, 7] can yield subtrees that are never visited as shon in Fig.. fter mutation operator in O node in T a, T c has a dashed subtree that
Fig. 10: Pruned decision tree at the height of 3.

Fig. 11: Binary decision trees with respect to f_d and accuracy f_a.

is never visited. After crossover between T_a and T_b, the child trees T_d and T_e also have dashed subtrees. These unnecessary subtrees occur whenever an attribute occurs more than once in a path of a decision tree. With the encoding presented here, however, this problem never occurs, as illustrated in Fig. 13. When d is large, limiting the height is recommended.

The encoding of the binary decision tree in this paper utilized breadth-first traversal. However, pre-order depth-first traversal can also be utilized, as shown in Fig. 13. The mutation presented in this paper remains valid in the pre-order depth-first representation, but crossover may cause a mutant when the two subtrees to be switched are at different levels.
The index number of a node may exceed its limit due to a crossover; thus a mutation immediately after crossover is inevitable. Analyzing and implementing the depth-first traversal of encoded decision trees remains ongoing work. Encoding and decoding non-binary decision trees, where different attributes have different numbers of possible values, is an open problem.

References
[1] Mitchell, T. M., Machine Learning, McGraw-Hill, 1997.
[2] Duda, R. O., Hart, P. E., and Stork, D. G., Pattern Classification, 2nd Ed., Wiley-Interscience, 2001.
[3] Quinlan, J. R., Induction of decision trees, Machine Learning, 1(1), 81-106, 1986.
[4] Hyafil, L. and Rivest, R. L., Constructing optimal binary decision trees is NP-complete, Information Processing Letters, Vol. 5, No. 1, 15-17, 1976.
[5] Bodlaender, H. L. and Zantema, H., Finding Small Equivalent Decision Trees is Hard, International Journal of Foundations of Computer Science, Vol. 11, No. 2, World Scientific Publishing, 2000, pp. 343-354.
[6] Safavian, S. R. and Landgrebe, D., A survey of decision tree classifier methodology, IEEE Transactions on Systems, Man and Cybernetics, Vol. 21, No. 3, pp. 660-674, 1991.
[7] Gehrke, J., Ganti, V., Ramakrishnan, R., and Loh, W., BOAT - Optimistic Decision Tree Construction, in Proc. of the ACM SIGMOD Conference on Management of Data, 1999, pp. 169-180.
[8] Zhao, Q. and Shirasaka, M., A Study on Evolutionary Design of Binary Decision Trees, in Proceedings of the Congress on Evolutionary Computation, Vol. 3, IEEE, 1999, pp. 1988-1993.
[9] Bennett, K. and Blue, J., Optimal decision trees, Tech. Rpt. No. 214, Department of Mathematical Sciences, Rensselaer Polytechnic Institute, Troy, New York, 1996.
[10] Pajunen, P. and Girolami, M., Implementing decisions in binary decision trees using independent component analysis, in Proceedings of ICA 2000, pp. 477-481.
[11] Goldberg, D. E., Genetic Algorithms in Search, Optimization, and Machine Learning, Addison-Wesley, 1989.
[12] Mitchell, M., An Introduction to Genetic Algorithms, MIT Press, 1996.
[13] Kyoung Min Kim, Joong Jo Park, Myung Hyun Song, In Cheol Kim, and Ching Y. Suen, Binary Decision Tree Using Genetic Algorithm for Recognizing Defect Patterns of Cold Mill Strip, LNCS Vol. 3060, Springer, 2004.
[14] Teeuwsen, S. P., Erlich, I., El-Sharkawi, M. A., and Bachmann, U., Genetic algorithm and decision tree-based oscillatory stability assessment, IEEE Transactions on Power Systems, Vol. 21, Issue 2, May 2006, pp. 746-753.
[15] Bala, J., Huang, J., Vafaie, H., DeJong, K., and Wechsler, H., Hybrid learning using genetic algorithms and decision trees for pattern classification, in Proceedings of the 14th International Joint Conference on Artificial Intelligence, Montreal, Canada, 1995, pp. 719-724.
[16] Papagelis, A. and Kalles, D., GA Tree: genetically evolved decision trees, in Proceedings of the 12th IEEE International Conference on Tools with Artificial Intelligence, 2000, pp. 203-206.
[17] Fu, Z., An Innovative GA-Based Decision Tree Classifier in Large Scale Data Mining, LNCS Vol. 1704, Springer, 1999, pp. 348-353.
[18] Cha, S. and Tappert, C. C., Constructing Binary Decision Trees using Genetic Algorithms, in Proceedings of the International Conference on Genetic and Evolutionary Methods, July 14-17, 2008, Las Vegas, Nevada.
[19] Grünwald, P., The Minimum Description Length Principle, MIT Press, June 2007.
[20] Martinez, T. R. and Campbell, D. M., A Self-Organizing Binary Decision Tree for Incrementally Defined Rule-Based Systems, IEEE Transactions on Systems, Man, and Cybernetics, Vol. 21, No. 5, pp. 1231-1238, 1991.
Fig. 12: Direct GA on decision trees.

Fig. 13: Pre-order depth-first traversal for decision trees.