Using Decision Trees to Study the Convergence of Phylogenetic Analyses

Grant Brammer and Tiffani L. Williams

Abstract

In this paper, we explore the novel use of decision trees to study the convergence properties of phylogenetic analyses. A decision tree is constructed from the evolutionary relationships (or bipartitions) found in the evolutionary trees returned from a phylogenetic analysis. We treat evolutionary trees returned from multiple runs of a phylogenetic analysis as different classes. Then, we use the depth of a decision tree as a technique to measure how distinct the runs are from each other. Decision trees with shallow depth reflect non-convergence since the evolutionary trees can be classified with little information. Deep decision trees reflect convergence. We study Bayesian and maximum parsimony phylogenetic analyses consisting of thousands of trees. For some datasets studied here, a single distinguishing bipartition can classify the entire tree collection, suggesting non-convergence of the underlying phylogenetic analysis. Thus, we believe that decision trees lead to new insights with the potential for helping biologists reconstruct more robust evolutionary trees.

I. INTRODUCTION

Given a collection of organisms (or taxa), the objective of a phylogenetic analysis is to produce an evolutionary tree describing the genealogical relationships between the taxa. Bayesian inference, as implemented in the MrBayes [1] software package, is one of the most popular approaches for reconstructing evolutionary trees. Although Bayesian inference is very powerful, closed-form solutions are unlikely given the problem sizes of interest (hundreds to thousands of taxa, eventually scaling to reconstructing the Tree of Life, which is estimated to contain between 10 and 100 million taxa).
Markov Chain Monte Carlo (MCMC) sampling is a very versatile yet computationally intensive procedure, which produces samples of parameter values from the posterior distribution. Unfortunately, the main problem with MCMC simulations is assessing whether they have converged. The resulting samples come from the true distribution only after convergence [2]. For phylogenetic inference, non-convergence implies that the MCMC sampling did not sample from the distribution of the true evolutionary tree, leading to an inaccurate estimate of the true tree.

In this paper, we use decision tree learning as a novel approach to study whether a phylogenetic analysis has converged. A phylogenetic analysis takes as input a set of molecular sequences and typically outputs thousands of trees. Our objective is to examine these output trees for evidence that the phylogenetic analysis converged. Traditionally, life scientists take the collection of output trees produced by a phylogenetic analysis and summarize them with a single consensus tree. However, this results in an underestimation of the data [3], [4]. Furthermore, consensus trees provide little insight when studying the convergence properties of the underlying phylogenetic analysis.

Many phylogenetic methods employ computational intelligence techniques to infer the true evolutionary tree. In our work, we use a traditional machine learning approach (decision trees) to detect whether a phylogenetic analysis converged. In our construction of decision trees, the evolutionary relationships contained in the phylogenetic trees are the interior nodes of the decision trees. The leaves of a decision tree reflect the run that produced the trees of interest.

Grant Brammer and Tiffani Williams are with the Department of Computer Science and Engineering, Texas A&M University, College Station, Texas, USA ({grb, tlw}@cse.tamu.edu).
Since the most popular phylogenetic techniques (such as Bayesian inference) attempt to solve NP-hard optimization problems, multiple runs of a phylogenetic algorithm or heuristic are executed in order to escape local optima and ensure convergence. However, there are no good measures for whether either of these goals is achieved by running a phylogenetic technique multiple times. We believe decision trees can assist in this regard.

The depth of a decision tree is the crucial feature for determining whether a phylogenetic analysis, consisting of multiple runs, has converged. We define the depth of a decision tree to be the length of the longest path from the root to a leaf node. Deep (high depth) decision trees reflect a high level of sharing among the evolutionary trees, since many evolutionary relationships have to be consulted in order to classify which runs produced the evolutionary trees. Shallow decision trees reflect less sharing of information across the runs. Hence, deep decision trees provide strong evidence that a phylogenetic analysis converged, while extremely shallow trees reflect non-convergence.

Our work leverages the idea that the depth of a decision tree can be used to measure the quality of a cluster. A similar idea has been used to develop new clustering methods [5]. Decision trees and feature selection are well-researched areas in the machine learning community, but they have not yet been thoroughly applied to phylogenetic data sets. In many ways, the work most similar in methods to ours is that which attempts to measure the quality of clusters produced by clustering phylogenetic trees. Stockham, Wang, and Warnow [4] cluster phylogenetic trees and use the concept of information loss as a measure of cluster quality, but do not address the issue of convergence. Unlike our work, most techniques for detecting convergence use tree scores [6], but rarely do they compare the evolutionary relationships contained in the trees to each other.
Comparing tree scores can be misleading, since trees with similar scores are not necessarily close in parameter (tree) space [7], [8], [9], [10].

In our experiments, we study three biological tree collections consisting of (i) 4,898, (ii) 20,000, and (iii) 33,306 evolutionary trees. The smallest tree collection was the result of using two maximum parsimony techniques, parsimony ratchet [11] and Rec-I-DCM3. The two larger collections, which come from published analyses and were obtained from life scientists, were the result of Bayesian analyses. Using the depth of a decision tree as our convergence criterion, our results show that the maximum parsimony analysis demonstrated convergence while both Bayesian analyses showed non-convergence. Surprisingly, non-convergence was depicted by a decision tree of depth one. For example, a single evolutionary relationship can be used to classify the whole 20,000-tree Bayesian data set. Thus, we believe that decision trees lead to new insights with the potential for helping biologists reconstruct more robust evolutionary trees.

II. BASICS

A. Evolutionary Trees and Their Newick Representation

An evolutionary tree (or phylogeny) is a depiction of the evolutionary relationships between a set of taxa (or organisms). Fig. 1 shows three example phylogenetic trees on six taxa. A phylogenetic tree can be uniquely defined by its set of bipartitions (or edges). When removed, a bipartition partitions the taxa into two sets. In Fig. 1, consider bipartition b_0 in tree t_0. It partitions the set of taxa into two groups: taxa a and b on one side and taxa c, d, e, and f on the other side. We represent this bipartition as ab|cdef; it is also shared by trees t_1 and t_2. In this paper, we only consider nontrivial bipartitions (internal edges). For a given set of n taxa, every tree has the same n trivial bipartitions, which correspond to external edges (i.e., edges connecting directly to taxa).
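The bipartition definition above can be illustrated with a short sketch. It is not the authors' implementation: it assumes a tree is written as nested Python tuples with taxa as one-letter strings, and the helper names (`leaves`, `label`, `nontrivial_bipartitions`) are hypothetical. Each bipartition is rendered in the ab|cdef style used in the text, with the smaller side first.

```python
def leaves(node):
    """Return the set of taxa below a node; a leaf is a plain string."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))


def label(side, taxa):
    """Render a bipartition as 'ab|cdef', smaller side first."""
    other = taxa - side
    parts = ("".join(sorted(side)), "".join(sorted(other)))
    return "|".join(sorted(parts, key=lambda s: (len(s), s)))


def nontrivial_bipartitions(tree):
    """Collect the bipartitions induced by the internal edges of a tree."""
    taxa = leaves(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaves(node)
        # Keep only splits with at least two taxa on each side.
        if len(side) >= 2 and len(taxa - side) >= 2:
            found.add(label(side, taxa))
        for child in node:
            visit(child)

    visit(tree)
    return found


# Tree t_0 from the text, written as nested tuples (an assumed encoding).
t0 = ((("a", "b"), ("c", "d")), ("e", "f"))
print(sorted(nontrivial_bipartitions(t0)))
# ['ab|cdef', 'cd|abef', 'ef|abcd']
```

Because the two edges incident to the root of the nested-tuple encoding induce the same split, storing the results in a set yields exactly three bipartitions for a six-taxon binary tree.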
Most phylogenetic trees are unrooted since determining the root is a complex procedure. The Newick format [12] is the most widely used format to store a phylogenetic tree in a file. In this format, the topology of the evolutionary tree is represented using a notation based on balanced parentheses. Consider tree t_0 in Fig. 1. A Newick representation of the unrooted topology of this tree is (((a, b), (c, d)), (e, f));, where ; symbolizes the end of the Newick string. Matching pairs of parentheses symbolize internal nodes in the evolutionary tree. The Newick representation of a tree is not unique. For example, another valid Newick string for tree t_0 is ((e, f), ((d, c), (b, a)));.

B. Reconstructing Evolutionary Trees

In order to study the convergence properties of phylogenetic analyses using decision trees, we consider two types of phylogenetic analyses: Bayesian and maximum parsimony inference. In our experiments, MrBayes [1], a very popular software package for phylogenetics, is used to reconstruct the evolutionary trees based on Bayesian inference. Under maximum parsimony, we use two hill-climbing heuristics, Rec-I-DCM3 [13] and parsimony ratchet [11], to reconstruct the phylogenies.

a) Bayesian inference: The Bayesian approach is based on a quantity called the posterior probability of a tree. Bayes' theorem

$$\Pr(\text{Tree} \mid \text{Data}) = \frac{\Pr(\text{Data} \mid \text{Tree}) \times \Pr(\text{Tree})}{\Pr(\text{Data})}$$

is used to combine the prior probability of a phylogeny, Pr(Tree), with the likelihood, Pr(Data | Tree), to produce a posterior probability distribution on trees, Pr(Tree | Data). The posterior probability represents the probability that the tree is correct. Inferences about the history of the group are then based on the posterior probability of trees. Typically, all trees are considered a priori equally probable, and the likelihood is calculated using a substitution model of evolution.
Computing the posterior probability involves a summation over all trees and, for each tree, integration over all possible combinations of branch length and substitution model parameter values. A number of numerical methods are available to allow the posterior probability to be approximated. The most common is Markov Chain Monte Carlo (MCMC). For the phylogeny problem, the MCMC algorithm involves two steps. First, a new tree is proposed by stochastically perturbing the current tree. Afterwards, the tree is either accepted or rejected with a probability described by Metropolis et al. [14] and Hastings [15]. If the new tree is accepted, then it is subjected to more perturbations. For a properly constructed and adequately run Markov chain, the proportion of the time that any tree is visited is a valid approximation of the posterior probability of that tree.

b) Maximum parsimony: The maximum parsimony (MP) optimization criterion for inferring the evolutionary history of different taxa assumes that each of the taxa in the input is represented by a string over some alphabet. The symbols in the alphabet can represent nucleotides (in which case the inputs are DNA or RNA sequences) or amino acids (in which case the inputs are protein sequences), or may even include discrete characters for morphological properties. It is also assumed that the strings are put into a multiple alignment, so that they all have the same length. Maximum parsimony then seeks a tree, along with inferred ancestral sequences, so as to minimize the total number of evolutionary events by counting only point mutations.

Most of the powerful heuristics [16], [11], [17] for solving the maximum parsimony problem incorporate hill-climbing in their design. Hill-climbing heuristics take an initial estimate of the phylogeny (e.g., a user-provided, random, or random sequence addition tree) and rearrange branches in it to reach neighboring trees. If a rearrangement yields a better scoring tree, it becomes the new best tree and is then submitted to a new round of rearrangements. The process continues until no better tree can be found in a full round.
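The hill-climbing loop described above can be sketched generically. This is only an illustration under stated assumptions: `neighbors` and `score` are hypothetical placeholders (for a real MP heuristic they would enumerate branch rearrangements and compute parsimony scores), lower scores are treated as better, and the toy problem at the end uses integers instead of trees.

```python
def hill_climb(start, neighbors, score):
    """Greedy local search: move to a better neighbor until none exists.

    `neighbors(x)` yields candidate solutions reachable from x, and
    `score(x)` returns a cost to minimize. Both are assumed callables
    supplied by the caller; they stand in for tree rearrangements and
    parsimony scoring.
    """
    best, best_score = start, score(start)
    improved = True
    while improved:
        improved = False
        for candidate in neighbors(best):
            candidate_score = score(candidate)
            if candidate_score < best_score:
                # A better candidate becomes the new best solution and
                # is itself submitted to a new round of rearrangements.
                best, best_score = candidate, candidate_score
                improved = True
                break
    # A full round produced no improvement: a local optimum was reached.
    return best, best_score


# Toy stand-in for tree search: integers with +/-1 moves.
print(hill_climb(0, lambda x: [x - 1, x + 1], lambda x: (x - 7) ** 2))
# (7, 0)
```

The search stops at the first solution with no better neighbor, which is why such heuristics can end in a local optimum and are typically run several times.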

Fig. 1. Three evolutionary trees (t_0, t_1, and t_2) on six taxa. The taxa are labeled a, b, ..., f. The bipartitions (b_0, b_9, b_10, b_11, b_12, b_13, and b_14) are labeled according to Fig. 2(b).

C. Consensus Trees

Both Bayesian and maximum parsimony inference can easily produce thousands of trees. Consensus trees summarize the information contained in a set of evolutionary trees whose leaves are all from the same set of taxa. Strict consensus is the most conservative of the consensus methods and produces a tree with only those phylogenetic relationships that are supported by all the source trees. A majority rule consensus tree contains those relationships that appear in more than half of the source trees.

Consider the three trees in Fig. 1. The bipartition b_0 appears in each of the trees. As a result, the unrooted Newick representation of the strict consensus tree is ((a, b), (c, d, e, f));. For the majority tree, the only bipartition that appears in more than half of the trees is b_0. Hence, for this example, the strict and majority consensus trees for the phylogenies in Fig. 1 are the same. Note also that all of the bipartitions in a strict consensus tree are contained in the majority tree.

III. CONSTRUCTING DECISION TREES FOR PHYLOGENETIC ANALYSES

We now give our approach for constructing decision trees for a collection of k evolutionary trees obtained from a phylogenetic analysis. Our approach consists of two steps. First, we construct a bipartition table that consists of the unique bipartitions associated with our set of k phylogenetic trees. The bipartition table also contains class labels identifying the run of the phylogenetic search heuristic that produced each of the k evolutionary trees. The second step uses the bipartition table to create the decision tree with the ID3 (Iterative Dichotomiser 3) data-mining algorithm [18].

A. Step 1: Constructing a Bipartition Table

Consider the set of twelve evolutionary trees depicted in Fig. 2(a). First, we enumerate all of the bipartitions in each of the evolutionary trees. For example, tree t_0, which is shown pictorially in Fig. 1, has bipartitions ab|cdef, cd|abef, and ef|abcd, and they are given bipartition identities (BIDs) b_0, b_9, and b_14, respectively. The BIDs for the unique bipartitions can be assigned in any manner. In our example, the bipartitions are sorted lexicographically and the BIDs are assigned in that order. Fig. 2(b) shows the set of unique bipartitions for the twelve trees in our example.

Each unique bipartition b_i represents a column in the bipartition table. The rows of the table consist of the input tree ids (t_i) of interest. For each column labeled b_i, a 0 or 1 is assigned to each row based on whether that tree contains the bipartition. Consider Table I. Bipartition b_0 appears in trees t_0, t_1, t_2, and t_3. It does not appear in the remaining eight trees.

Once each bipartition (or feature) is encoded for each of the rows, the final step is to encode the known class labels for each tree. In our study, there are two class labels of interest: run and algorithm. Oftentimes, a phylogenetic technique is executed multiple times in order to improve the convergence rate of the underlying phylogenetic analysis. For each run, we know which trees were produced by it. Finally, a life scientist, in addition to running a phylogenetic technique multiple times, might also execute different phylogenetic techniques or algorithms. Hence, we could use the type of algorithm used to produce a tree as a class label. In the bipartition table shown in Table I, we only use a run class label, where we assume that two different runs of a phylogenetic algorithm produced the set of twelve trees.

Fig. 2. Twelve evolutionary trees used as input to build the bipartition table in Table I. (a) shows the Newick representation of the twelve phylogenetic trees of interest. (b) provides a listing of the unique bipartitions that appear across the twelve trees.

(a) Newick strings
t_0: (((a, b), (c, d)), (e, f));
t_1: (((a, b), (c, e)), (d, f));
t_2: (((a, b), (c, f)), (d, e));
t_3: (((a, b), (c, d)), (e, f));
t_4: (((a, c), (b, d)), (e, f));
t_5: (((a, e), (b, d)), (c, f));
t_6: (((a, d), (b, c)), (e, f));
t_7: (((a, d), (b, c)), (e, f));
t_8: (((a, d), (b, f)), (c, e));
t_9: (((a, f), (b, c)), (d, e));
t_10: (((a, f), (b, d)), (c, e));
t_11: (((a, e), (b, c)), (d, f));

(b) Unique bipartitions
b_0: ab|cdef
b_1: ac|bdef
b_2: ad|bcef
b_3: ae|bcdf
b_4: af|bcde
b_5: bc|adef
b_6: bd|acef
b_7: be|acdf
b_8: bf|acde
b_9: cd|abef
b_10: ce|abdf
b_11: cf|abde
b_12: de|abcf
b_13: df|abce
b_14: ef|abcd

TABLE I. A bipartition table depicting the presence (1) or absence (0) of each bipartition (labeled b_i) in the twelve trees shown in Fig. 2, along with a column denoting the run that generated the tree. The bipartitions are our features and the run is our class label, since we know which trees a particular run generated. In this example, two runs generated the twelve trees. A bipartition table serves as input for the construction of a decision tree.

Trees  b0  b1  b2  b3  b4  b5  b6  b7  b8  b9 b10 b11 b12 b13 b14  run label
t0      1   0   0   0   0   0   0   0   0   1   0   0   0   0   1  r1
t1      1   0   0   0   0   0   0   0   0   0   1   0   0   1   0  r1
t2      1   0   0   0   0   0   0   0   0   0   0   1   1   0   0  r1
t3      1   0   0   0   0   0   0   0   0   1   0   0   0   0   1  r1
t4      0   1   0   0   0   0   1   0   0   0   0   0   0   0   1  r1
t5      0   0   0   1   0   0   1   0   0   0   0   1   0   0   0  r1
t6      0   0   1   0   0   1   0   0   0   0   0   0   0   0   1  r2
t7      0   0   1   0   0   1   0   0   0   0   0   0   0   0   1  r2
t8      0   0   1   0   0   0   0   0   1   0   1   0   0   0   0  r2
t9      0   0   0   0   1   1   0   0   0   0   0   0   1   0   0  r2
t10     0   0   0   0   1   0   1   0   0   0   1   0   0   0   0  r2
t11     0   0   0   1   0   1   0   0   0   0   0   0   0   1   0  r2

B. Step 2: Building a Decision Tree

Once the bipartition table is constructed, we use our implementation of the ID3 algorithm, which is a top-down greedy approach that selects features based on information gain. For our problem, the features are bipartitions, which are the internal nodes of the decision tree. Class labels (run or algorithm) are the leaves of the decision tree. Class labels are also considered target features. The ID3 algorithm takes a feature set (the presence or absence of each bipartition) and a target feature (i.e., which run the evolutionary trees came from) and computes a decision tree, where each internal node represents a bipartition and each edge represents the presence or absence of that bipartition.

Fig. 3. A decision tree constructed based on the input from Table I. Here, 3 bipartitions (out of 15) are needed to classify the twelve input trees by the run that generated them. The root b_0 (depth 0) leads to leaf r_1 when present (1) and to b_6 (depth 1) when absent (0); b_6 leads to b_4 (depth 2) when present and to leaf r_2 when absent; b_4 leads to leaf r_2 when present and to leaf r_1 when absent (depth 3).

Table I shows an example bipartition table as derived from the Newick strings in Fig. 2(a). Note that the last column in the bipartition table is the run label and not bipartition information. The ID3 algorithm computes the information gain of each bipartition relative to the run label. Information gain represents how well the bipartition information correlates with the run labels. The bipartition with the best information gain is selected and added as the root of the decision tree. In our example, bipartition b_0 is the root (see Fig. 3). All the trees with bipartition b_0 come from run r_1. Hence, the right child node is marked as run r_1 and has no children.
The trees that do not have bipartition b_0 are from both runs, so the process repeats on those trees to find the bipartition with the most information gain. The process continues until all the branches terminate in leaf nodes. Fig. 3 shows the resulting decision tree.

Equation (1) measures the entropy of a collection of trees:

$$\mathrm{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i \qquad (1)$$

For a given set S of examples, the sum iterates over the c different values that the examples can take, with p_i being the proportion of the examples that have the i-th value. To compute the entropy of the run labels, S is the rightmost column of Table I. We would iterate over the labels r_1 and r_2, with p_i being the proportion of each label.

Equation (2) computes information gain, where S is the set of examples and A is an attribute:

$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, \mathrm{Entropy}(S_v) \qquad (2)$$

Values(A) is the set of values that the attribute can have. For each v in Values(A), S_v is the subset of S for which attribute A has value v.
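As a worked illustration of Equations (1) and (2), the following sketch runs a small ID3 procedure on the twelve-tree example. It is a minimal sketch, not the authors' implementation. The rows list the BIDs of each tree as derived from the Newick strings in Fig. 2(a), with t_0 to t_5 labeled r1 and t_6 to t_11 labeled r2. Ties in information gain are broken in favor of the lowest BID, which is an assumption, and the function names and the "NS" return value for inseparable rows are illustrative choices.

```python
from collections import Counter
from math import log2

# Each row: (set of BIDs present in the tree, run label), t_0 .. t_11.
ROWS = [
    ({0, 9, 14}, "r1"), ({0, 10, 13}, "r1"), ({0, 11, 12}, "r1"),
    ({0, 9, 14}, "r1"), ({1, 6, 14}, "r1"), ({3, 6, 11}, "r1"),
    ({2, 5, 14}, "r2"), ({2, 5, 14}, "r2"), ({2, 8, 10}, "r2"),
    ({4, 5, 12}, "r2"), ({4, 6, 10}, "r2"), ({3, 5, 13}, "r2"),
]


def entropy(rows):
    """Equation (1): entropy of the run labels in `rows`."""
    counts = Counter(label for _, label in rows)
    return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())


def split(rows, bid):
    """Partition rows by presence or absence of bipartition `bid`."""
    present = [r for r in rows if bid in r[0]]
    absent = [r for r in rows if bid not in r[0]]
    return present, absent


def gain(rows, bid):
    """Equation (2): information gain of bipartition `bid`."""
    parts = split(rows, bid)
    rest = sum(len(p) / len(rows) * entropy(p) for p in parts if p)
    return entropy(rows) - rest


def id3(rows, bids):
    """Return a leaf label or a (bid, present_subtree, absent_subtree) node."""
    labels = {label for _, label in rows}
    if len(labels) == 1:
        return labels.pop()
    # Only bipartitions that actually divide the rows can be used.
    usable = [b for b in bids if all(split(rows, b))]
    if not usable:
        return "NS"  # identical rows with different labels: non-separable
    best = max(usable, key=lambda b: gain(rows, b))  # first maximum wins ties
    present, absent = split(rows, best)
    return (best, id3(present, bids), id3(absent, bids))


def depth(node):
    """Length of the longest root-to-leaf path."""
    if isinstance(node, str):
        return 0
    return 1 + max(depth(node[1]), depth(node[2]))


tree = id3(ROWS, range(15))
print(tree)         # (0, 'r1', (6, (4, 'r2', 'r1'), 'r2'))
print(depth(tree))  # 3
```

With this tie-breaking rule the sketch selects b_0, then b_6, then b_4, matching the bipartitions shown in Fig. 3. At the root, b_0 and b_5 have equal gain (each isolates four trees from a single run), so a different tie-breaking rule could produce a different but equally valid tree.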

IV. OUR DATASETS

The biological trees used in this study were obtained from two recent Bayesian analyses and a maximum parsimony (MP) analysis, which we describe below. The evolutionary trees from Datasets #1 and #2 were obtained from published analyses performed by biologists. The trees from Dataset #3 were obtained by us as a result of using well-recognized MP heuristics.

A. Dataset #1: 20,000 Trees on 150 Taxa

We received 20,000 trees from a Bayesian analysis of an alignment of 150 taxa (23 desert taxa and 127 others from freshwater, marine, and soil habitats) with 1,651 aligned sites [19]. Two independent runs consisting of 25 million generations (trees were sampled every 1,000 generations) were performed using the GTR+I+Γ model in MrBayes with four independent chains. The authors constructed a majority consensus tree in their study using the 20,000 trees from the last 10 million generations of each of the two runs.

B. Dataset #2: 33,306 Trees on 567 Taxa

We obtained 33,306 trees from a Bayesian analysis of a three-gene, 567-taxa (560 angiosperms, seven outgroups) dataset with 4,621 aligned characters, which is one of the largest Bayesian analyses done to date [20]. Twelve runs, with four chains each, using the GTR+I+Γ model in MrBayes ran for at least 10 million generations. Trees were sampled every 1,000 generations. The authors discuss the difficulties with combining trees from multiple runs. To obtain our collection of 33,306 trees, we discarded the trees from the first 8 million generations.

C. Dataset #3: 4,898 Trees on 567 Taxa

We inferred 4,898 trees from a maximum parsimony (MP) analysis of a set of 567 three-gene (rbcl, atpb, and 18s) aligned DNA sequences (2,153 sites) of angiosperms [21]. We used two MP algorithms, parsimony ratchet [11] and Rec-I-DCM3, to obtain the evolutionary trees. Each MP algorithm created 5,000 trees for a total of 10,000 trees over 567 taxa.
The parsimony ratchet algorithm used in this paper is called Pauprat, since we used a Perl script by Bininda-Emonds [22] to generate a PAUP* [17] batch file to run the parsimony ratchet heuristic. However, we were not interested in all 10,000 trees, only in those that are near-optimal. Biologists typically use only the top-scoring (or near-optimal) trees in their published phylogenetic studies. In our experiments, we use parsimony trees that are step_0, step_1, and step_2 away from the best-known maximum parsimony score for this dataset, which is b = 44,165. Let x represent the parsimony score of a tree t_i. Then, tree t_i is x - b steps away from the best score. In our experiments, step_0, step_1, and step_2 represent trees that are 0, 1, and 2 steps away from the best score b, respectively. Between the two algorithms, there are 4,898 trees that fit these criteria.

V. RESULTS AND DISCUSSION

In our experiments, we compute the depths of the decision trees built from our collections of phylogenetic trees to study convergence. Shallow decision trees may indicate that a phylogenetic analysis did not converge. Deeper decision trees would indicate convergence.

A. Dataset #1: 150 Taxa Bayesian Trees

As explained in Section IV-A, two runs produced the 20,000 evolutionary trees for this dataset. Each run, r_0 and r_1, consisted of 10,000 trees. Surprisingly, after applying the ID3 algorithm to these trees to compare the two runs, a decision tree of depth one was created. In other words, there exists a single bipartition b_i that appears in all of the evolutionary trees from run r_0, but b_i does not appear in any evolutionary tree from run r_1. Hence, a single bipartition can distinguish between the two sets of evolutionary trees (20,000 trees in total). We refer to such a bipartition as a distinguishing bipartition.
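The definition of a distinguishing bipartition can be checked directly, without building a decision tree. The sketch below is an illustration under stated assumptions: each run is a non-empty list of Python sets of bipartition strings, the function name is hypothetical, and the two toy runs reuse trees t_0, t_1, t_6, and t_8 from Fig. 2(a) rather than real Bayesian output.

```python
def distinguishing_bipartitions(run_a, run_b):
    """Return bipartitions found in every tree of one run and no tree of the other.

    Each run is a non-empty list of sets of bipartitions. The result is a
    pair: (in all of run_a and none of run_b, in all of run_b and none of
    run_a).
    """
    in_all_a = set.intersection(*run_a)
    in_all_b = set.intersection(*run_b)
    in_any_a = set.union(*run_a)
    in_any_b = set.union(*run_b)
    return in_all_a - in_any_b, in_all_b - in_any_a


# Toy runs built from trees in Fig. 2(a): t_0 and t_1 versus t_6 and t_8.
run_a = [
    {"ab|cdef", "cd|abef", "ef|abcd"},
    {"ab|cdef", "ce|abdf", "df|abce"},
]
run_b = [
    {"ad|bcef", "bc|adef", "ef|abcd"},
    {"ad|bcef", "bf|acde", "ce|abdf"},
]
only_a, only_b = distinguishing_bipartitions(run_a, run_b)
print(only_a, only_b)  # {'ab|cdef'} {'ad|bcef'}
```

A non-empty result for either side means that a decision tree of depth one exists for the pair of runs, since any one of the returned bipartitions classifies every tree correctly.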
Given that a single bipartition can classify the two phylogenetic runs, there is poor mixing of solutions from the different runs, which is symptomatic of non-convergence. Even more interesting is that, for this dataset, distinguishing bipartitions would not appear in the strict or majority consensus tree. Since each run contains an equal number of trees, a distinguishing bipartition for this dataset cannot appear in more than half of the trees, which would be a requirement to appear in the majority tree. If a bipartition is not in the majority tree, it cannot appear in the strict consensus, which requires the bipartition to appear in all 20,000 of the trees. Information related to distinguishing bipartitions would be overlooked by a consensus-tree-based analysis.

A decision tree of depth one means that there is at least one distinguishing bipartition, but how many distinguishing bipartitions exist for this dataset? For the 150-taxa data set, there are four bipartitions that one run contains but the other does not. Two distinguishing bipartitions appeared in every evolutionary tree in run r_0 but never in run r_1, and two appeared in every tree in run r_1 but never in run r_0. Distinguishing bipartitions represent phylogenetic information that is strong in one region of tree space (returned every time in run r_0) and weak in others (returned none of the time in run r_1). If the phylogenetic analysis had been executed only once, there would have been very strong support or lack of support (depending on the run) for these bipartitions.

B. Dataset #2: 567 Taxa Bayesian Trees

Next, we turn to the 567-taxa set of 33,306 Bayesian trees. As explained in Section IV-B, these evolutionary trees were collected over twelve runs of MrBayes [1], one of the most popular software packages for reconstructing evolutionary trees. Table II shows the number of trees that were produced by each run.
TABLE II. The number of 567-taxa Bayesian trees in each run (r_0 to r_11). There are 12 total runs and 33,306 total trees.

TABLE III. By comparing pairs of Bayesian runs (r_0 to r_11), the upper triangle shows the depths of the constructed decision trees. The lower triangle provides the number of different bipartitions that can account for a decision tree of depth one. Decision trees of depths other than one are shaded in gray.

TABLE IV. The upper triangle of this table shows the number of unique bipartitions when comparing pairs of Bayesian runs (r_0 to r_11). The lower triangle depicts the percentage of total bipartitions used in the decision tree.

Similarly to the 150-taxa dataset of Bayesian trees, our target feature is the run label. In our experiments, we created decision trees to measure the similarity of each pair of runs. The upper triangle of Table III shows the depths of these decision trees. As with the 150-taxa Bayesian trees, there are distinguishing bipartitions. However, given that there are twelve runs (instead of two), there are many trees of depth one. The few values greater than one are shaded in gray in the table. For such run comparisons, a single distinguishing bipartition cannot classify the two sets of trees. For example, classifying runs r_1 and r_5 requires looking at 11 bipartitions (or a decision tree of depth 11) in order to determine the run that generated one of the 6,163 trees. Although the two runs do not contain the same trees, there are no bipartitions that appeared in all of one run and none of the other. The decision must be made by looking at a combination of bipartitions instead of a single bipartition. Deeper decision trees require more bipartitions to distinguish the sets of evolutionary trees, which in turn reflects good mixing of the underlying phylogenetic information (bipartitions) and is a sign of convergence between the two runs.
If two runs contained identical trees, it would be impossible for the ID3 algorithm to separate the trees into two different classes based on the same bipartition information. Hence, the fact that we were able to produce these decision trees means that no two runs returned the same tree.

Next, we count the number of distinguishing bipartitions that create a decision tree of depth one. The lower triangle of Table III shows the results. For example, there are 41 distinguishing bipartitions separating runs r_2 and r_6. The presence of distinguishing bipartitions suggests that the phylogenetic analysis that produced these trees did not converge. There are strong distinctions between the bipartitions reported across runs. Moreover, the various runs appear to be stuck in local optima and return results influenced by their independent areas of the exponentially-sized tree space. Table IV shows that the number of distinguishing bipartitions increases drastically when comparing pairs of runs.

C. Dataset #3: 567 Taxa Maximum Parsimony Trees

For our final analysis, we apply decision trees to evolutionary trees obtained from two maximum parsimony (MP) heuristics, Rec-I-DCM3 and Pauprat. Table V shows the number of evolutionary trees collected from each run. Each of the MP heuristics required five runs to produce its set of evolutionary trees. Runs r_0, r_1, r_2, r_3, and r_4 were generated by Pauprat, while the remaining runs were obtained from Rec-I-DCM3. As with the Bayesian tree datasets, Table VI provides the depth of the resulting decision trees using runs as the target feature. The results show that the Rec-I-DCM3 runs are quite self-similar. Many of the run-by-run comparisons are non-separable, which we denote by the label NS. In other words, the two runs being compared returned at least one identical tree. As a result, the bipartition information cannot separate the evolutionary trees across the runs. In comparison to the Bayesian trees, the MP trees are more similar.
Given the depths of the resulting decision trees, the MP trees represent a better example of convergence than their Bayesian counterparts. Based on the percentages of bipartitions used to create the decision trees from the lower triangles of Tables IV and VII, the parsimony trees are up to two orders of magnitude more similar to each other than their Bayesian counterparts.

VI. CONCLUSIONS AND FUTURE WORK

Determining whether a phylogenetic analysis has converged is an important problem in phylogenetics. Without convergence, the robustness of a phylogenetic analysis is unclear, which leads to inaccurate hypotheses of how the taxa evolved from a common ancestor. Popular phylogenetic approaches often leverage computational intelligence techniques to infer the true evolutionary tree. In this paper, we have shown how to use the depth of a decision tree (a traditional machine learning technique) as a measure of convergence for a phylogenetic analysis. Currently, there are relatively few techniques available to help biologists determine whether their analyses have converged. In addition to decision tree depth, the novelty of our work is in using bipartitions as the foundation for determining whether there is sufficient mixing of information to justify convergence.

In our study of three large biological tree collections obtained from Bayesian and maximum parsimony analyses, we showed that non-convergence was a property of the Bayesian analyses, which resulted in decision trees of depth one. That is, one bipartition is sufficient to classify thousands of trees into two groups (or runs). The maximum parsimony trees, on the other hand, resulted in decision trees of high depth, which is a requirement for phylogenetic convergence. We believe decision trees are a step forward in terms of defining concrete measures of convergence that biologists can use to build more robust and accurate evolutionary trees.
Moreover, our results can potentially be used to help design better phylogenetic heuristics, especially with respect to avoiding local optima, which are a major source of non-convergence. Our future work includes studying more datasets with our new convergence measure, incorporating tree scores into our convergence framework, comparing decision trees to topological measures such as the Robinson-Foulds distance [23], and expanding our understanding of distinguishing bipartitions.

VII. ACKNOWLEDGMENTS

This project was supported by the National Science Foundation under grants DEB and IIS

REFERENCES

[1] J. P. Huelsenbeck and F. Ronquist, MRBAYES: Bayesian inference of phylogenetic trees, Bioinformatics, vol. 17, no. 8, pp ,
[2] J. Peltonen, J. Venna, and S. Kaski, Visualizations for assessing convergence and mixing of Markov chain Monte Carlo simulations, Comput. Stat. Data Anal., vol. 53, no. 12, pp ,
[3] D. M. Hillis, T. A. Heath, and K. St. John, Analysis and visualization of tree space, Syst. Biol., vol. 54, no. 3, pp ,
[4] C. Stockham, L. S. Wang, and T. Warnow, Statistically based postprocessing of phylogenetic analysis by clustering, in Proceedings of 10th Int'l Conf. on Intelligent Systems for Molecular Biology (ISMB'02), 2002, pp
[5] B. Liu, Y. Xia, and P. S. Yu, Clustering through decision tree construction, in CIKM '00: Proceedings of the Ninth International Conference on Information and Knowledge Management. New York, NY, USA: ACM, 2000, pp
[6] A. Rambaut and A. J. Drummond, Tracer v1.4, [Online]. Available:
[7] J. P. Huelsenbeck, B. Larget, R. Miller, and F. Ronquist, Potential applications and pitfalls of Bayesian inference of phylogeny, Syst. Biol., vol. 51, p. 673,
[8] J. A. A. Nylander, J. C. Wilgenbusch, D. L. Warren, and D. L. Swofford, AWTY (are we there yet?): a system for graphical exploration of MCMC convergence in Bayesian phylogenetics, Bioinformatics, vol. 24, no. 4, pp , [Online]. Available: bioinformatics24.html#nylanderwws08
[9] N. J, Bayesian phylogenetic analysis of combined data, Syst. Biol., vol. 53, p. 47,
[10] S.-J. Sul, S. Matthews, and T. L. Williams, Using tree diversity to compare phylogenetic heuristics, BMC Bioinformatics, vol. 10 (Suppl 4), no. S3,
[11] K. C. Nixon, The parsimony ratchet, a new method for rapid parsimony analysis, Cladistics, vol. 15, pp ,
[12] J. Felsenstein, The Newick tree format, Internet Website, last accessed September 2009, newick URL: washington.edu/phylip/newicktree.html.

TABLE V
THE NUMBER OF MAXIMUM PARSIMONY TREES IN EACH RUN. RUNS r0 TO r4 WERE OBTAINED FROM PAUPRAT. THE REMAINING TREES WERE COLLECTED FROM REC-I-DCM3.
Dataset #3: 567 taxa maximum parsimony trees
run r0 r1 r2 r3 r4 r5 r6 r7 r8 r9

TABLE VI
THE DEPTH OF THE DECISION TREES WHEN COMPARING PAIRS OF MAXIMUM PARSIMONY RUNS. NS STANDS FOR NON-SEPARABLE. THE BIPARTITIONS FROM EACH RUN ARE SO SIMILAR IN THESE CASES THAT THE ID3 ALGORITHM CANNOT CLASSIFY THEM.
Dataset #3: 567 taxa maximum parsimony trees
run r0 r1 r2 r3 r4 r5 r6 r7 r8 r9
r r r r r r NS NS 39 NS r NS NS NS r NS r NS

TABLE VII
THE UPPER TRIANGLE OF THIS TABLE SHOWS THE NUMBER OF UNIQUE BIPARTITIONS WHEN COMPARING PAIRS OF MAXIMUM PARSIMONY RUNS. THE LOWER TRIANGLE DEPICTS THE PERCENTAGE OF TOTAL BIPARTITIONS USED IN THE DECISION TREE.
Dataset #3: 567 taxa maximum parsimony trees
run r0 r1 r2 r3 r4 r5 r6 r7 r8 r9
r r r r r r r NS r NS NS r NS r NS NS NS NS -

[13] U. Roshan, B. M. E. Moret, T. L. Williams, and T. Warnow, Rec-I-DCM3: a fast algorithmic technique for reconstructing large phylogenetic trees, in Proc. IEEE Computer Society Bioinformatics Conference (CSB 2004). IEEE Press, 2004, pp
[14] N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller, Equation of state calculations by fast computing machines, J. Chem. Phys., vol. 21, pp ,
[15] W. Hastings, Monte Carlo sampling methods using Markov chains and their applications, Biometrika, vol. 57, pp ,
[16] P. A. Goloboff, J. S. Farris, and K. C. Nixon, TNT, a free program for phylogenetic analysis, Cladistics, vol. 24, no. 5, pp ,
[17] D. L. Swofford, PAUP*: Phylogenetic analysis using parsimony (and other methods), 2002, Sinauer Associates, Sunderland, Massachusetts, Version 4.0.
[18] T. M. Mitchell, Machine Learning. New York: McGraw-Hill,
[19] L. A. Lewis and P. O. Lewis, Unearthing the molecular phylodiversity of desert soil green algae (Chlorophyta), Syst. Biol., vol. 54, no. 6, pp ,
[20] D. E. Soltis, M. A. Gitzendanner, and P. S. Soltis, A 567-taxon data set for angiosperms: The challenges posed by Bayesian analyses of large data sets, Int. J. Plant Sci., vol. 168, no. 2, pp ,
[21] D. E. Soltis, P. S. Soltis, M. W. Chase, M. E. Mort, D. C. Albach, M. Zanis, V. Savolainen, W. H. Hahn, S. B. Hoot, M. F. Fay, M. Axtell, S. M. Swensen, L. M. Prince, W. J. Kress, K. C. Nixon, and J. S. Farris, Angiosperm phylogeny inferred from 18S rDNA, rbcL, and atpB sequences, Botanical Journal of the Linnean Society, vol. 133, pp ,
[22] O. Bininda-Emonds, Ratchet implementation in PAUP*4.0b10, 2003, available from Emonds.
[23] D. F. Robinson and L. R. Foulds, Comparison of phylogenetic trees, Mathematical Biosciences, vol. 53, pp , 1981.