

Using Decision Trees to Study the Convergence of Phylogenetic Analyses

Grant Brammer and Tiffani L. Williams

Abstract: In this paper, we explore the novel use of decision trees to study the convergence properties of phylogenetic analyses. A decision tree is constructed from the evolutionary relationships (or bipartitions) found in the evolutionary trees returned from a phylogenetic analysis. We treat the evolutionary trees returned from multiple runs of a phylogenetic analysis as different classes. Then, we use the depth of the decision tree to measure how distinct the runs are from each other. Shallow decision trees reflect non-convergence since the evolutionary trees can be classified with little information. Deep decision trees reflect convergence. We study Bayesian and maximum parsimony phylogenetic analyses consisting of thousands of trees. For some datasets studied here, a single distinguishing bipartition can classify the entire tree collection, suggesting non-convergence of the underlying phylogenetic analysis. Thus, we believe that decision trees lead to new insights with the potential for helping biologists reconstruct more robust evolutionary trees.

I. INTRODUCTION

Given a collection of organisms (or taxa), the objective of a phylogenetic analysis is to produce an evolutionary tree describing the genealogical relationships between the taxa. Bayesian inference, as implemented in the MrBayes [1] software package, is one of the most popular approaches for reconstructing evolutionary trees. Although Bayesian inference is very powerful, closed-form solutions are unlikely given the problem sizes of interest (hundreds to thousands of taxa, eventually scaling to reconstructing the Tree of Life, which is estimated to contain between 10 and 100 million taxa).
Markov Chain Monte Carlo (MCMC) sampling is a very versatile yet computationally intensive procedure, which produces samples of parameter values from the posterior distribution. Unfortunately, the main problem with MCMC simulations is assessing whether they have converged. The resulting samples come from the true distribution only after convergence [2]. For phylogenetic inference, non-convergence implies that the MCMC sampling did not sample from the distribution of the true evolutionary tree, which leads to an inaccurate estimate of the true tree.

In this paper, we use decision tree learning as a novel approach to study whether a phylogenetic analysis has converged. A phylogenetic analysis takes as input a set of molecular sequences and typically outputs thousands of trees. Our objective is to examine these output trees for evidence that the phylogenetic analysis converged. Traditionally, life scientists take the collection of output trees produced by a phylogenetic analysis and summarize them with a single consensus tree. However, a single consensus tree under-represents the information in the data [3], [4]. Furthermore, consensus trees provide little insight when studying the convergence properties of the underlying phylogenetic analysis.

Grant Brammer and Tiffani Williams are with the Department of Computer Science and Engineering, Texas A&M University, College Station, Texas, USA ({grb, tlw}@cse.tamu.edu).

Many phylogenetic methods employ computational intelligence techniques to infer the true evolutionary tree. In our work, we use a traditional machine learning approach (decision trees) to detect whether a phylogenetic analysis converged. In our construction of decision trees, the evolutionary relationships contained in the phylogenetic trees are the interior nodes of the decision trees. The leaves of a decision tree reflect the run that produced the trees of interest.
Since the most popular phylogenetic techniques (such as Bayesian inference) attempt to solve NP-hard optimization problems, multiple runs of a phylogenetic algorithm or heuristic are executed in order to escape local optima and ensure convergence. However, there are no good measures for whether either of these goals is achieved by running a phylogenetic technique multiple times. We believe decision trees can assist in this regard. The depth of a decision tree is the crucial feature for determining whether a phylogenetic analysis, consisting of multiple runs, has converged. We define the depth of a decision tree to be the length of the longest path from the root to a leaf node. Deep decision trees reflect a high level of sharing among the evolutionary trees since many evolutionary relationships have to be consulted in order to classify which runs produced the evolutionary trees. Shallow decision trees reflect less sharing of information across the runs. Hence, deep decision trees provide strong evidence that a phylogenetic analysis converged, while extremely shallow trees reflect non-convergence.

Our work leverages the idea that the depth of a decision tree can be used to measure the quality of a cluster. A similar idea has been used to develop new clustering methods [5]. Decision trees and feature selection are well-researched areas in the machine learning community, but they have not yet been thoroughly applied to phylogenetic datasets. In many ways, the work most similar in methods to ours is that which attempts to measure the quality of clusters produced by clustering phylogenetic trees. Stockham, Wang, and Warnow [4] cluster phylogenetic trees and use the concept of information loss as a measure of cluster quality, but do not address the issue of convergence. Unlike our work, most techniques for detecting convergence use tree scores [6], but rarely do they compare the evolutionary relationships contained in the trees to each other.
Comparing tree scores can be misleading since trees with similar scores are not necessarily close in parameter (tree) space [7], [8], [9], [10].

In our experiments, we study three biological tree collections consisting of (i) 4,898, (ii) 20,000, and (iii) 33,306 evolutionary trees. The smallest tree collection was the result of using two maximum parsimony techniques, parsimony ratchet [11] and Rec-I-DCM3. The larger collections, which come from published analyses and were obtained from life scientists, were the result of Bayesian analyses. Using the depth of a decision tree as our convergence criterion, our results show that the maximum parsimony analysis demonstrated convergence while both Bayesian analyses showed non-convergence. Surprisingly, non-convergence was depicted by a decision tree of depth one. For example, a single evolutionary relationship can be used to classify the whole 20,000-tree Bayesian dataset. Thus, we believe that decision trees lead to new insights with the potential for helping biologists reconstruct more robust evolutionary trees.

II. BASICS

A. Evolutionary Trees and Their Newick Representation

An evolutionary tree (or phylogeny) is a depiction of the evolutionary relationships between a set of taxa (or organisms). Fig. 1 shows example phylogenetic trees on six taxa. A phylogenetic tree is uniquely defined by its set of bipartitions (or edges). Removing an edge from the tree partitions the taxa into two sets. In Fig. 1, consider bipartition b0 in tree t0. It partitions the set of taxa into two groups: taxa a and b on one side and taxa c, d, e, and f on the other side. We represent this bipartition as ab|cdef, which is also shared by trees t1 and t2. In this paper, we only consider nontrivial bipartitions (internal edges). For a given set of n taxa, every tree also has n trivial bipartitions relating to external edges (i.e., edges connecting directly to taxa).
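As a minimal sketch of the bipartition idea above, the following Python snippet extracts the nontrivial bipartitions of the three trees in Fig. 1. The nested-tuple encoding of a tree and the function names `leaves` and `nontrivial_bipartitions` are assumptions made for illustration; they are not part of the paper. Each bipartition is stored as the side that does not contain the alphabetically smallest taxon, so that ab|cdef and cdef|ab map to the same key.

```python
def leaves(node):
    """Return the set of taxa below a node of a nested-tuple tree."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))


def nontrivial_bipartitions(tree):
    """Return the nontrivial bipartitions (internal edges) of a tree."""
    taxa = leaves(tree)
    anchor = min(taxa)  # reference taxon used to pick a canonical side
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaves(node)
        other = taxa - side
        # Keep only splits with at least two taxa on each side.
        if len(side) > 1 and len(other) > 1:
            found.add(other if anchor in side else side)
        for child in node:
            visit(child)

    visit(tree)
    return found


# The three trees of Fig. 1 (t0, t1, t2), written as nested tuples.
t0 = ((("a", "b"), ("c", "d")), ("e", "f"))
t1 = ((("a", "b"), ("c", "e")), ("d", "f"))
t2 = ((("a", "b"), ("c", "f")), ("d", "e"))

shared = (nontrivial_bipartitions(t0)
          & nontrivial_bipartitions(t1)
          & nontrivial_bipartitions(t2))
print(sorted("".join(sorted(side)) for side in shared))
```

Under these assumptions, the only bipartition shared by all three trees is the one whose stored side is cdef, that is, ab|cdef (b0).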
Most phylogenetic trees are unrooted since determining the root is a complex procedure. The Newick format [12] is the most widely used format to store a phylogenetic tree in a file. In this format, the topology of the evolutionary tree is represented using a notation based on balanced parentheses. Consider the evolutionary tree t0 in Fig. 1. A Newick representation of the unrooted topology of this tree is (((a, b), (c, d)), (e, f));, where ; symbolizes the end of the Newick string. Matching pairs of parentheses symbolize internal nodes in the evolutionary tree. The Newick representation of a tree is not unique. For example, another valid Newick string for tree t0 is ((e, f), ((d, c), (b, a)));.

B. Reconstructing Evolutionary Trees

In order to study the convergence properties of phylogenetic analyses using decision trees, we consider two types of phylogenetic analyses: Bayesian and maximum parsimony inference. In our experiments, MrBayes [1], a very popular software package for phylogenetics, is used to reconstruct the evolutionary trees based on Bayesian inference. Under maximum parsimony, we use two hill-climbing heuristics, Rec-I-DCM3 [13] and parsimony ratchet [11], to reconstruct the phylogenies.

a) Bayesian inference: The Bayesian approach is based on a quantity called the posterior probability of a tree. Bayes' theorem

\[ \Pr(\text{Tree} \mid \text{Data}) = \frac{\Pr(\text{Data} \mid \text{Tree}) \, \Pr(\text{Tree})}{\Pr(\text{Data})} \]

is used to combine the prior probability of a phylogeny, Pr(Tree), with the likelihood, Pr(Data | Tree), to produce a posterior probability distribution on trees, Pr(Tree | Data). The posterior probability represents the probability that the tree is correct. Inferences about the history of the group are then based on the posterior probability of trees. Typically all trees are considered a priori equally probable, and the likelihood is calculated using a substitution model of evolution.
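To make the role of each term in Bayes' theorem concrete, here is a small Python sketch that normalizes likelihoods over a handful of candidate trees. It is a deliberate simplification: it assumes the set of trees is small enough to enumerate and that each log-likelihood has already been computed, so it ignores branch lengths and substitution model parameters. The function name `posterior`, the tree names, and the log-likelihood values are all hypothetical.

```python
import math


def posterior(log_likelihoods, priors=None):
    """Map tree name -> Pr(Tree | Data), given log Pr(Data | Tree) per tree."""
    names = list(log_likelihoods)
    if priors is None:
        # All trees are considered a priori equally probable.
        priors = {name: 1.0 / len(names) for name in names}
    log_joint = {n: log_likelihoods[n] + math.log(priors[n]) for n in names}
    peak = max(log_joint.values())
    # Pr(Data) is the sum of the joint terms; subtracting the peak
    # before exponentiating avoids floating-point underflow.
    evidence = sum(math.exp(value - peak) for value in log_joint.values())
    return {n: math.exp(log_joint[n] - peak) / evidence for n in names}


# Hypothetical log-likelihoods for three candidate trees.
example = posterior({"T1": -1000.0, "T2": -1001.0, "T3": -1005.0})
for name, probability in example.items():
    print(name, round(probability, 4))
```

With a uniform prior, the prior cancels and each posterior probability is simply the tree's likelihood divided by the sum of all likelihoods, which is why the denominator is the hard part once the trees can no longer be enumerated.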
Computing the posterior probability involves a summation over all trees and, for each tree, integration over all possible combinations of branch length and substitution model parameter values. A number of numerical methods are available to approximate the posterior probability. The most common is Markov Chain Monte Carlo (MCMC). For the phylogeny problem, the MCMC algorithm involves two steps. First, a new tree is proposed by stochastically perturbing the current tree. Afterwards, the tree is either accepted or rejected with a probability described by Metropolis et al. [14] and Hastings [15]. If the new tree is accepted, then it is subjected to more perturbations. For a properly constructed and adequately run Markov chain, the proportion of the time that any tree is visited is a valid approximation of the posterior probability of that tree.

b) Maximum parsimony: The maximum parsimony (MP) optimization criterion for inferring the evolutionary history of different taxa assumes that each of the taxa in the input is represented by a string over some alphabet. The symbols in the alphabet can represent nucleotides (in which case the inputs are DNA or RNA sequences) or amino acids (in which case the inputs are protein sequences), or may even include discrete characters for morphological properties. It is also assumed that the strings are put into a multiple alignment, so that they all have the same length. Maximum parsimony then seeks a tree, along with inferred ancestral sequences, that minimizes the total number of evolutionary events, counting only point mutations.

Most of the powerful heuristics [16], [11], [17] for solving the maximum parsimony problem incorporate hill climbing in their design. Hill-climbing heuristics take an initial estimate of the phylogeny (e.g., a user-provided, random, or random sequence addition tree) and rearrange branches in it to reach neighboring trees.
If a rearrangement yields a better-scoring tree, it becomes the new best tree and is then submitted to a new round of rearrangements. The process continues until no better tree can be found in a full round.
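The score that such heuristics try to minimize can be illustrated for a fixed tree. The sketch below uses Fitch's small-parsimony algorithm, a standard method that the paper does not name, so treat its use here as an assumption about how a parsimony score might be computed rather than as what PAUP* or Rec-I-DCM3 do internally. The nested-tuple tree encoding, the function names, and the two-site toy alignment are hypothetical.

```python
def fitch_site(node, column):
    """Return (candidate ancestral states, number of changes) for one site."""
    if isinstance(node, str):
        return {column[node]}, 0
    (left, n_left), (right, n_right) = [fitch_site(child, column) for child in node]
    common = left & right
    if common:
        # The children agree on at least one state: no extra change needed.
        return common, n_left + n_right
    # The children disagree: one point mutation is counted on this node.
    return left | right, n_left + n_right + 1


def parsimony_score(tree, sequences):
    """Sum the Fitch change counts over every column of the alignment."""
    length = len(next(iter(sequences.values())))
    return sum(
        fitch_site(tree, {taxon: seq[i] for taxon, seq in sequences.items()})[1]
        for i in range(length)
    )


# Toy alignment on the six taxa of Fig. 1 (made-up sequences).
sequences = {"a": "AA", "b": "AA", "c": "CA", "d": "CA", "e": "GT", "f": "GT"}
t0 = ((("a", "b"), ("c", "d")), ("e", "f"))
t4 = ((("a", "c"), ("b", "d")), ("e", "f"))

print(parsimony_score(t0, sequences))  # 3
print(parsimony_score(t4, sequences))  # 4
```

A hill-climbing heuristic would call a scoring function of this kind on each neighboring tree and keep the rearrangement only when the score decreases; in this toy example, t0 would be preferred over t4.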

Fig. 1. Three evolutionary trees (t0, t1, and t2) on six taxa. The taxa are labeled a, b, ..., f. The bipartitions (b0, b9, b10, b11, b12, b13, and b14) are labeled according to Fig. 2(b).

C. Consensus Trees

Both Bayesian and maximum parsimony inference can easily produce thousands of trees. Consensus trees summarize the information contained in a set of evolutionary trees whose leaves are all from the same set of taxa. Strict consensus is the most conservative of the consensus methods and produces a tree with only those phylogenetic relationships that are supported by all the source trees. A majority rule consensus tree contains those relationships that appear in more than half of the source trees. Consider the three trees in Fig. 1. The bipartition b0 appears in each of the trees. As a result, the unrooted Newick representation of the strict consensus tree is ((a, b), (c, d, e, f));. For the majority tree, the only bipartition that appears in more than half of the trees is b0. Hence, for this example, the strict and majority consensus trees for the phylogenies in Fig. 1 are the same. Note also that every bipartition in a strict consensus tree is also contained in the majority tree.

III. CONSTRUCTING DECISION TREES FOR PHYLOGENETIC ANALYSES

We now give our approach for constructing decision trees for a collection of k evolutionary trees obtained from a phylogenetic analysis. Our approach consists of two steps. First, we construct a bipartition table that consists of the unique bipartitions associated with our set of k phylogenetic trees. The bipartition table also contains class labels identifying the run of the phylogenetic search heuristic that produced each of the k evolutionary trees. The second step uses the bipartition table to create the decision tree with the ID3 (Iterative Dichotomiser 3) data-mining algorithm [18].

A. Step 1: Constructing a Bipartition Table

Consider the set of twelve evolutionary trees depicted in Fig. 2(a). First, we enumerate all of the bipartitions in each of the evolutionary trees. For example, tree t0, which is shown pictorially in Fig. 1, has bipartitions ab|cdef, cd|abef, and ef|abcd, and they are given bipartition identities (BIDs) b0, b9, and b14, respectively. The BIDs for the unique bipartitions can be assigned in any manner. In our example, the bipartitions are sorted in lexicographical order and the BIDs are assigned accordingly. Fig. 2(b) shows the set of unique bipartitions for the twelve trees in our example.

Each unique bipartition, b_i, represents a column in the bipartition table. The rows of the table consist of the input tree ids (t_i) of interest. For each column labeled b_i, a 0 or 1 is assigned to each row based on whether tree t_i contains that bipartition. Consider Table I. Bipartition b0 appears in trees t0, t1, t2, and t3. It does not appear in the remaining eight trees.

Once each bipartition (or feature) is encoded for each of the rows, the final step is to encode the known class labels for each tree. In our study, there are two class labels of interest: run and algorithm. Oftentimes, a phylogenetic technique is executed multiple times in order to improve the convergence rate of the underlying phylogenetic analysis. For each run, we know which trees were produced by it. In addition to running a phylogenetic technique multiple times, a life scientist might also execute different phylogenetic techniques or algorithms. Hence, we could use the type of algorithm used to produce a tree as a class label. In the bipartition table shown in Table I, we only use a run class label, where we assume that two different runs of a phylogenetic algorithm produced the set of twelve trees.

(a) Newick strings
t0: (((a, b), (c, d)), (e, f));
t1: (((a, b), (c, e)), (d, f));
t2: (((a, b), (c, f)), (d, e));
t3: (((a, b), (c, d)), (e, f));
t4: (((a, c), (b, d)), (e, f));
t5: (((a, e), (b, d)), (c, f));
t6: (((a, d), (b, c)), (e, f));
t7: (((a, d), (b, c)), (e, f));
t8: (((a, d), (b, f)), (c, e));
t9: (((a, f), (b, c)), (d, e));
t10: (((a, f), (b, d)), (c, e));
t11: (((a, e), (b, c)), (d, f));

(b) Unique bipartitions
b0: ab|cdef
b1: ac|bdef
b2: ad|bcef
b3: ae|bcdf
b4: af|bcde
b5: bc|adef
b6: bd|acef
b7: be|acdf
b8: bf|acde
b9: cd|abef
b10: ce|abdf
b11: cf|abde
b12: de|abcf
b13: df|abce
b14: ef|abcd

Fig. 2. Twelve evolutionary trees used as input to build the bipartition table in Table I. (a) shows the Newick representation of the twelve phylogenetic trees of interest. (b) provides a listing of the unique bipartitions that appear across the twelve trees.

TABLE I. A bipartition table depicting the presence (1) or absence (0) of each bipartition (labeled b_i) in the twelve trees shown in Fig. 2, along with a column denoting the run that generated the tree. The bipartitions are our features and the run is our class label since we know which trees a particular run generated. In this example, there are two runs that generated the twelve trees. A bipartition table serves as input for the construction of a decision tree.

Trees  b0 b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 b13 b14  run label
t0      1  0  0  0  0  0  0  0  0  1   0   0   0   0   1  r1
t1      1  0  0  0  0  0  0  0  0  0   1   0   0   1   0  r1
t2      1  0  0  0  0  0  0  0  0  0   0   1   1   0   0  r1
t3      1  0  0  0  0  0  0  0  0  1   0   0   0   0   1  r1
t4      0  1  0  0  0  0  1  0  0  0   0   0   0   0   1  r1
t5      0  0  0  1  0  0  1  0  0  0   0   1   0   0   0  r1
t6      0  0  1  0  0  1  0  0  0  0   0   0   0   0   1  r2
t7      0  0  1  0  0  1  0  0  0  0   0   0   0   0   1  r2
t8      0  0  1  0  0  0  0  0  1  0   1   0   0   0   0  r2
t9      0  0  0  0  1  1  0  0  0  0   0   0   1   0   0  r2
t10     0  0  0  0  1  0  1  0  0  0   1   0   0   0   0  r2
t11     0  0  0  1  0  1  0  0  0  0   0   0   0   1   0  r2

B. Step 2: Building a Decision Tree

Once the bipartition table is constructed, we use our implementation of the ID3 algorithm, which is a top-down greedy approach that selects features based on information gain. For our problem, the features are bipartitions, which are the internal nodes of the decision tree. Class labels (run or algorithm) are the leaves of the decision tree. Class labels are also considered target features. The ID3 algorithm takes a feature set (the presence or absence of each bipartition) and a target feature (i.e., which run the evolutionary trees came from) and computes a decision tree, where each internal node represents a bipartition and each edge represents the presence or absence of that bipartition.

Fig. 3. A decision tree constructed based on the input from Table I. The root is b0 (depth 0): b0 = 1 leads to leaf r1, and b0 = 0 leads to node b6 (depth 1). At b6, a value of 0 leads to leaf r2 and a value of 1 leads to node b4 (depth 2). At b4, a value of 1 leads to leaf r2 and a value of 0 leads to leaf r1 (depth 3). Here, 3 bipartitions (out of 15) are needed to classify the twelve input trees by the run that generated them.

Table I shows an example bipartition table as derived from the Newick strings in Fig. 2(a). Note that the last column in the bipartition table is the run label and not bipartition information. The ID3 algorithm computes the information gain of each bipartition relative to the run label. Information gain represents how well the bipartition information correlates with the run labels. The bipartition with the best information gain is selected and added as the root of the decision tree. In our example, bipartition b0 is the root (see Fig. 3). All the trees with bipartition b0 come from run r1. Hence, the right child node is marked as run r1 and has no children.
The trees that do not have bipartition b0 come from both runs, so the process repeats on those trees to find the bipartition with the most information gain. The process continues until all the branches terminate in leaf nodes. Fig. 3 shows the resulting decision tree.

Equation (1) measures the entropy of a collection of trees.

\[ \mathrm{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i \tag{1} \]

For a given set S of examples, the sum iterates over the c different example values, with p_i being the proportion of the examples that have the i-th value. To compute the entropy of the run labels, S is the rightmost column of Table I. We would iterate over the labels r1 and r2, with p_i being the proportion of each label.

Equation (2) computes information gain, where S is the set of examples and A is an attribute.

\[ \mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} \, \mathrm{Entropy}(S_v) \tag{2} \]

Values(A) is the set of values that the attribute can have. For each v in Values(A), S_v is the subset of S for which attribute A has value v.
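The two steps above can be sketched in a few lines of Python. This is an illustrative reimplementation under stated assumptions, not the authors' code: the bipartition table is hard-coded from the Newick strings in Fig. 2(a) as sets of bipartition ids, ties in information gain are broken in favor of the lowest bipartition id (the paper does not say how ties are handled), and the names `ROWS`, `entropy`, `gain`, `id3`, and `depth` are hypothetical.

```python
import math
from collections import Counter

# Table I as (set of bipartition ids present, run label), for trees t0..t11.
ROWS = [
    ({0, 9, 14}, "r1"), ({0, 10, 13}, "r1"), ({0, 11, 12}, "r1"),
    ({0, 9, 14}, "r1"), ({1, 6, 14}, "r1"), ({3, 6, 11}, "r1"),
    ({2, 5, 14}, "r2"), ({2, 5, 14}, "r2"), ({2, 8, 10}, "r2"),
    ({4, 5, 12}, "r2"), ({4, 6, 10}, "r2"), ({3, 5, 13}, "r2"),
]
FEATURES = list(range(15))  # bipartition ids b0..b14


def entropy(labels):
    """Equation (1): minus the sum of p_i * log2(p_i) over the class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def gain(rows, bid):
    """Equation (2): entropy reduction from splitting on bipartition `bid`."""
    remainder = 0.0
    for present in (False, True):
        subset = [run for bids, run in rows if (bid in bids) == present]
        if subset:
            remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([run for _, run in rows]) - remainder


def id3(rows, features):
    """Return a leaf label, or a (bid, absent_subtree, present_subtree) tuple."""
    labels = {run for _, run in rows}
    if len(labels) == 1:
        return labels.pop()
    # Only bipartitions that actually split the remaining trees are useful.
    candidates = [b for b in features
                  if 0 < sum(b in bids for bids, _ in rows) < len(rows)]
    if not candidates:
        return "NS"  # identical trees carry different labels: non-separable
    # max() keeps the first maximum, so ties go to the lowest bipartition id.
    best = max(candidates, key=lambda b: round(gain(rows, b), 12))
    absent = [row for row in rows if best not in row[0]]
    present = [row for row in rows if best in row[0]]
    return (best, id3(absent, features), id3(present, features))


def depth(node):
    """Length of the longest path from the root to a leaf."""
    if isinstance(node, str):
        return 0
    return 1 + max(depth(node[1]), depth(node[2]))


tree = id3(ROWS, FEATURES)
print(tree)         # (0, (6, 'r2', (4, 'r1', 'r2')), 'r1')
print(depth(tree))  # 3
```

With this tie-breaking rule the sketch selects b0, then b6, then b4, and reports a depth of 3, which agrees with the decision tree described for Fig. 3. A different tie-breaking rule could select b5 at the root instead, since b0 and b5 have the same information gain on this table.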

IV. OUR DATASETS

The biological trees used in this study were obtained from two recent Bayesian analyses and a maximum parsimony (MP) analysis, which we describe below. The evolutionary trees from Datasets #1 and #2 were obtained from published analyses performed by biologists. The trees from Dataset #3 were obtained by us using well-recognized MP heuristics.

A. Dataset #1: 20,000 Trees on 150 Taxa

We received 20,000 trees from a Bayesian analysis of an alignment of 150 taxa (23 desert taxa and 127 others from freshwater, marine, and soil habitats) with 1,651 aligned sites [19]. Two independent runs consisting of 25 million generations (trees were sampled every 1,000 generations) were performed using the GTR+I+Γ model in MrBayes with four independent chains. The authors constructed a majority consensus tree in their study using the 20,000 trees sampled during the last 10 million generations of the two runs (10,000 trees per run).

B. Dataset #2: 33,306 Trees on 567 Taxa

We obtained 33,306 trees from a Bayesian analysis of a three-gene, 567 taxa (560 angiosperms, seven outgroups) dataset with 4,621 aligned characters, which is one of the largest Bayesian analyses done to date [20]. Twelve runs, with four chains each, using the GTR+I+Γ model in MrBayes ran for at least 10 million generations. Trees were sampled every 1,000 generations. The authors discuss the difficulties with combining trees from multiple runs. To obtain our collection of 33,306 trees, we discarded the trees from the first 8 million generations.

C. Dataset #3: 4,898 Trees on 567 Taxa

We inferred 4,898 trees from a maximum parsimony (MP) analysis of a set of 567 three-gene (rbcl, atpb, and 18s) aligned DNA sequences (2,153 sites) of angiosperms [21]. We used two MP algorithms, parsimony ratchet [11] and Rec-I-DCM3, to obtain the evolutionary trees. Each MP algorithm created 5,000 trees for a total of 10,000 trees over 567 taxa.
The parsimony ratchet algorithm used in this paper is called Pauprat since we used a Perl script by Bininda-Emonds [22] to generate a PAUP* [17] batch file to run the parsimony ratchet heuristic. However, we were not interested in all 10,000 trees, only in those that are near-optimal. Biologists typically use only the top-scoring (or near-optimal) trees in their published phylogenetic studies. In our experiments, we use parsimony trees that are step0, step1, and step2 away from the best-known maximum parsimony score for this dataset, which is 44,165. Let x represent the parsimony score of a tree t_i and let b be the best-known score. Then, tree t_i is x - b steps away from the best score. In our experiments, step0, step1, and step2 represent trees that are 0, 1, and 2 steps away from the best score, b, respectively. Between the two algorithms, there are 4,898 trees that meet this criterion.

V. RESULTS AND DISCUSSION

In our experiments, we compute the depths of the decision trees built from our collections of phylogenetic trees to study convergence. Shallow decision trees may indicate that a phylogenetic analysis did not converge. Deeper decision trees indicate convergence.

A. Dataset #1: 150 Taxa Bayesian Trees

As explained in Section IV-A, two runs produced the 20,000 evolutionary trees for this dataset. Each run, r0 and r1, consisted of 10,000 trees. Surprisingly, after applying the ID3 algorithm to these trees to compare the two runs, a decision tree of depth one was created. In other words, there exists a single bipartition b_i that appears in all of the evolutionary trees from run r0, but b_i does not appear in any evolutionary tree from run r1. Hence, a single bipartition can distinguish between the two sets of 10,000 evolutionary trees! We refer to such a bipartition as a distinguishing bipartition.
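A distinguishing bipartition, as defined above, can be found without building a decision tree at all: it is any bipartition that occurs in every tree of one run and in no tree of the other. The Python sketch below shows one way to compute this; the function name `distinguishing_bipartitions`, the representation of each tree as a set of integer bipartition ids, and the two toy runs are assumptions made for illustration.

```python
def distinguishing_bipartitions(run_a, run_b):
    """Return the bipartition ids found in all trees of one run and none of the other.

    Each run is a non-empty list of trees; each tree is a set of bipartition ids.
    """
    in_all_a = set.intersection(*run_a)
    in_any_a = set.union(*run_a)
    in_all_b = set.intersection(*run_b)
    in_any_b = set.union(*run_b)
    return (in_all_a - in_any_b) | (in_all_b - in_any_a)


# Toy runs: bipartition 1 is in every tree of run_a and absent from run_b,
# bipartition 5 is in every tree of run_b and absent from run_a,
# and bipartition 2 is shared by all trees, so it distinguishes nothing.
run_a = [{1, 2, 3}, {1, 2, 4}]
run_b = [{2, 5, 6}, {2, 5, 7}]
print(sorted(distinguishing_bipartitions(run_a, run_b)))  # [1, 5]
```

Whenever this set is non-empty, a decision tree of depth one exists for the pair of runs, because a single test on any of the returned bipartitions classifies every tree.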
Given that a single bipartition can classify the two phylogenetic runs, there is poor mixing of solutions from the different runs, which is symptomatic of non-convergence. Even more interesting is that, for this dataset, distinguishing bipartitions would not appear in the strict or majority consensus tree. Since each run contains an equal number of trees, a distinguishing bipartition for this dataset cannot appear in more than half of the trees, which is a requirement for appearing in the majority tree. If a bipartition is not in the majority tree, it cannot appear in the strict consensus tree, which requires the bipartition to appear in all 20,000 of the trees. Information related to distinguishing bipartitions would therefore be overlooked by a consensus-tree-based analysis.

A decision tree of depth one means that there is at least one distinguishing bipartition, but how many distinguishing bipartitions exist for this dataset? For the 150 taxa dataset, there are four bipartitions that one run contains but the other does not. Two distinguishing bipartitions appeared in every evolutionary tree in run r0 but never in run r1, and two bipartitions appeared in every tree in run r1 but never in run r0. Distinguishing bipartitions represent phylogenetic information that is strong in one region of tree space (returned every time in run r0) and weak in others (returned none of the time in run r1). If the phylogenetic analysis had been executed only once, there would have been very strong support or lack of support (depending on the run) for these bipartitions.

B. Dataset #2: 567 Taxa Bayesian Trees

Next, we turn to the 567 taxa set of 33,306 Bayesian trees. As explained in Section IV-B, these evolutionary trees were collected over twelve runs of MrBayes [1], one of the most popular software packages for reconstructing evolutionary trees. Table II shows the number of trees that were produced by each run.
Similarly to the 150 taxa dataset of Bayesian trees, our target feature is the run label. In our experiments, we created decision trees to measure the similarity of each pair of runs. The upper triangle of Table III shows the depths of these decision trees.

TABLE II. The number of 567 taxa Bayesian trees in each run (r0 through r11). There are 12 total runs and 33,306 total trees. [Per-run counts are not recoverable from this transcription.]

TABLE III. By comparing pairs of Bayesian runs (r0 through r11), the upper triangle shows the depths of the constructed decision trees. The lower triangle provides the number of different bipartitions that can account for a decision tree of depth one. Decision trees of depths other than one are shaded in gray. [Table entries are not recoverable from this transcription.]

TABLE IV. The upper triangle of this table shows the number of unique bipartitions when comparing pairs of Bayesian runs (r0 through r11). The lower triangle depicts the percentage of total bipartitions used in the decision tree. [Table entries are not recoverable from this transcription.]

Similarly to the 150 taxa Bayesian trees, there are distinguishing bipartitions. Given that there are twelve runs (instead of two), there are many decision trees of depth one. The few values greater than one are shaded in gray in the table. For such run comparisons, a single distinguishing bipartition cannot classify the two sets of trees. For example, classifying runs r1 and r5 requires looking at 11 bipartitions (a decision tree of depth 11) in order to determine the run that generated one of the 6,163 trees. Although the two runs do not contain the same trees, there are no bipartitions that appeared in all of the trees of one run and none of the other. The decision must be made by looking at a combination of bipartitions instead of a single bipartition. Deeper decision trees require more bipartitions to distinguish the sets of evolutionary trees, which in turn reflects good mixing of the underlying phylogenetic information (bipartitions) and is a sign of convergence between the two runs.

If two runs contained identical trees, it would be impossible for the ID3 algorithm to separate the trees into two different classes based on the same bipartition information. Hence, the fact that we were able to produce these decision trees means that no two runs returned the same tree.

Next, we count the number of distinguishing bipartitions that create decision trees of depth one. The lower triangle of Table III shows the results. For example, there are 41 distinguishing bipartitions separating runs r2 and r6. The presence of distinguishing bipartitions suggests that the phylogenetic analysis that produced these trees did not converge. There are strong distinctions between the bipartitions reported across runs. Moreover, the various runs appear to be stuck in local optima and return results influenced by their independent areas of the exponentially-sized tree space. Table IV shows that the number of distinguishing bipartitions increases drastically when comparing pairs of runs.

C. Dataset #3: 567 Taxa Maximum Parsimony Trees

For our final analysis, we apply decision trees to evolutionary trees obtained from two maximum parsimony (MP) heuristics, Rec-I-DCM3 and Pauprat. Table V shows the number of evolutionary trees collected from each run. Each of the MP heuristics required five runs to produce its set of evolutionary trees. Runs r0, r1, r2, r3, and r4 were generated by Pauprat while the remaining runs were obtained from Rec-I-DCM3. Similarly to the Bayesian tree datasets, Table VI provides the depths of the resulting decision trees using runs as the target feature. The results show that the Rec-I-DCM3 runs are quite self-similar. Many of the run-by-run comparisons are non-separable, which we denote by the label NS. In other words, the two runs being compared returned at least one identical tree. As a result, the bipartition information cannot separate the evolutionary trees across the runs. In comparison to the Bayesian trees, the MP trees are more similar.
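The non-separable (NS) case can be checked directly, before any decision tree is built: two runs are non-separable exactly when some tree, viewed as a set of bipartitions, occurs in both. The following Python sketch makes that test explicit. The function name `non_separable` and the toy runs are hypothetical, and each tree is assumed to be represented by the set of its bipartition ids.

```python
def non_separable(run_a, run_b):
    """Return True when the two runs share at least one identical tree."""
    trees_a = {frozenset(tree) for tree in run_a}
    # isdisjoint accepts any iterable, so run_b is converted lazily.
    return not trees_a.isdisjoint(frozenset(tree) for tree in run_b)


# Toy runs: the tree {0, 9, 14} occurs in both run_x and run_y.
run_x = [{0, 9, 14}, {0, 10, 13}]
run_y = [{2, 5, 14}, {0, 9, 14}]
run_z = [{2, 5, 14}, {4, 6, 10}]

print(non_separable(run_x, run_y))  # True
print(non_separable(run_x, run_z))  # False
```

Running such a check first is a cheap way to label a pair of runs NS, since no sequence of bipartition tests can assign two different run labels to one and the same set of bipartitions.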
Given the depths of the resulting decision trees, the MP trees represent a better example of convergence than their Bayesian counterparts. Based on the percentages of bipartitions used to create the decision trees (the lower triangles of Tables IV and VII), the parsimony trees are up to two orders of magnitude more similar to each other than their Bayesian counterparts.

VI. CONCLUSIONS AND FUTURE WORK

Determining whether a phylogenetic analysis has converged is an important problem in phylogenetics. Without convergence, the robustness of a phylogenetic analysis is unclear, which can lead to inaccurate hypotheses of how the taxa evolved from a common ancestor. Popular phylogenetic approaches often leverage computational intelligence techniques to infer the true evolutionary tree. In this paper, we have shown how to use the depth of a decision tree (a traditional machine learning technique) as a measure of convergence for a phylogenetic analysis. Currently, there are relatively few techniques available to help biologists determine whether their analyses have converged. In addition to decision tree depth, the novelty of our work is in using bipartitions as the foundation for determining whether there is sufficient mixing of information to justify convergence.

In our study of three large biological tree collections obtained from Bayesian and maximum parsimony analyses, we showed that non-convergence was a property of the Bayesian analyses, which resulted in decision trees of depth one. That is, one bipartition is sufficient to classify thousands of trees into two groups (or runs). The maximum parsimony trees, on the other hand, resulted in decision trees of high depth, which is a requirement for phylogenetic convergence. We believe decision trees are a step forward in terms of defining concrete measures of convergence that biologists can use to build more robust and accurate evolutionary trees.
Moreover, our results could potentially be used to help design better phylogenetic heuristics, especially with respect to avoiding entrapment in local optima, which is a major source of non-convergence. Our future work includes studying more datasets with our new convergence measure, incorporating tree scores into the convergence framework, comparing decision trees to topological measures such as the Robinson-Foulds distance [23], and expanding our understanding of distinguishing bipartitions.

VII. ACKNOWLEDGMENTS

Funding for this project was provided by the National Science Foundation under grants DEB and IIS.

REFERENCES

[1] J. P. Huelsenbeck and F. Ronquist, "MRBAYES: Bayesian inference of phylogenetic trees," Bioinformatics, vol. 17, no. 8.
[2] J. Peltonen, J. Venna, and S. Kaski, "Visualizations for assessing convergence and mixing of Markov chain Monte Carlo simulations," Comput. Stat. Data Anal., vol. 53, no. 12.
[3] D. M. Hillis, T. A. Heath, and K. St. John, "Analysis and visualization of tree space," Syst. Biol., vol. 54, no. 3.
[4] C. Stockham, L. S. Wang, and T. Warnow, "Statistically based postprocessing of phylogenetic analysis by clustering," in Proceedings of 10th Int'l Conf. on Intelligent Systems for Molecular Biology (ISMB '02), 2002.
[5] B. Liu, Y. Xia, and P. S. Yu, "Clustering through decision tree construction," in CIKM '00: Proceedings of the Ninth International Conference on Information and Knowledge Management. New York, NY, USA: ACM, 2000.
[6] A. Rambaut and A. J. Drummond, "Tracer v1.4." [Online].
[7] J. P. Huelsenbeck, B. Larget, R. Miller, and F. Ronquist, "Potential applications and pitfalls of Bayesian inference of phylogeny," Syst. Biol., vol. 51, p. 673.
[8] J. A. A. Nylander, J. C. Wilgenbusch, D. L. Warren, and D. L. Swofford, "AWTY (are we there yet?): a system for graphical exploration of MCMC convergence in Bayesian phylogenetics," Bioinformatics, vol. 24, no. 4. [Online]. Available: bioinformatics24.html#nylanderwws08
[9] N. J, "Bayesian phylogenetic analysis of combined data," Syst. Biol., vol. 53, p. 47.
[10] S.-J. Sul, S. Matthews, and T. L. Williams, "Using tree diversity to compare phylogenetic heuristics," BMC Bioinformatics, vol. 10 (Suppl 4), no. S3.
[11] K. C. Nixon, "The parsimony ratchet, a new method for rapid parsimony analysis," Cladistics, vol. 15.
[12] J. Felsenstein, "The Newick tree format," Internet website, last accessed September 2009. URL: washington.edu/phylip/newicktree.html.

[13] U. Roshan, B. M. E. Moret, T. L. Williams, and T. Warnow, "Rec-I-DCM3: a fast algorithmic technique for reconstructing large phylogenetic trees," in Proc. IEEE Computer Society Bioinformatics Conference (CSB 2004). IEEE Press, 2004.
[14] N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller, "Equation of state calculations by fast computing machines," J. Chem. Phys., vol. 21.
[15] W. Hastings, "Monte Carlo sampling methods using Markov chains and their applications," Biometrika, vol. 57.
[16] P. A. Goloboff, J. S. Farris, and K. C. Nixon, "TNT, a free program for phylogenetic analysis," Cladistics, vol. 24, no. 5.
[17] D. L. Swofford, "PAUP*: Phylogenetic analysis using parsimony (and other methods)," 2002, Sinauer Associates, Sunderland, Massachusetts, Version 4.0.
[18] T. M. Mitchell, Machine Learning. New York: McGraw-Hill.
[19] L. A. Lewis and P. O. Lewis, "Unearthing the molecular phylodiversity of desert soil green algae (Chlorophyta)," Syst. Biol., vol. 54, no. 6.
[20] D. E. Soltis, M. A. Gitzendanner, and P. S. Soltis, "A 567-taxon data set for angiosperms: The challenges posed by Bayesian analyses of large data sets," Int. J. Plant Sci., vol. 168, no. 2.
[21] D. E. Soltis, P. S. Soltis, M. W. Chase, M. E. Mort, D. C. Albach, M. Zanis, V. Savolainen, W. H. Hahn, S. B. Hoot, M. F. Fay, M. Axtell, S. M. Swensen, L. M. Prince, W. J. Kress, K. C. Nixon, and J. S. Farris, "Angiosperm phylogeny inferred from 18S rDNA, rbcL, and atpB sequences," Botanical Journal of the Linnean Society, vol. 133.
[22] O. Bininda-Emonds, "Ratchet implementation in PAUP*4.0b10," 2003, available from Emonds.
[23] D. F. Robinson and L. R. Foulds, "Comparison of phylogenetic trees," Mathematical Biosciences, vol. 53, 1981.

TABLE V
The number of maximum parsimony trees in each run. Runs r0 to r4 were obtained from PAUPRat. The remaining trees were collected from Rec-I-DCM3.
Dataset #3: 567 taxa maximum parsimony trees
Columns: run r0 r1 r2 r3 r4 r5 r6 r7 r8 r9
(Tree counts are missing from the source text.)

TABLE VI
The depth of the decision trees when comparing pairs of maximum parsimony runs. NS stands for non-separable. The bipartitions from each run are so similar in these cases that the ID3 algorithm cannot classify them.
Dataset #3: 567 taxa maximum parsimony trees
Rows and columns: run r0 r1 r2 r3 r4 r5 r6 r7 r8 r9
(Most cell values are missing from the source text; one depth value, 39, and several NS entries survive, but their positions cannot be determined.)

TABLE VII
The upper triangle of this table shows the number of unique bipartitions when comparing pairs of maximum parsimony runs. The lower triangle depicts the percentage of total bipartitions used in the decision tree.
Dataset #3: 567 taxa maximum parsimony trees
Rows and columns: run r0 r1 r2 r3 r4 r5 r6 r7 r8 r9
(Cell values are missing from the source text apart from several NS entries whose positions cannot be determined.)


More information

Statistical Machine Translation: IBM Models 1 and 2

Statistical Machine Translation: IBM Models 1 and 2 Statistical Machine Translation: IBM Models 1 and 2 Michael Collins 1 Introduction The next few lectures of the course will be focused on machine translation, and in particular on statistical machine translation

More information

Bioinformatics Resources at a Glance

Bioinformatics Resources at a Glance Bioinformatics Resources at a Glance A Note about FASTA Format There are MANY free bioinformatics tools available online. Bioinformaticists have developed a standard format for nucleotide and protein sequences

More information

SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING

SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING AAS 07-228 SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING INTRODUCTION James G. Miller * Two historical uncorrelated track (UCT) processing approaches have been employed using general perturbations

More information

SYSM 6304: Risk and Decision Analysis Lecture 5: Methods of Risk Analysis

SYSM 6304: Risk and Decision Analysis Lecture 5: Methods of Risk Analysis SYSM 6304: Risk and Decision Analysis Lecture 5: Methods of Risk Analysis M. Vidyasagar Cecil & Ida Green Chair The University of Texas at Dallas Email: M.Vidyasagar@utdallas.edu October 17, 2015 Outline

More information

Compact Representations and Approximations for Compuation in Games

Compact Representations and Approximations for Compuation in Games Compact Representations and Approximations for Compuation in Games Kevin Swersky April 23, 2008 Abstract Compact representations have recently been developed as a way of both encoding the strategic interactions

More information

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written

More information

COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction

COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction COMP3420: Advanced Databases and Data Mining Classification and prediction: Introduction and Decision Tree Induction Lecture outline Classification versus prediction Classification A two step process Supervised

More information

Bioinformatics: Network Analysis

Bioinformatics: Network Analysis Bioinformatics: Network Analysis Graph-theoretic Properties of Biological Networks COMP 572 (BIOS 572 / BIOE 564) - Fall 2013 Luay Nakhleh, Rice University 1 Outline Architectural features Motifs, modules,

More information

A Brief Study of the Nurse Scheduling Problem (NSP)

A Brief Study of the Nurse Scheduling Problem (NSP) A Brief Study of the Nurse Scheduling Problem (NSP) Lizzy Augustine, Morgan Faer, Andreas Kavountzis, Reema Patel Submitted Tuesday December 15, 2009 0. Introduction and Background Our interest in the

More information

GRAPH THEORY LECTURE 4: TREES

GRAPH THEORY LECTURE 4: TREES GRAPH THEORY LECTURE 4: TREES Abstract. 3.1 presents some standard characterizations and properties of trees. 3.2 presents several different types of trees. 3.7 develops a counting method based on a bijection

More information

Classification and Prediction

Classification and Prediction Classification and Prediction Slides for Data Mining: Concepts and Techniques Chapter 7 Jiawei Han and Micheline Kamber Intelligent Database Systems Research Lab School of Computing Science Simon Fraser

More information

Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques

Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Subhashree K 1, Prakash P S 2 1 Student, Kongu Engineering College, Perundurai, Erode 2 Assistant Professor,

More information

DnaSP, DNA polymorphism analyses by the coalescent and other methods.

DnaSP, DNA polymorphism analyses by the coalescent and other methods. DnaSP, DNA polymorphism analyses by the coalescent and other methods. Author affiliation: Julio Rozas 1, *, Juan C. Sánchez-DelBarrio 2,3, Xavier Messeguer 2 and Ricardo Rozas 1 1 Departament de Genètica,

More information

Current Standard: Mathematical Concepts and Applications Shape, Space, and Measurement- Primary

Current Standard: Mathematical Concepts and Applications Shape, Space, and Measurement- Primary Shape, Space, and Measurement- Primary A student shall apply concepts of shape, space, and measurement to solve problems involving two- and three-dimensional shapes by demonstrating an understanding of:

More information