Using Decision Trees to Study the Convergence of Phylogenetic Analyses

Grant Brammer and Tiffani L. Williams

Abstract

In this paper, we explore the novel use of decision trees to study the convergence properties of phylogenetic analyses. A decision tree is constructed from the evolutionary relationships (or bipartitions) found in the evolutionary trees returned from a phylogenetic analysis. We treat evolutionary trees returned from multiple runs of a phylogenetic analysis as different classes. Then, we use the depth of a decision tree as a technique to measure how distinct the runs are from each other. Decision trees with shallow depth reflect non-convergence since the evolutionary trees can be classified with little information. Deep decision trees reflect convergence. We study Bayesian and maximum parsimony phylogenetic analyses consisting of thousands of trees. For some datasets studied here, a single distinguishing bipartition can classify the entire tree collection, suggesting non-convergence of the underlying phylogenetic analysis. Thus, we believe that decision trees lead to new insights with the potential for helping biologists reconstruct more robust evolutionary trees.

I. INTRODUCTION

Given a collection of organisms (or taxa), the objective of a phylogenetic analysis is to produce an evolutionary tree describing the genealogical relationships between the taxa. Bayesian inference, as implemented in the MrBayes [1] software package, is one of the most popular approaches for reconstructing evolutionary trees. Although Bayesian inference is very powerful, closed-form solutions are unlikely given the problem sizes of interest (hundreds to thousands of taxa, eventually scaling to reconstructing the Tree of Life, which is estimated to contain between 10 and 100 million taxa).
Markov Chain Monte Carlo (MCMC) sampling is a very versatile yet computationally intensive procedure, which produces samples of parameter values from the posterior distribution. Unfortunately, the main problem with MCMC simulations is assessing whether they have converged. The resulting samples come from the true distribution only after convergence [2]. For phylogenetic inference, non-convergence implies that the MCMC sampling did not sample from the distribution of the true evolutionary tree, leading to an inaccurate estimate of the true tree.

In this paper, we use decision tree learning as a novel approach to study whether a phylogenetic analysis has converged. A phylogenetic analysis takes as input a set of molecular sequences and typically outputs thousands of trees. Our objective is to examine these output trees for evidence that the phylogenetic analysis converged. Traditionally, life scientists take the collection of output trees produced by a phylogenetic analysis and summarize them with a single consensus tree. However, this results in an underestimation of the data [3], [4]. Furthermore, consensus trees provide little insight when studying the convergence properties of the underlying phylogenetic analysis.

Many phylogenetic methods employ computational intelligence techniques to infer the true evolutionary tree. In our work, we use a traditional machine learning approach (decision trees) to detect whether a phylogenetic analysis converged. In our construction of decision trees, the evolutionary relationships contained in the phylogenetic trees are the interior nodes in our decision trees. The leaves of the decision tree reflect the run that produced the trees of interest.

Grant Brammer and Tiffani Williams are with the Department of Computer Science and Engineering, Texas A&M University, College Station, Texas, USA ({grb, tlw}@cse.tamu.edu).
Since the most popular phylogenetic techniques (such as Bayesian inference) attempt to solve NP-hard optimization problems, multiple runs of a phylogenetic algorithm or heuristic are executed in order to escape local optima and ensure convergence. However, there are no good measures for whether either of these goals is achieved by running a phylogenetic technique multiple times. We believe decision trees can assist in this regard.

The depth of a decision tree is the crucial feature for determining whether a phylogenetic analysis, consisting of multiple runs, has converged. We define the depth of a decision tree to be the length of the longest path from the root to a leaf node. Deep (high depth) decision trees reflect a high level of sharing among the evolutionary trees since many evolutionary relationships have to be consulted in order to classify which runs produced the evolutionary trees. Shallow decision trees reflect less sharing of information across the runs. Hence, deep decision trees provide strong evidence that a phylogenetic analysis converged, while extremely shallow trees reflect non-convergence.

Our work leverages the idea that the depth of a decision tree can be used to measure the quality of a cluster. A similar idea has been used to develop new clustering methods [5]. Decision trees and feature selection are well-researched areas in the machine learning community, but they have not yet been thoroughly applied to phylogenetic data sets. In many ways, the work most similar in methods to ours is that which attempts to measure the quality of clusters produced by clustering phylogenetic trees. Stockham, Wang, and Warnow [4] cluster phylogenetic trees and use the concept of information loss as a measure of cluster quality, but do not address the issue of convergence. Unlike our work, most techniques for detecting convergence use tree scores [6], but rarely do they compare the evolutionary relationships contained in the trees to each other.
Comparing tree scores can be misleading since trees with similar scores are not necessarily close in parameter (tree) space [7], [8], [9], [10].

In our experiments, we study three biological tree collections consisting of (i) 4,898, (ii) 20,000, and (iii) 33,306 evolutionary trees. The smallest tree collection was the result of using two maximum parsimony techniques, parsimony ratchet [11] and Rec-I-DCM3. The larger collections, which come from published analyses and were obtained from life scientists, were the result of Bayesian analyses. Using the depth of a decision tree as our convergence criterion, our results show that the maximum parsimony analysis demonstrated convergence while both Bayesian analyses showed non-convergence. Surprisingly, non-convergence was depicted by a decision tree of depth one. For example, a single evolutionary relationship can be used to classify the whole 20,000-tree Bayesian data set. Thus, we believe that decision trees lead to new insights with the potential for helping biologists reconstruct more robust evolutionary trees.

II. BASICS

A. Evolutionary Trees and Their Newick Representation

An evolutionary tree (or phylogeny) is a depiction of the evolutionary relationships between a set of taxa (or organisms). Fig. 1 shows example phylogenetic trees on six taxa. A phylogenetic tree can be uniquely defined by its set of bipartitions (or edges). Removing the edge that corresponds to a bipartition splits the taxa into two sets. In Fig. 1, consider bipartition $b_0$ in tree $t_0$. It partitions the set of taxa into two groups: taxa a and b on one side and taxa c, d, e, and f on the other side. We represent this bipartition as ab|cdef, which is also shared by trees $t_1$ and $t_2$. In this paper, we only consider nontrivial bipartitions (internal edges). For a given set of n taxa, every tree also has bipartitions relating to external edges (i.e., edges connecting directly to taxa); these trivial bipartitions are the same in every tree, so they carry no information for distinguishing trees.
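To make the bipartition representation concrete, the following is a minimal Python sketch of how nontrivial bipartitions might be stored and compared. The helper names `canonical` and `fmt`, and the convention of always storing the smaller side of a split, are assumptions made for illustration and do not describe any particular implementation.

```python
# Taxa of the six-taxon example in Fig. 1.
TAXA = frozenset("abcdef")

def canonical(side):
    """Return one fixed side of the bipartition side | (TAXA - side).

    Assumption for illustration: keep the smaller side, breaking ties by
    alphabetical order, so both ways of writing a split map to one key.
    """
    side = frozenset(side)
    other = TAXA - side
    return min(side, other, key=lambda s: (len(s), sorted(s)))

def fmt(split):
    """Format a canonical split in the ab|cdef style used in the text."""
    return "".join(sorted(split)) + "|" + "".join(sorted(TAXA - split))

# Nontrivial bipartitions of t0 and t1 from Fig. 1.
t0 = {canonical("ab"), canonical("cd"), canonical("ef")}
t1 = {canonical("ab"), canonical("ce"), canonical("df")}
# The same tree t0, described from the other side of each internal edge.
t0_other_sides = {canonical("cdef"), canonical("abef"), canonical("abcd")}

print(t0 == t0_other_sides)       # True: same bipartition set, same tree
print([fmt(s) for s in t0 & t1])  # ['ab|cdef']: the shared bipartition
```

Because a tree is identified by its set of nontrivial bipartitions, comparing two trees reduces to comparing two sets, and the shared relationships are simply the set intersection.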
Most phylogenetic trees are unrooted since determining the root is a complex procedure. The Newick format [12] is the most widely used format to store a phylogenetic tree in a file. In this format, the topology of the evolutionary tree is represented using a notation based on balanced parentheses. Consider tree $t_0$ in Fig. 1. A Newick representation of the unrooted topology of this tree is (((a, b), (c, d)), (e, f));, where ; symbolizes the end of the Newick string. Matching pairs of parentheses symbolize internal nodes in the evolutionary tree. The Newick representation of a tree is not unique. For example, another valid Newick string for tree $t_0$ is ((e, f), ((d, c), (b, a)));.

B. Reconstructing Evolutionary Trees

In order to study the convergence properties of phylogenetic analyses using decision trees, we consider two types of phylogenetic analyses: Bayesian and maximum parsimony inference. In our experiments, MrBayes [1], a very popular software package for phylogenetics, is used to reconstruct the evolutionary trees based on Bayesian inference. Under maximum parsimony, we use two hill-climbing heuristics, Rec-I-DCM3 [13] and parsimony ratchet [11], to reconstruct the phylogenies.

a) Bayesian inference: The Bayesian approach is based on a quantity called the posterior probability of a tree. Bayes' theorem

$$\Pr(\text{Tree} \mid \text{Data}) = \frac{\Pr(\text{Data} \mid \text{Tree}) \, \Pr(\text{Tree})}{\Pr(\text{Data})}$$

is used to combine the prior probability of a phylogeny, Pr(Tree), with the likelihood, Pr(Data | Tree), to produce a posterior probability distribution on trees, Pr(Tree | Data). The posterior probability represents the probability that the tree is correct. Inferences about the history of the group are then based on the posterior probability of trees. Typically, all trees are considered a priori equally probable, and the likelihood is calculated using a substitution model of evolution.
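As a toy illustration of Bayes' theorem applied to trees, the sketch below computes posterior probabilities over a small, finite set of candidate trees with a uniform prior. The function name `posterior` and the likelihood values are hypothetical; the sketch also treats each likelihood as a single given number, which is a strong simplification of a real analysis.

```python
def posterior(likelihoods, priors=None):
    """Posterior Pr(Tree | Data) over a finite set of candidate trees."""
    trees = list(likelihoods)
    if priors is None:
        # All trees are considered a priori equally probable.
        priors = {t: 1.0 / len(trees) for t in trees}
    # Pr(Data) is the sum of likelihood * prior over every candidate tree.
    evidence = sum(likelihoods[t] * priors[t] for t in trees)
    return {t: likelihoods[t] * priors[t] / evidence for t in trees}

# Hypothetical likelihoods Pr(Data | Tree) for three candidate trees.
likelihoods = {"t0": 0.006, "t1": 0.003, "t2": 0.001}
post = posterior(likelihoods)
for tree, p in post.items():
    print(tree, round(p, 3))
```

With a uniform prior the posterior is just the normalized likelihood, so the tree with six times the likelihood of another receives six times its posterior probability.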
Computing the posterior probability involves a summation over all trees and, for each tree, integration over all possible combinations of branch length and substitution model parameter values. A number of numerical methods are available to allow the posterior probability to be approximated. The most common is Markov Chain Monte Carlo (MCMC). For the phylogeny problem, the MCMC algorithm involves two steps. First, a new tree is proposed by stochastically perturbing the current tree. Afterwards, the proposed tree is either accepted or rejected with a probability described by Metropolis et al. [14] and Hastings [15]. If the new tree is accepted, then it is subjected to more perturbations. For a properly constructed and adequately run Markov chain, the proportion of the time that any tree is visited is a valid approximation of the posterior probability of that tree.

b) Maximum parsimony: The maximum parsimony (MP) optimization criterion for inferring the evolutionary history of different taxa assumes that each of the taxa in the input is represented by a string over some alphabet. The symbols in the alphabet can represent nucleotides (in which case the input consists of DNA or RNA sequences) or amino acids (in which case the input consists of protein sequences), or may even include discrete characters for morphological properties. It is also assumed that the strings are put into a multiple alignment, so that they all have the same length. Maximum parsimony then seeks a tree, along with inferred ancestral sequences, that minimizes the total number of evolutionary events, counting only point mutations.

Most of the powerful heuristics [16], [11], [17] for solving the maximum parsimony problem incorporate hill-climbing in their design. Hill-climbing heuristics take an initial estimate of the phylogeny (e.g., a user-provided, random, or random sequence addition tree) and rearrange branches in it to reach neighboring trees.
If a rearrangement yields a better scoring tree, it becomes the new best tree and it is then submitted to a new round of rearrangements. The process continues until no better tree can be found in a full round.
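The hill-climbing loop described above can be sketched generically in Python. This is only an illustration of the control flow: the names `hill_climb`, `swaps`, and `inversions` are hypothetical, and the toy search space (orderings of six labels scored by their number of inversions) is a stand-in for tree space and parsimony scores. Unlike real parsimony searches, this toy problem has no local optima.

```python
from itertools import combinations

def hill_climb(start, neighbors, score):
    """Accept the first better-scoring neighbor, then start a new round.

    Stops when a full round of rearrangements finds no better candidate.
    Lower scores are better, as with parsimony scores.
    """
    best, best_score = start, score(start)
    improved = True
    while improved:
        improved = False
        for candidate in neighbors(best):
            candidate_score = score(candidate)
            if candidate_score < best_score:
                best, best_score = candidate, candidate_score
                improved = True
                break  # new best found: begin a new round from it
    return best, best_score

def swaps(state):
    """Toy 'rearrangements': every state reachable by swapping two items."""
    for i, j in combinations(range(len(state)), 2):
        nxt = list(state)
        nxt[i], nxt[j] = nxt[j], nxt[i]
        yield tuple(nxt)

def inversions(state):
    """Toy score: number of pairs that are out of alphabetical order."""
    return sum(a > b for a, b in combinations(state, 2))

result = hill_climb(("d", "b", "f", "a", "c", "e"), swaps, inversions)
print(result)  # (('a', 'b', 'c', 'd', 'e', 'f'), 0)
```

In a phylogenetic setting, `neighbors` would generate branch rearrangements of the current tree and `score` would return its parsimony score; the search can then stop at a local optimum, which is one reason multiple runs are executed.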
Fig. 1. Three evolutionary trees ($t_0$, $t_1$, and $t_2$) on six taxa. The taxa are labeled a, b, ..., f. The bipartitions ($b_0$, $b_9$, $b_{10}$, $b_{11}$, $b_{12}$, $b_{13}$, and $b_{14}$) are labeled according to Fig. 2(b).

C. Consensus Trees

Both Bayesian and maximum parsimony inference can easily produce thousands of trees. Consensus trees summarize the information contained in a set of evolutionary trees whose leaves are all from the same set of taxa. Strict consensus is the most conservative of the consensus methods and produces a tree with only those phylogenetic relationships that are supported by all the source trees. A majority rule consensus tree contains those relationships that appear in more than half of the source trees. Consider the three trees in Fig. 1. The bipartition $b_0$ appears in each of the trees. As a result, the unrooted Newick representation of the strict consensus tree will be ((a, b), (c, d, e, f));. For the majority tree, the only bipartition that appears in more than half of the trees is $b_0$. Hence, for this example, the strict and majority consensus trees for the phylogenies in Fig. 1 are the same. Also note that all of the bipartitions in a strict consensus tree will also be contained in the majority tree.

III. CONSTRUCTING DECISION TREES FOR PHYLOGENETIC ANALYSES

Now, we give our approach for constructing decision trees for a collection of k evolutionary trees obtained from a phylogenetic analysis. Our approach consists of two steps. First, we construct a bipartition table that consists of the unique bipartitions associated with our set of k phylogenetic trees. The bipartition table also contains class labels concerning the run of the phylogenetic search heuristic that produced each of the k evolutionary trees. The second step uses the bipartition table to create the decision tree using the ID3 (Iterative Dichotomiser 3) data-mining algorithm [18].

A. Step 1: Constructing a Bipartition Table

Consider the set of twelve evolutionary trees depicted in Fig. 2(a). First, we enumerate all of the bipartitions in each of the evolutionary trees. For example, tree $t_0$, which is shown pictorially in Fig. 1, has bipartitions ab|cdef, cd|abef, and ef|abcd, which are given bipartition identities (BIDs) $b_0$, $b_9$, and $b_{14}$, respectively. The BIDs for the unique bipartitions can be assigned in any manner. In our example, the bipartitions are sorted in lexicographical order and the BIDs are assigned in that order. Fig. 2(b) shows the set of unique bipartitions for the twelve trees in our example.

Each unique bipartition $b_i$ represents a column in the bipartition table. The rows of the table correspond to the input trees $t_j$. For each column labeled $b_i$, a 0 or 1 is assigned to each row based on whether that row's tree contains the bipartition. Consider Table I. Bipartition $b_0$ appears in trees $t_0$, $t_1$, $t_2$, and $t_3$. It does not appear in the remaining eight trees.

Once each bipartition (or feature) is encoded for each of the rows, the final step is to encode the known class labels for each tree. In our study, there are two class labels of interest: run and algorithm. Oftentimes, a phylogenetic technique is executed multiple times in order to improve the convergence rate of the underlying phylogenetic analysis. For each run, we know which trees were produced by it. Finally, a life scientist, in addition to running a phylogenetic technique multiple times, might also execute different phylogenetic techniques or algorithms. Hence, we could use the type of algorithm used to produce a tree as a class label. In the bipartition table shown in Table I, we only use a run class label, where we assume that two different runs of a phylogenetic algorithm produced the set of twelve trees.

B. Step 2: Building a Decision Tree

Once the bipartition table is constructed, we use our implementation of the ID3 algorithm, which is a top-down greedy approach that selects features based on information gain. For our problem, the features are bipartitions, which are the internal nodes of the decision tree. Class labels (run or algorithm) are the leaves of the decision tree. Class labels are also considered target features. The ID3 algorithm takes a feature set (the presence or absence of each bipartition) and a target feature (i.e., which run the evolutionary trees came from) and computes a decision tree, where each node in the tree represents a bipartition and each edge represents the presence or absence of that bipartition.

Fig. 2. Twelve evolutionary trees used as input to build the bipartition table in Table I. (a) shows the Newick representation of the twelve phylogenetic trees of interest. (b) provides a listing of the unique bipartitions that appear across the twelve trees.

(a) Newick strings
$t_0$: (((a, b), (c, d)), (e, f));
$t_1$: (((a, b), (c, e)), (d, f));
$t_2$: (((a, b), (c, f)), (d, e));
$t_3$: (((a, b), (c, d)), (e, f));
$t_4$: (((a, c), (b, d)), (e, f));
$t_5$: (((a, e), (b, d)), (c, f));
$t_6$: (((a, d), (b, c)), (e, f));
$t_7$: (((a, d), (b, c)), (e, f));
$t_8$: (((a, d), (b, f)), (c, e));
$t_9$: (((a, f), (b, c)), (d, e));
$t_{10}$: (((a, f), (b, d)), (c, e));
$t_{11}$: (((a, e), (b, c)), (d, f));

(b) Unique bipartitions
$b_0$: ab|cdef
$b_1$: ac|bdef
$b_2$: ad|bcef
$b_3$: ae|bcdf
$b_4$: af|bcde
$b_5$: bc|adef
$b_6$: bd|acef
$b_7$: be|acdf
$b_8$: bf|acde
$b_9$: cd|abef
$b_{10}$: ce|abdf
$b_{11}$: cf|abde
$b_{12}$: de|abcf
$b_{13}$: df|abce
$b_{14}$: ef|abcd

TABLE I. A BIPARTITION TABLE DEPICTING THE PRESENCE (1) OR ABSENCE (0) OF EACH BIPARTITION (LABELED $b_i$) IN THE TWELVE TREES SHOWN IN FIG. 2, ALONG WITH A COLUMN DENOTING THE RUN THAT GENERATED THE TREE. THE LIST OF BIPARTITIONS ARE OUR FEATURES AND THE RUN IS OUR CLASS LABEL SINCE WE KNOW WHAT TREES A PARTICULAR RUN GENERATED. IN THIS EXAMPLE, THERE ARE TWO RUNS THAT GENERATED THE TWELVE TREES. A BIPARTITION TABLE WILL SERVE AS INPUT FOR THE CONSTRUCTION OF A DECISION TREE.

Trees  b0  b1  b2  b3  b4  b5  b6  b7  b8  b9  b10  b11  b12  b13  b14  run label
t0     1   0   0   0   0   0   0   0   0   1   0    0    0    0    1    r1
t1     1   0   0   0   0   0   0   0   0   0   1    0    0    1    0    r1
t2     1   0   0   0   0   0   0   0   0   0   0    1    1    0    0    r1
t3     1   0   0   0   0   0   0   0   0   1   0    0    0    0    1    r1
t4     0   1   0   0   0   0   1   0   0   0   0    0    0    0    1    r1
t5     0   0   0   1   0   0   1   0   0   0   0    1    0    0    0    r1
t6     0   0   1   0   0   1   0   0   0   0   0    0    0    0    1    r2
t7     0   0   1   0   0   1   0   0   0   0   0    0    0    0    1    r2
t8     0   0   1   0   0   0   0   0   1   0   1    0    0    0    0    r2
t9     0   0   0   0   1   1   0   0   0   0   0    0    1    0    0    r2
t10    0   0   0   0   1   0   1   0   0   0   1    0    0    0    0    r2
t11    0   0   0   1   0   1   0   0   0   0   0    0    0    1    0    r2

Fig. 3. A decision tree constructed based on the input from Table I. Here, 3 bipartitions (out of 15) are needed to classify the twelve input trees by what run generated them. The root $b_0$ (depth 0) leads to leaf $r_1$ when present and to node $b_6$ (depth 1) when absent; $b_6$ leads to leaf $r_2$ when absent and to node $b_4$ (depth 2) when present; $b_4$ leads to leaf $r_2$ when present and to leaf $r_1$ when absent (depth 3).

Table I shows an example bipartition table as derived from the Newick strings in Fig. 2(a). Note that the last column in the bipartition table is the run label and not bipartition information. The ID3 algorithm computes the information gain of each bipartition relative to the run label. Information gain represents how well the bipartition information correlates with the run labels. The bipartition with the best information gain is selected and added as the root of the decision tree. In our example, bipartition $b_0$ is the root (see Fig. 3). All the trees with bipartition $b_0$ come from run $r_1$. Hence, the right child node is marked as run $r_1$ and has no children.
The trees that do not have bipartition $b_0$ are from both runs, so the process repeats on those trees to find the bipartition with the most information gain. The process continues until all the branches terminate in leaf nodes. Fig. 3 shows the resulting decision tree.

Equation (1) measures the entropy of a collection of trees.

$$\mathrm{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i \qquad (1)$$

For a given set $S$ of examples, the sum iterates over the $c$ different example values, with $p_i$ being the proportion of the examples that have the $i$-th value. To compute the entropy of the run labels, $S$ is the rightmost column of Table I. We would iterate over the labels $r_1$ and $r_2$, with $p_i$ being the proportion of each label.

Equation (2) computes information gain, where $S$ is the set of examples and $A$ is an attribute.

$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} \, \mathrm{Entropy}(S_v) \qquad (2)$$

$\mathrm{Values}(A)$ is the set of values that the attribute can have. For each $v$ in $\mathrm{Values}(A)$, $S_v$ is the subset of $S$ for which attribute $A$ has value $v$.
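The two steps of this section can be sketched end to end in Python on the twelve-tree example. The code below is a minimal illustration, not our implementation: the function names, the shortcut Newick parser (which only handles bare single-letter taxa without branch lengths), and the tie-breaking rule (the lowest BID wins when gains are equal) are all assumptions made for this sketch.

```python
import ast
import re
from collections import Counter
from itertools import combinations
from math import log2

TAXA = "abcdef"
# Newick strings t0..t11 from Fig. 2(a); the first six trees carry run
# label r1 and the last six carry run label r2, as in Table I.
NEWICK = [
    "(((a, b), (c, d)), (e, f));", "(((a, b), (c, e)), (d, f));",
    "(((a, b), (c, f)), (d, e));", "(((a, b), (c, d)), (e, f));",
    "(((a, c), (b, d)), (e, f));", "(((a, e), (b, d)), (c, f));",
    "(((a, d), (b, c)), (e, f));", "(((a, d), (b, c)), (e, f));",
    "(((a, d), (b, f)), (c, e));", "(((a, f), (b, c)), (d, e));",
    "(((a, f), (b, d)), (c, e));", "(((a, e), (b, c)), (d, f));",
]
RUNS = ["r1"] * 6 + ["r2"] * 6
# BIDs b0..b14: the two-taxon sides ab, ac, ..., ef in lexicographic order.
BIDS = [frozenset(pair) for pair in combinations(TAXA, 2)]

def parse(newick):
    """Shortcut parser (assumption): quote each taxon, read nested tuples."""
    return ast.literal_eval(re.sub(r"([a-z])", r"'\1'", newick.rstrip(";")))

def bipartitions(tree):
    """Collect nontrivial bipartitions, stored as the smaller side."""
    found = set()
    def clade(node):
        if isinstance(node, str):
            return frozenset(node)
        inside = frozenset().union(*(clade(child) for child in node))
        outside = frozenset(TAXA) - inside
        if len(inside) >= 2 and len(outside) >= 2:
            found.add(min(inside, outside, key=lambda s: (len(s), sorted(s))))
        return inside
    clade(tree)
    return found

def entropy(labels):
    """Equation (1) applied to a sequence of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(rows, labels, f):
    """Equation (2) for the binary feature in column f."""
    g = entropy(labels)
    for value in (0, 1):
        subset = [lab for row, lab in zip(rows, labels) if row[f] == value]
        if subset:
            g -= len(subset) / len(labels) * entropy(subset)
    return g

def id3(rows, labels, features):
    """Return a leaf label or a pair (feature, {value: subtree})."""
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: gain(rows, labels, f))  # first max wins
    rest = [f for f in features if f != best]
    children = {}
    for value in (0, 1):
        part = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        if part:
            sub_rows, sub_labels = zip(*part)
            children[value] = id3(sub_rows, sub_labels, rest)
    return best, children

def depth(node):
    """Length of the longest path from the root to a leaf."""
    if isinstance(node, str):
        return 0
    return 1 + max(depth(child) for child in node[1].values())

# Step 1: build the bipartition table (one 0/1 row per tree).
table = []
for newick in NEWICK:
    present = bipartitions(parse(newick))
    table.append([int(b in present) for b in BIDS])

# Step 2: build the decision tree with the run label as the target.
tree = id3(table, RUNS, list(range(len(BIDS))))
print(tree)
print("depth:", depth(tree))  # depth: 3
```

Under these assumptions the sketch selects $b_0$ at the root, then $b_6$, then $b_4$, which matches the three bipartitions and the depth reported for Fig. 3. Note that $b_0$ and $b_5$ have equal information gain at the root, so a different tie-breaking rule could produce a different tree of the same kind.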
IV. OUR DATASETS

The biological trees used in this study were obtained from two recent Bayesian analyses and a maximum parsimony (MP) analysis, which we describe below. The evolutionary trees from Datasets #1 and #2 were obtained from published analyses performed by biologists. The trees from Dataset #3 were obtained by us as a result of using well-recognized MP heuristics.

A. Dataset #1: 20,000 Trees on 150 Taxa

We received 20,000 trees from a Bayesian analysis of an alignment of 150 taxa (23 desert taxa and 127 others from freshwater, marine, and soil habitats) with 1,651 aligned sites [19]. Two independent runs consisting of 25 million generations (trees were sampled every 1,000 generations) were performed using the GTR+I+Γ model in MrBayes with four independent chains. The authors constructed a majority consensus tree in their study using the 20,000 trees from the last 10 million generations from each of the two runs.

B. Dataset #2: 33,306 Trees on 567 Taxa

We obtained 33,306 trees from a Bayesian analysis of a three-gene, 567-taxa (560 angiosperms, seven outgroups) dataset with 4,621 aligned characters, which is one of the largest Bayesian analyses done to date [20]. Twelve runs, with four chains each, using the GTR+I+Γ model in MrBayes ran for at least 10 million generations. Trees were sampled every 1,000 generations. The authors discuss the difficulties with combining trees from multiple runs. To obtain our collection of 33,306 trees, we discard the trees from the first 8 million generations.

C. Dataset #3: 4,898 Trees on 567 Taxa

We inferred 4,898 trees from a maximum parsimony (MP) analysis of a set of 567 three-gene (rbcL, atpB, and 18S) aligned DNA sequences (2,153 sites) of angiosperms [21]. We used two MP algorithms, parsimony ratchet [11] and Rec-I-DCM3, to obtain the evolutionary trees. Each MP algorithm created 5,000 trees for a total of 10,000 trees over 567 taxa.
The parsimony ratchet algorithm used in this paper is called Pauprat since we used a Perl script by Bininda-Emonds [22] to generate a PAUP* [17] batch file to run the parsimony ratchet heuristic. However, we were not interested in all 10,000 trees. Of these 10,000 trees, we were only interested in those that are near-optimal. Biologists typically use only the top-scoring (or near-optimal) trees in their published phylogenetic studies. In our experiments, we use parsimony trees that are $step_0$, $step_1$, and $step_2$ away from the best-known maximum parsimony score for this dataset, which is 44,165. Let $b$ be the best score and let $x$ represent the parsimony score of a tree $t_i$. Then, tree $t_i$ is $x - b$ steps away from the best score. In our experiments, $step_0$, $step_1$, and $step_2$ represent trees that are 0, 1, and 2 steps away from the best score, $b$, respectively. Between the two algorithms, there are 4,898 trees that fit this criterion.

V. RESULTS AND DISCUSSION

In our experiments, we compute the depths of the decision trees built from our collection of phylogenetic trees to study convergence. Shallow decision trees may indicate that a phylogenetic analysis did not converge. Deeper decision trees would indicate convergence.

A. Dataset #1: 150 Taxa Bayesian Trees

As explained in Section IV-A, two runs produced the 20,000 evolutionary trees for this dataset. Each run, $r_0$ and $r_1$, consisted of 10,000 trees. Surprisingly, after applying the ID3 algorithm to compare the two runs, a decision tree of depth one was created. In other words, there exists a single bipartition $b_i$ that appears in all of the evolutionary trees from run $r_0$, but $b_i$ does not appear in any evolutionary tree from run $r_1$. Hence, a single bipartition can separate the 20,000 evolutionary trees into their two runs! We refer to such a bipartition as a distinguishing bipartition.
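The definition of a distinguishing bipartition translates directly into set operations. The following Python sketch assumes each tree is given as the set of its nontrivial bipartitions; the function name `distinguishing` and the two tiny runs are hypothetical and serve only as an illustration.

```python
def distinguishing(run_a, run_b):
    """Return (only_a, only_b): bipartitions found in every tree of one
    run and in no tree of the other run."""
    in_all_a, in_any_a = set.intersection(*run_a), set.union(*run_a)
    in_all_b, in_any_b = set.intersection(*run_b), set.union(*run_b)
    return in_all_a - in_any_b, in_all_b - in_any_a

# Hypothetical toy runs; each tree is the set of its nontrivial bipartitions.
run_0 = [{"ab|cdef", "cd|abef", "ef|abcd"}, {"ab|cdef", "ce|abdf", "df|abce"}]
run_1 = [{"ad|bcef", "bc|adef", "ef|abcd"}, {"af|bcde", "bc|adef", "de|abcf"}]

only_0, only_1 = distinguishing(run_0, run_1)
print(only_0)  # {'ab|cdef'}
print(only_1)  # {'bc|adef'}
```

Whenever either returned set is non-empty, one bipartition is enough to separate the two runs, so a decision tree of depth one exists for that pair.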
Given that a single bipartition can classify the two phylogenetic runs, there is poor mixing of solutions from the different runs, which is symptomatic of non-convergence.

Even more interesting is that, for this dataset, distinguishing bipartitions would not appear in the strict or majority consensus tree. Since each run contains an equal number of trees, a distinguishing bipartition for this dataset appears in exactly half of the trees and therefore cannot appear in more than half of them, which is the requirement for appearing in the majority tree. If a bipartition is not in the majority tree, it cannot appear in the strict consensus, which requires the bipartition to appear in all 20,000 of the trees. Information related to distinguishing bipartitions would be overlooked by a consensus-tree-based analysis.

Decision trees of depth one mean that there is at least one distinguishing bipartition, but how many distinguishing bipartitions exist for this dataset? For the 150-taxa dataset, there are four bipartitions that one run contains but the other does not: two distinguishing bipartitions appeared in every evolutionary tree in run $r_0$ but never in run $r_1$, and two appeared in every tree in run $r_1$ but never in run $r_0$. Distinguishing bipartitions represent phylogenetic information that is strong in one region of tree space (returned every time in run $r_0$) and weak in others (returned none of the time in run $r_1$). If the phylogenetic analysis had been executed only once, there would have been very strong support or lack of support (depending on the run) for these bipartitions.

B. Dataset #2: 567 Taxa Bayesian Trees

Next, we turn to the 567-taxa set of 33,306 Bayesian trees. As explained in Section IV-B, these evolutionary trees were collected over twelve runs of MrBayes [1], one of the most popular software packages for reconstructing evolutionary trees. Table II shows the number of trees that were produced by each run.
Similarly to the 150-taxa dataset of Bayesian trees, our target feature is the run label. In our experiments, we created decision trees to measure the similarity of each pair of runs. The upper triangle of Table III shows the depths of these decision trees. Similarly to the 150-taxa Bayesian trees, there are distinguishing bipartitions. Given that there are twelve runs (instead of two), there are many decision trees of depth one. The few values greater than one are shaded in gray in the table.

TABLE II. THE NUMBER OF 567 TAXA BAYESIAN TREES IN EACH RUN. THERE ARE 12 TOTAL RUNS AND 33,306 TOTAL TREES.
Dataset #2: 567 taxa Bayesian trees. Columns: runs $r_0$ through $r_{11}$.
[Table entries were lost in extraction.]

TABLE III. BY COMPARING PAIRS OF BAYESIAN RUNS, THE UPPER TRIANGLE SHOWS THE DEPTHS OF THE CONSTRUCTED DECISION TREES. THE LOWER TRIANGLE PROVIDES THE NUMBER OF DIFFERENT BIPARTITIONS THAT CAN ACCOUNT FOR A DECISION TREE OF DEPTH ONE. DECISION TREES OF DEPTHS OTHER THAN ONE ARE SHADED IN GRAY.
Dataset #2: 567 taxa Bayesian trees. Rows and columns: runs $r_0$ through $r_{11}$.
[Table entries were lost in extraction.]

TABLE IV. THE UPPER TRIANGLE OF THIS TABLE SHOWS THE NUMBER OF UNIQUE BIPARTITIONS WHEN COMPARING PAIRS OF BAYESIAN RUNS. THE LOWER TRIANGLE DEPICTS THE PERCENTAGE OF TOTAL BIPARTITIONS USED IN THE DECISION TREE.
Dataset #2: 567 taxa Bayesian trees. Rows and columns: runs $r_0$ through $r_{11}$.
[Table entries were lost in extraction.]

For run comparisons with depth greater than one, a single distinguishing bipartition cannot classify the two sets of trees. For example, classifying runs $r_1$ and $r_5$ requires looking at 11 bipartitions (or a decision tree of depth 11) in order to determine the run that generated one of the 6,163 trees. Although the two runs do not contain the same trees, there are no bipartitions that appeared in all trees of one run and in none of the other. The decision must be made by looking at a combination of bipartitions instead of a single bipartition. Deeper decision trees require more bipartitions to distinguish the sets of evolutionary trees, which in turn reflects good mixing of the underlying phylogenetic information (bipartitions) and is a sign of convergence between the two runs.
If two runs contained identical trees, it would be impossible for the ID3 algorithm to separate the trees into two different classes based on the same bipartition information. Hence, the fact that we were able to produce these decision trees means that no two runs returned the same tree.

Next, we count the number of distinguishing bipartitions that create decision trees of depth one. The lower triangle of Table III shows the results. For example, there are 41 distinguishing bipartitions separating runs $r_2$ and $r_6$. The presence of distinguishing bipartitions suggests that the phylogenetic analysis that produced these trees did not converge. There are strong distinctions between the bipartitions reported across runs. Moreover, the various runs appear to be stuck in local optima and return results influenced by their independent areas of the exponentially sized tree space. Table IV shows that the number of distinguishing bipartitions increases drastically when comparing pairs of runs.

C. Dataset #3: 567 Taxa Maximum Parsimony Trees

For our final analysis, we apply decision trees to evolutionary trees obtained from two maximum parsimony (MP) heuristics, Rec-I-DCM3 and Pauprat. Table V shows the number of evolutionary trees collected from each run. Each of the MP heuristics required five runs to produce its set of evolutionary trees. Runs $r_0$, $r_1$, $r_2$, $r_3$, and $r_4$ were generated by Pauprat, while the remaining runs were obtained from Rec-I-DCM3. Similarly to the Bayesian tree datasets, Table VI provides the depth of the resulting decision trees using runs as the target feature. The results show that the Rec-I-DCM3 runs are quite self-similar. Many of the run-by-run comparisons are non-separable, which we denote by the label NS. In other words, the two runs being compared returned at least one identical tree. As a result, the bipartition information cannot separate the evolutionary trees across the runs. In comparison to the Bayesian trees, the MP trees are more similar.
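The non-separable (NS) case can be checked before any decision tree is built. The Python sketch below assumes, as in the text, that two runs are non-separable when they return at least one identical tree, and that a tree is identified by its set of nontrivial bipartitions. The function names and the toy runs are hypothetical.

```python
def shared_trees(run_a, run_b):
    """Topologies (as bipartition sets) that were returned by both runs."""
    return {frozenset(t) for t in run_a} & {frozenset(t) for t in run_b}

def is_non_separable(run_a, run_b):
    """True when no set of bipartitions can fully separate the two runs."""
    return bool(shared_trees(run_a, run_b))

# Hypothetical toy runs; each tree is the set of its nontrivial bipartitions.
run_x = [{"ab|cdef", "cd|abef", "ef|abcd"}, {"ab|cdef", "ce|abdf", "df|abce"}]
run_y = [{"ef|abcd", "cd|abef", "ab|cdef"}, {"ad|bcef", "bc|adef", "ef|abcd"}]
run_z = [{"af|bcde", "bc|adef", "de|abcf"}]

print(is_non_separable(run_x, run_y))  # True: both runs contain the same tree
print(is_non_separable(run_x, run_z))  # False
```

An identical tree produces identical rows in the bipartition table with different run labels, so no choice of features can place those two rows in different leaves.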
Given the depth-level of the resulting decision trees, the MP trees represent a better example of convergence than their Bayesian counterparts. Based on the percentages of bipartitions used to create the the decision trees from the lower triangles of Tables IV and VII, the parsimony trees are up to two orders of magnitude more similar to each other than their Bayesian counterparts. VI. CONCLUSIONS AND FUTURE WORK Determining whether a phylogenetic analysis has converged is an important problem in phylogenetics. Without convergence, the robustness of a phylogenetic analysis is unclear and leads to inaccurate hypotheses of how the taxa evolved from a common ancestor. Popular phylogenetic approaches often leverage computational intelligence techniques to infer the true evolutionary tree. In this paper, we have shown how to use the depth of a decision tree (a traditional machine learning technique) as a measure of convergence for a phylogenetic analysis. Currently, there are relatively few techniques available to help biologists determine whether their analyses have converged. In addition to decision tree depth, the novelty of our work is in using bipartitions as the foundation for determining whether there is sufficient mixing of information to justify convergence. In our study of three, large biological tree collections obtained from Bayesian and maximum parsimony analyses, we showed that non-convergence was a property of the Bayesian analysis, which resulted in decision trees of depth one. That is, one bipartition is sufficient to classify thousands of trees into two groups (or runs). The maximum parsimony trees, on the other hand, resulted in decision trees of high depth, which is a requirement for phylogenetic convergence. We believe decision trees are a step forward in terms of defining concrete measures of convergence that biologists can use to build more robust and accurate evolutionary trees. 
Moreover, our results can be used potentially to help design better phylogenetic heuristics especially as it relates to avoiding getting trapped in local optima which is a major source of non-convergence. Our future work includes studying more datasets with our new convergence measure, incorporating tree scores into convergence framework, comparing decision trees to topology measures such as the Robinson-Foulds distance [23], and expanding our understanding of distinguishing bipartitions. VII. ACKNOWLEDGMENTS Funding for this project was supported by the National Science Foundation under grants DEB and IIS REFERENCES [1] J. P. Huelsenbeck and F. Ronquist, MRBAYES: Bayesian inference of phylogenetic trees, Bioinformatics, vol. 17, no. 8, pp , [2] J. Peltonen, J. Venna, and S. Kaski, Visualizations for assessing convergence and mixing of markov chain monte carlo simulations, Comput. Stat. Data Anal., vol. 53, no. 12, pp , [3] D. M. Hillis, T. A. Heath, and K. S. John, Analysis and visualization of tree space, Syst. Biol, vol. 54, no. 3, pp , [4] C. Stockham, L. S. Wang, and T. Warnow, Statistically based postprocessing of phylogenetic analysis by clustering, in Proceedings of 10th Int l Conf. on Intelligent Systems for Molecular Biology (ISMB 02), 2002, pp [5] B. Liu, Y. Xia, and P. S. Yu, Clustering through decision tree construction, in CIKM 00: Proceedings of the ninth international conference on Information and knowledge management. New York, NY, USA: ACM, 2000, pp [6] A. Rambaut and A. J. Drummond, Tracer v1.4, [Online]. Available: [7] J. P. Huelsenbeck, B. Larget, R. Miller, and F. Ronquist, Potential applications and pitfalls of bayesian inference of phylogeny, Syst. Biol., vol. 51, p. 673, [8] J. A. A. Nylander, J. C. Wilgenbusch, D. L. Warren, and D. L. Swofford, Awty (are we there yet?): a system for graphical exploration of mcmc convergence in bayesian phylogenetics. Bioinformatics, vol. 24, no. 4, pp , [Online]. 
Available: bioinformatics24.html#nylanderwws08
[9] N. J, Bayesian phylogenetic analysis of combined data, Syst. Biol., vol. 53, p. 47,
[10] S.-J. Sul, S. Matthews, and T. L. Williams, Using tree diversity to compare phylogenetic heuristics, BMC Bioinformatics, vol. 10 (Suppl 4), no. S3,
[11] K. C. Nixon, The parsimony ratchet, a new method for rapid parsimony analysis, Cladistics, vol. 15, pp ,
[12] J. Felsenstein, The Newick tree format, Internet Website, last accessed September 2009, newick URL: washington.edu/phylip/newicktree.html.

TABLE V
THE NUMBER OF MAXIMUM PARSIMONY TREES IN EACH RUN. RUNS r0 TO r4 WERE OBTAINED FROM PAUPRAT. THE REMAINING TREES WERE COLLECTED FROM REC-I-DCM3.
Dataset #3: 567 taxa maximum parsimony trees (runs r0 to r9).

TABLE VI
THE DEPTH OF THE DECISION TREES WHEN COMPARING PAIRS OF MAXIMUM PARSIMONY RUNS. NS STANDS FOR NON-SEPARABLE. THE BIPARTITIONS FROM EACH RUN ARE SO SIMILAR IN THESE CASES THAT THE ID3 ALGORITHM CANNOT CLASSIFY THEM.
Dataset #3: 567 taxa maximum parsimony trees (runs r0 to r9).

TABLE VII
THE UPPER TRIANGLE OF THIS TABLE SHOWS THE NUMBER OF UNIQUE BIPARTITIONS WHEN COMPARING PAIRS OF MAXIMUM PARSIMONY RUNS. THE LOWER TRIANGLE DEPICTS THE PERCENTAGE OF TOTAL BIPARTITIONS USED IN THE DECISION TREE.
Dataset #3: 567 taxa maximum parsimony trees (runs r0 to r9).

[13] U. Roshan, B. M. E. Moret, T. L. Williams, and T. Warnow, Rec-I-DCM3: a fast algorithmic technique for reconstructing large phylogenetic trees, in Proc. IEEE Computer Society Bioinformatics Conference (CSB 2004). IEEE Press, 2004, pp
[14] N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller, Equation of state calculations by fast computing machines, J. Chem. Phys., vol. 21, pp ,
[15] W. Hastings, Monte Carlo sampling methods using Markov chains and their applications, Biometrika, vol. 57, pp ,
[16] P. A. Goloboff, J. S. Farris, and K. C. Nixon, TNT, a free program for phylogenetic analysis, Cladistics, vol. 24, no. 5, pp ,
[17] D. L. Swofford, PAUP*: Phylogenetic analysis using parsimony (and other methods), 2002, Sinauer Associates, Sunderland, Massachusetts, Version 4.0.
[18] T. M. Mitchell, Machine Learning. New York: McGraw-Hill,
[19] L. A. Lewis and P. O. Lewis, Unearthing the molecular phylodiversity of desert soil green algae (Chlorophyta), Syst. Biol., vol. 54, no. 6, pp ,
[20] D.
E. Soltis, M. A. Gitzendanner, and P. S. Soltis, A 567-taxon data set for angiosperms: The challenges posed by Bayesian analyses of large data sets, Int. J. Plant Sci., vol. 168, no. 2, pp ,
[21] D. E. Soltis, P. S. Soltis, M. W. Chase, M. E. Mort, D. C. Albach, M. Zanis, V. Savolainen, W. H. Hahn, S. B. Hoot, M. F. Fay, M. Axtell, S. M. Swensen, L. M. Prince, W. J. Kress, K. C. Nixon, and J. S. Farris, Angiosperm phylogeny inferred from 18S rDNA, rbcL, and atpB sequences, Botanical Journal of the Linnean Society, vol. 133, pp ,
[22] O. Bininda-Emonds, Ratchet implementation in PAUP*4.0b10, 2003, available from Emonds.
[23] D. F. Robinson and L. R. Foulds, Comparison of phylogenetic trees, Mathematical Biosciences, vol. 53, pp , 1981.