4 Techniques for Analyzing Large Data Sets


Pablo A. Goloboff

Contents

1 Introduction
2 Traditional Techniques
3 Composite Optima: Why Do Traditional Techniques Fail?
4 Techniques for Analyzing Large Data Sets
4.1 Ratchet
4.2 Sectorial searches
4.3 Tree-fusing
4.4 Tree-drifting
4.5 Combined methods
4.6 Minimum length: multiple trees or multiple hits?
5 TNT: Implementation of the New Methods
6 Remarks and Conclusions
Acknowledgments
References

1 Introduction

Parsimony problems with medium or large numbers of taxa can be analyzed only by means of trial-and-error or "heuristic" methods. Traditional strategies for finding most parsimonious trees have long been in use, implemented in the programs Hennig86 [1], PAUP [2], and NONA [3]. Although successful for small and medium-sized data sets, these techniques normally fail when analyzing very large data sets, i.e., data sets with 200 or more taxa. This is because, rather than simply requiring more of the same kind of work used to analyze smaller data sets, very large data sets require the use of qualitatively different techniques. The techniques described here have so far been used only for prealigned sequences, but they could be adapted for other methods of analysis, like the direct optimization method of Wheeler [4].

Methods and Tools in Biosciences and Medicine: Techniques in molecular systematics and evolution, ed. by Rob DeSalle et al., Birkhäuser Verlag, Basel, Switzerland.

2 Traditional Techniques

The two basic heuristic techniques for finding most parsimonious trees are Wagner trees and branch-swapping. A Wagner tree is a tree created by sequentially adding the taxa at the most parsimonious available branch. At each point during the addition of taxa, only part of the data are actually used. A taxon may be placed best in some part of the tree when only some taxa are present, but it may be placed best somewhere else when all the taxa are considered. Therefore, the order in which taxa are added determines the outcome of a Wagner tree, so that different addition sequences will lead - for large data sets - to different results.

Branch-swapping is a widely used technique for improving the trees produced by the Wagner method. Branch-swapping takes a tree and evaluates the parsimony of each of a series of branch rearrangements (discarding, adding, or replacing the new tree if it is, respectively, worse than, equal to, or better than previously found trees). The number of rearrangements needed to complete swapping depends strongly on the number of taxa. The most widely used branch-swapping algorithm is "tree bisection reconnection" or TBR ([5]; called "branch-breaking" in Hennig86 [1]). In TBR, the tree is clipped in two, and the two subtrees are rejoined in each possible way. The number of rearrangements to complete TBR increases with the cube of the number of taxa, and thus completing TBR on a tree with twice the taxa takes much more than twice as long. Thus, if a tree of 10 taxa requires x rearrangements for complete swapping, a tree of 20 taxa will require 8x, one of 40 taxa will require 50x, and one of 80 taxa will require 400x. Because of special short-cuts, which allow deducing tree length for rearrangements without unnecessary calculations (see [6, 7] for basic descriptions, and [8] for a description of techniques for multi-character optimization), the rearrangements for larger trees can in many cases be evaluated more quickly than the rearrangements for smaller trees. Therefore, the time for swapping increases in those cases with less than the cube of the number of taxa (although it is still more than the square). In implementations which do not (or cannot) use some of these short-cuts, the time to complete TBR may well increase with the cube of the number of taxa (the use of some of the techniques described here, like sectorial searches and tree-fusing, would be even more beneficial under those circumstances).

For even relatively small data sets (i.e., 30 or 40 taxa), TBR may be unable, given some starting trees, to find the most parsimonious trees. In computer science, this is known as the problem of local optima (known in systematics as the problem of "islands" of trees; [9]). This is easily visualized by thinking of the parsimony of the trees as a "landscape" with peaks and valleys. The goal of the analysis is to get to the highest possible peak; this is done by taking a series of "steps" in several possible directions, going back if the step took us to a lower elevation, continuing from the new point if the step took us higher. Note that if the "steps" with which the swapping algorithm "walks" in this landscape are too short, it may easily get trapped on an isolated peak of non-maximal height. To reach higher peaks, the algorithm would have to descend and then go up again - but the algorithm does not do so, by virtue of its own design.
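
To make the addition-order dependence concrete, here is a minimal Python sketch of Wagner-style stepwise addition under Fitch parsimony, using nested tuples as trees and a small hypothetical character matrix. It is only an illustration of the idea, not the implementation used in the programs cited above; the rooted representation and the toy data are assumptions made for the sketch.

    import random

    def fitch(tree, matrix):
        """Return (state_sets, length) of a (sub)tree under Fitch parsimony."""
        if isinstance(tree, str):                      # leaf: taxon name
            return [{s} for s in matrix[tree]], 0
        left, right = tree
        lsets, llen = fitch(left, matrix)
        rsets, rlen = fitch(right, matrix)
        length, sets = llen + rlen, []
        for a, b in zip(lsets, rsets):
            if a & b:
                sets.append(a & b)
            else:
                sets.append(a | b)
                length += 1                            # one extra step for this character
        return sets, length

    def tree_length(tree, matrix):
        return fitch(tree, matrix)[1]

    def attachments(tree, leaf):
        """Yield every tree obtained by attaching `leaf` on some branch of `tree`
        (rooted nested-tuple representation; a few placements are duplicated,
        which is harmless when taking the minimum)."""
        yield (tree, leaf)
        if not isinstance(tree, str):
            left, right = tree
            for t in attachments(left, leaf):
                yield (t, right)
            for t in attachments(right, leaf):
                yield (left, t)

    def wagner_tree(matrix, order):
        """Stepwise addition: add taxa in `order`, each at its best available branch."""
        tree = (order[0], order[1])
        for taxon in order[2:]:
            tree = min(attachments(tree, taxon),
                       key=lambda t: tree_length(t, matrix))
        return tree

    # Hypothetical toy matrix (taxon -> character states); not data from the chapter.
    matrix = {
        "A": "00000", "B": "00011", "C": "01101",
        "D": "11100", "E": "11010", "F": "10111",
    }
    random.seed(1)
    for _ in range(3):                                 # three random addition sequences
        order = random.sample(sorted(matrix), len(matrix))
        t = wagner_tree(matrix, order)
        print(order, "length:", tree_length(t, matrix), t)

Running the loop shows that the tree obtained (and sometimes its length) can change with the addition order, which is why multiple starting points are useful in the first place.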

The two traditional strategies around the problem of local optima for the TBR algorithm are the use of multiple starting points for TBR and the retention of suboptimal trees during swapping. The first is more efficient and is thus the only one that will be considered here. The multiple starting points for TBR are best obtained by building Wagner trees with different addition sequences. Typically, the addition sequence is randomized to obtain many different Wagner trees to be later input to TBR - this has been termed a "random addition sequence" or RAS. The expectation is that some of the initial trees will eventually be near or on the slopes of the highest peaks. For data sets of 50 to 150 taxa, this method generally works well, although it may require large numbers of RAS+TBR replications. The strategy of RAS+TBR, however, is very inefficient for data sets of much larger size. It might appear that larger data sets simply require a larger number of replications, but the number of RAS+TBR replications needed to actually find optimal trees for data sets with 500 or more taxa seems to increase exponentially.

3 Composite Optima: Why Do Traditional Techniques Fail?

Traditional techniques fail because very large trees can exhibit what Goloboff [10] termed composite optima. The TBR algorithm can get stuck in local optima even for data sets of moderate size. But a tree with (say) 500 taxa has many regions or sectors that can be seen as sub-problems of 50 taxa. Each of these sub-problems might have its own "local" and "global" optima. Whether a given sector is in a globally optimal configuration will be, to some extent, independent of whether other sectors in the tree are in their optimal configurations. For a tree to be optimal, all sectors in the tree have to be in a globally optimal configuration, but the chances of achieving this result in a given RAS+TBR may be extremely low. If five sectors of the tree are in an optimal configuration, simply starting a new RAS+TBR will possibly place other sectors of the tree in optimal configurations, but it is unlikely also to place those same five sectors in optimal configurations again.

Consider the following analogy: you have six dice, and the goal is to achieve the highest sum of values by throwing them. You can either take the six dice and throw all of them at once, in which case the probability of getting the highest value on a given throw is (1/6)^6, or about 2 in 100,000. Or, you can use a divisive strategy: throw all the dice together only once, and then take each of the six dice and, in turn, throw it 50 times, keeping the highest value in each case. In the first case, you may well not find the highest possible value in 100,000 throws. With the divisive strategy of the second case, you would be almost guaranteed to find the highest possible value with a total of 301 throws.
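
The dice analogy can be checked directly by simulation. The short Python script below is an illustrative toy (not part of the chapter) that compares the two strategies using the same total of 301 throws per trial.

    import random

    def all_at_once(throws):
        """Did any of `throws` joint throws of six dice reach the maximum sum of 36?"""
        return any(sum(random.randint(1, 6) for _ in range(6)) == 36
                   for _ in range(throws))

    def divisive(throws_per_die=50):
        """One joint throw, then improve each die independently, keeping its best value
        (1 + 6*50 = 301 throws in total)."""
        dice = [random.randint(1, 6) for _ in range(6)]
        for i in range(6):
            for _ in range(throws_per_die):
                dice[i] = max(dice[i], random.randint(1, 6))
        return sum(dice) == 36

    random.seed(0)
    trials = 2000
    print("all-at-once, 301 throws:", sum(all_at_once(301) for _ in range(trials)) / trials)
    print("divisive,    301 throws:", sum(divisive() for _ in range(trials)) / trials)

With 301 throws per trial, the all-at-once strategy almost never reaches the maximum, while the divisive strategy almost always does.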

In the real world, parsimony problems do not have sectors as clearly identified as the dice, and the resolution of different sectors is often not really independent. This simply makes the problem more difficult. It is then easy to understand why finding a shortest tree using RAS+TBR may become so difficult for large real data sets. Consider a tree of 500 taxa; such a tree could have 10 different sectors, each of which can have its own local optima. If a given RAS+TBR has a chance of 0.5 of finding a globally optimal configuration for a given sector, then the chance that a given RAS+TBR finds a most parsimonious tree is (0.5)^10, or less than 1 in 1,000. Thus, as the number of taxa grows, not only does the number of rearrangements necessary to complete TBR swapping increase steeply, but so does the number of RAS+TBR replications that have to be done in order to find optimal trees.

4 Techniques for Analyzing Large Data Sets

The best way to analyze data sets with composite optima is by means of a strategy analogous to the divisive strategy described above for the dice. Simply re-starting a new replication every time a replication of RAS+TBR gets stuck will not do the job in a reasonable time. Four basic methods have been proposed to cope with the problem of local optima. The first one to be developed was the parsimony ratchet ([11], originally presented at a symposium in 1998; see [12]). Subsequently developed methods are sectorial searches, tree-fusing, and tree-drifting [10]. The expected difference in performance between the traditional and these new techniques is about as large as the difference between the two strategies for throwing the dice.

4.1 Ratchet

The ratchet is based on slightly perturbing the data once TBR gets stuck, repeating a TBR search for the perturbed data using the same tree as the starting point, and then using the resulting tree as the starting point for a new search under the original data. The perturbation is normally done either by increasing the weights of a proportion (10 to 15%) of the characters, or by eliminating some characters, as in jackknifing (but with lower probabilities of deletion). The TBR searches for both the perturbed and the original data must be made saving only one (or very few) trees; the effectiveness of the ratchet is not significantly increased by saving more trees, but run times are (see [11] for details). The ratchet works because the perturbation phase makes partial changes to the tree without changing its entire structure. The changes are made, at each round, to only part of the tree, improving, it is hoped, the tree a few parts at a time. In the end, the changes made by the ratchet are determined by character conflict: a given TBR rearrangement can improve the tree for the perturbed data only if some characters actually favor the alternative groupings. Since it is character conflict in the first place that determines the existence of local optima, the ratchet addresses the problem of local optima at its very heart.
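
The control flow of the ratchet is easy to express. The sketch below is a deliberately simplified stand-in, not a parsimony implementation: a greedy bit-flip search plays the role of TBR, and weighted mismatch counts against hypothetical "characters" play the role of tree length, so only the alternation between perturbed and original weights is shown.

    import random

    def hill_climb(state, score, neighbor, steps=300):
        """Greedy local search standing in for TBR: keep only a single best state."""
        best, best_s = state, score(state)
        for _ in range(steps):
            cand = neighbor(best)
            s = score(cand)
            if s <= best_s:                            # lower "length" is better
                best, best_s = cand, s
        return best

    def ratchet(start, weights, char_len, neighbor, cycles=10, frac=0.15):
        """Ratchet control flow: alternate searches under reweighted and original
        characters, always continuing from the current tree and keeping one tree."""
        def total(w):                                  # weighted "tree length"
            return lambda s: sum(wi * f(s) for wi, f in zip(w, char_len))
        best = hill_climb(start, total(weights), neighbor)
        for _ in range(cycles):
            # perturbation phase: upweight a random ~15% of the characters
            perturbed = [w * 3 if random.random() < frac else w for w in weights]
            best = hill_climb(best, total(perturbed), neighbor)
            # search phase: same starting tree, original weights
            best = hill_climb(best, total(weights), neighbor)
        return best

    # Toy stand-in: a "tree" is a bit vector; each "character" counts mismatches
    # against its own preferred pattern, so the characters conflict with each other.
    random.seed(2)
    patterns = [[random.randint(0, 1) for _ in range(20)] for _ in range(30)]
    char_len = [lambda s, p=p: sum(a != b for a, b in zip(s, p)) for p in patterns]
    weights = [1] * len(char_len)

    def neighbor(s):                                   # stand-in for one rearrangement
        i = random.randrange(len(s))
        return s[:i] + [1 - s[i]] + s[i + 1:]

    best = ratchet([0] * 20, weights, char_len, neighbor)
    print("final length:", sum(f(best) for f in char_len))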

The ratchet is very effective for finding shortest trees. In the case of the 500-taxon data set of Chase et al. [13], the ratchet can find a shortest tree in about 2 hours (on a 266 MHz Pentium II machine). Using only multiple RAS+TBR, it takes from 48 to 72 hours to find minimum length for that data set.

4.2 Sectorial searches

A sectorial search chooses a sector of the tree of a size that can be properly handled by the TBR algorithm, creates a reduced data set for that part of the tree, and analyzes that sector with some number of RAS+TBR (without saving multiple trees). The best tree found for the sector then replaces that sector in the entire tree. The process is repeated several times, choosing different sectors. The sectors can be chosen at random, or based on a consensus previously calculated by some means. Details are given in Goloboff [10]. Sectorial searches find short trees much more effectively than TBR alone; in the case of Chase et al.'s data set, finding trees below a given step threshold using TBR alone would require over 10 times more replications than when using sectorial searches, and this would take about 7 times longer. Sectorial searches alone rarely find an optimal tree for large data sets. Used alone, they are less effective than the ratchet, normally going down to some non-minimal length (much lower than TBR alone) and then getting stuck. Sectorial searches, however, analyze many reduced data sets, which take almost no time at all. They thus have the advantage of getting down to a non-minimal length faster than the ratchet, and they are therefore useful as initial stages of the search, in combination with other methods.

4.3 Tree-fusing

Tree-fusing takes two trees and evaluates all possible exchanges of sub-trees with identical taxon composition. The sub-tree exchanges that improve the tree are then actually made. See Goloboff [10] for details. Tree-fusing is best done by successively fusing pairs of trees and thus needs several trees as input to produce results; getting those trees will require several replications of RAS+TBR, possibly followed by some other method (like a sectorial search, ratchet, or tree-drifting). Once several close-to-optimal trees have been obtained, tree-fusing produces dramatic improvements in almost no time. It is easy to see why: each of the sectors will be in an optimal configuration in at least some of the trees, and tree-fusing simply merges together those optimal sectors to achieve a globally optimal tree. In this sense, tree-fusing makes it possible to make good use of trees which are not globally optimal, as long as they have at least some sectors in optimal configuration.
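
The first step of tree-fusing - identifying subtrees of two trees that have identical taxon composition but different resolution - can be sketched as follows. The nested-tuple representation and the example topologies are hypothetical; evaluating the exchanges by tree length and accepting the improving ones is not shown.

    def subtrees(tree):
        """Map each subtree's leaf set (a frozenset) to the subtree itself."""
        out = {}
        def walk(t):
            if isinstance(t, str):
                leaves = frozenset([t])
            else:
                leaves = frozenset.union(*map(walk, t))
            out[leaves] = t
            return leaves
        walk(tree)
        return out

    def exchange_candidates(tree_a, tree_b):
        """Pairs of subtrees with identical taxon composition but different shape:
        the exchanges a tree-fusing step would evaluate."""
        sa, sb = subtrees(tree_a), subtrees(tree_b)
        shared = set(sa) & set(sb)
        full = max(shared, key=len)                    # the complete taxon set itself
        # Note: tuple comparison is order-sensitive, a simplification of this toy.
        return [(sa[k], sb[k]) for k in shared
                if 1 < len(k) < len(full) and sa[k] != sb[k]]

    # Hypothetical example trees (nested tuples); not data from the chapter.
    t1 = ((("A", "B"), ("C", "D")), (("E", "F"), "G"))
    t2 = ((("A", "C"), ("B", "D")), (("E", "G"), "F"))
    for a, b in exchange_candidates(t1, t2):
        print(a, "<->", b)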

4.4 Tree-drifting

Tree-drifting is based on an idea quite similar to that of the ratchet. It consists of rounds of TBR that alternately accept only optimal trees, and suboptimal as well as optimal trees. The suboptimal trees are accepted, during the drift phase, with a probability that depends on how suboptimal they are. One of the key components of the method is the function for determining the probability of acceptance, which is based on both the absolute step difference and a measure of character conflict (the relative fit difference, which is the ratio of steps gained and saved over all characters between the two trees; see [14]). Trees as good as or better than the one being swapped are always accepted. Once a given number of rearrangements has been accepted, a round of TBR accepting only optimal trees is made, and the process is repeated (as in the ratchet) a certain number of times. Tree-drifting is about as effective as the ratchet at finding shortest trees, although in current implementations tree-drifting seems to find minimum length about two to three times faster than the ratchet itself. This difference is probably a consequence of the fact that the ratchet analyzes the perturbed data set until completion of TBR, while the equivalent phase in tree-drifting only accepts a fixed number of rearrangements. Since there is no point in having the ratchet find the actually optimal trees for the perturbed data, the ratchet could easily be modified so that the perturbed phase finishes as soon as a certain number of rearrangements has been accepted. Most likely this would make the ratchet about as fast as tree-drifting.
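
A sketch of the drift-phase acceptance rule is given below. The exponential form and the constants are hypothetical stand-ins chosen only to show the shape of the rule (worse trees accepted with a probability that falls with both the step difference and the relative fit difference); the actual function used by tree-drifting is the one given in Goloboff [10].

    import math
    import random

    def drift_accept(delta_length, rfd, temperature=2.0):
        """Decide whether to accept a rearrangement during a drift phase.

        delta_length: extra steps of the candidate tree (<= 0 means equal or better).
        rfd: relative fit difference between the two trees, assumed scaled to [0, 1].
        Equal or better trees are always accepted; worse trees are accepted with a
        probability that decreases with both quantities.
        """
        if delta_length <= 0:
            return True
        return random.random() < math.exp(-(delta_length + 3.0 * rfd) / temperature)

    # Example: a tree one step longer with little character conflict is accepted
    # fairly often; a much longer, strongly contradicted tree almost never is.
    random.seed(0)
    print(sum(drift_accept(1, 0.1) for _ in range(1000)) / 1000)
    print(sum(drift_accept(5, 0.9) for _ in range(1000)) / 1000)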

4.5 Combined methods

The methods described above can be combined. The best results have been obtained when RAS+TBR is first followed by sectorial searches, then some drifting or ratcheting, and the resulting trees are fused. Repeating this procedure will sometimes find minimum length much more quickly than at other times. If the procedure uses (say) ten initial replications, on occasion the first four or five replications will find a shortest tree, the rest of the time effectively being wasted - at least as far as hitting minimum length is concerned. On other occasions, the ten replications will not be enough to find minimum length, but then there is no point in starting from scratch with another ten replications: just adding a few more, and tree-fusing those new replications with the previous ten, may do the job. The most efficient results, unsurprisingly, are therefore obtained when the methods described above are combined, and the parameters for the entire search are supervised and changed at run time, as in the driver sketched below. At each point, the number of initial replications is changed according to how many replications had to be used in previous hits to minimum length; if fewer replications were needed, the number is decreased, and vice versa. Goloboff [10] suggested that it would also be beneficial to change the number of sectorial searches, and the number of drift cycles, to be done within each replication (although this has not actually been implemented so far).

The process just described also makes it likely, in the end, that the best results obtained correspond to the actual minimum length. Each hit to minimum length will use as many initial replications as necessary to reproduce the previously found best length; if the length used so far as a bound is in fact not optimal, shorter trees will eventually be found. After every few hits to minimum length, the results from all previous replications can be submitted to tree-fusing. If the trees from several independent hits to some length do not produce shorter trees when subjected to fusing, it is likely that that length indeed represents the minimum possible (and thus tree-fusing provides an additional criterion, beyond mere convergence, to determine whether the actual minimum length has been found in a particular case). Alternatively, the search parameters can be made very aggressive (i.e., many replications, with lots of drifting and fusing, etc.) at first, to make sure that one has the actual minimum length, and subsequently they can be switched to a more effort-saving strategy when it comes to determining the consensus tree for the data set being analyzed.
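
The following sketch shows one way such a self-adjusting driver could be organized. The helpers new_replication and fuse are assumed to exist (they stand in for a RAS+TBR/sectorial/drift round and for tree-fusing, respectively); the toy demo at the bottom replaces them with random stand-ins just to make the control flow runnable. None of this reproduces TNT's actual parameter-adjustment rules.

    import random

    def driver(new_replication, fuse, reps=5, target_hits=10):
        """Repeat rounds of independent replications, fuse each round's trees, and
        adapt the number of replications to how easily minimum length is being hit."""
        best_len, best_trees, hits = float("inf"), [], 0
        while hits < target_hits:
            pool = [new_replication() for _ in range(reps)]    # independent starts
            tree, length = fuse(pool)                          # fuse this round's trees
            if length < best_len:        # shorter than the old bound: reset the count
                best_len, best_trees, hits = length, [tree], 1
                reps += 1                # the old bound was beatable: search harder
            elif length == best_len:     # another independent hit to the best length
                best_trees.append(tree)
                hits += 1
                reps = max(2, reps - 1)  # hits are coming easily: spend less per round
            else:
                reps += 1                # missed the bound: use more replications
        return best_len, best_trees

    # Toy demo with random stand-ins (no real trees): lengths near a "true" minimum.
    random.seed(3)
    def new_replication():
        return ("tree", 100 + random.choice([0, 0, 1, 2]))
    def fuse(pool):
        return min(pool, key=lambda tl: tl[1])
    print(driver(new_replication, fuse))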

4.6 Minimum length: multiple trees or multiple hits?

The approach to parsimony analysis for many years has been to try to actually find each and every possible most parsimonious tree for the data. Finding all possible most parsimonious trees for large data sets can be a difficult task (since there can be millions of them). More importantly, for the purpose of taxonomic studies there is absolutely no point in doing so. Since the trees found are to be used to create a (strict) consensus tree, it is much less wasteful to simply gather the minimum number of trees necessary to produce the same consensus that would be produced by all possible most parsimonious trees. In this sense, it is more fruitful to find additional trees of minimum length by producing new, independent hits to minimum length than to find trees from the same hit by doing TBR saving multiple trees. Doing TBR saving multiple trees will produce, by necessity, trees which are in the same local optimum or island, differing by few rearrangements, while the trees from new hits to minimum length could potentially be more different - possibly belonging to different islands. The consensus from a few trees from independent hits to minimum length is likely to be the same as the consensus from every possible most parsimonious tree, especially when the trees are collapsed more stringently. The trees can be collapsed by applying the TBR algorithm, not to find multiple trees, but rather to collapse all the nodes between source and destination node when a rearrangement produces a tree of the same length as the tree being swapped. This produces the same results as saving large numbers of trees would, but more quickly and using less RAM. This is one of the main ideas in Farris et al.'s [15] paper, further explored in Goloboff and Farris [14], and the current implementation of the methods described here exploits it.

As minimum length is successively hit, the consensus for the results obtained so far can be calculated. The consensus will become less and less resolved with additional hits to minimum length, up to a point where it becomes stable. Once additional hits to minimum length do not further de-resolve the consensus, the search can be stopped, and the consensus is likely to correspond to the consensus that would be obtained if each and every most parsimonious tree were used. If the user wants more confidence that the actual consensus has been obtained, once the consensus becomes stable it is possible to restart calculating a consensus from the new (subsequent) hits to minimum length, until it becomes stable again; the grand consensus of both consensuses is less likely to contain spurious groups (i.e., actually unsupported groups, present in some most parsimonious trees but not in all of them). For Chase et al.'s data set, when the consensus is calculated every three hits to minimum length, until stability is achieved twice, the analysis takes (on a 266 MHz Pentium II) an average time of only 4 hours (minimum length being hit 20 to 40 times). The exact consensus is obtained 80% of the time, and the 20% of cases where the consensus is not exact exhibit only one or two spurious nodes. The consensus could be made more reliable by re-calculating it until stability is reached more times, and by re-calculating it less frequently (e.g., every five hits to minimum length instead of three). This is in stark contrast with a search like Rice et al.'s [16] analysis, based on the trees (found in 3.5 months of analysis) from a single replication, which produced 46 spurious nodes.
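
The stopping rule based on consensus stability can be sketched as follows, with each tree represented simply by its set of clades. The helper next_hit() is an assumed callable that produces one more independent hit to minimum length, and the "island" data in the demo are hypothetical; the grand-consensus refinement described above is not shown.

    import random

    def strict_consensus(trees):
        """Strict consensus, with each tree represented by its set of clades."""
        return set.intersection(*[set(t) for t in trees])

    def stable_consensus(next_hit, check_every=3, stabilities=2):
        """Stop once adding `check_every` more independent hits leaves the strict
        consensus unchanged `stabilities` times in a row."""
        trees = [next_hit() for _ in range(check_every)]
        consensus, stable = strict_consensus(trees), 0
        while stable < stabilities:
            trees += [next_hit() for _ in range(check_every)]
            new = strict_consensus(trees)
            stable = stable + 1 if new == consensus else 0
            consensus = new
        return consensus

    # Toy demo: minimum-length "trees" drawn from three islands (hypothetical clades).
    random.seed(4)
    islands = [
        {frozenset("AB"), frozenset("ABC"), frozenset("DE")},
        {frozenset("AB"), frozenset("ABD"), frozenset("DE")},
        {frozenset("AB"), frozenset("ABC"), frozenset("CE")},
    ]
    print(stable_consensus(lambda: random.choice(islands)))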

5 TNT: Implementation of the New Methods

The techniques described here have been implemented in "Tree analysis using New Technology" (TNT), a new program by P. Goloboff, J. Farris, and K. Nixon [17]. The program is still a prototype, but demonstration versions are available. The program has a full Windows interface (although command-driven versions for other operating systems are anticipated). The input format is as for Hennig86 and NONA (see Siddall, this volume). The program allows the user to change the parameters of the search, either by hand or by letting the program try to identify the best parameters for a given size of data set and degree of exhaustiveness.

In general, a few recommendations can be made. Data sets with fewer than 100 taxa will be difficult to analyze only when extremely incongruent. In those cases, the methods of tree-fusing and sectorial searches perform more poorly (these methods assume that some isolated sectors in the tree can indeed be identified, but this is unlikely to be the case for such data sets). Therefore, smaller data sets are best analyzed by means of extensive ratcheting and/or tree-drifting, reducing tree-fusing and sectorial searches to a minimum. Larger data sets can be analyzed with essentially only sectorial searches plus tree-fusing if they are rather clean. However, as data sets become more difficult, it is necessary to increase not only the number of initial replications, but also the exhaustiveness of each replication. This is best done by selecting (at some point in each of the initial replications) sectors of larger size and analyzing them with tree-drifting instead of simply RAS+TBR (this is the "DSS" option in the sectorial-search dialogue of TNT). Larger sectors are more likely to identify areas of conflict, and it is less likely that better solutions will be missed because they would require that some taxon be moved outside the sector being analyzed. After a certain number of sectors have been analyzed with tree-drifting, several cycles of global tree-drifting further improve the trees, before submitting them to tree-fusing.

The tree-drifting can be done faster if some nodes are constrained during the search (the constraint is created from a consensus of the previous tree and the tree resulting from the perturbed round of TBR; see [10]). This might conceivably decrease the effectiveness of the drift, but that can be countered by doing an unconstrained cycle of drift with some periodicity, and since it means more cycles of drift per unit time, in the end it means an increase in effectiveness. The "hard cycles" option in the "Drift" dialogue box of TNT sets the number of hard (constrained) drift cycles to do before an unconstrained cycle is done. If large numbers of drift cycles are to be done, it is advisable to set the hard cycles so that a large portion of the drift cycles are constrained (e.g., eight or nine out of ten). For difficult data sets, making the searches more exhaustive will take more time per replication, but in the end will mean that minimum length can be found much more quickly.

The number of hits to re-check for consensus stability, and the number of times the consensus should reach stability, are changed from the main dialogue box of the "New Technology Search". As discussed above, these determine the reliability of the consensus tree obtained, with larger numbers meaning more reliable results. If the user so prefers, he may simply decide to hit minimum length a certain number of times and then let the program stop.

6 Remarks and Conclusions

New methods for analysis of large data sets perform at speeds that were unimaginable only a few years ago. Parsimony problems of a few hundred taxa had been considered "intractable" by many authors, but they can now be easily analyzed. No doubt the enormous progress made in the last few years in this area has been facilitated by the fact that people have recently started publishing and openly discussing new algorithms and ideas. Although at present it is difficult to predict whether the currently used methods will be further improved, the possibility certainly exists: the field of computational cladistics is still an area of active discussion and ferment.

Acknowledgments

The author wishes to thank Martin Ramirez and Gonzalo Giribet for comments and help during the preparation of the manuscript. Part of the research was carried out with the deeply appreciated support from PICT (Agencia Nacional de Promoción Científica y Tecnológica) and from PEI 0324/97 (CONICET).

References

[1] Farris JS (1988) Hennig86, v. 1.5, program and documentation. Port Jefferson, NY
[2] Swofford DL (1993) PAUP: Phylogenetic analysis using parsimony, program and documentation. Illinois
[3] Goloboff PA (1994b) NONA, program and documentation. Available at ftp.unt.edu.ar/pub/parsimony
[4] Wheeler WC (1996) Optimization alignment: the end of multiple sequence alignment in phylogenetics? Cladistics 12: 1-9
[5] Swofford D, Olsen G (1990) Phylogeny reconstruction. In: D Hillis and C Moritz (eds): Molecular Systematics
[6] Goloboff PA (1994a) Character optimization and calculation of tree lengths. Cladistics 9
[7] Goloboff PA (1996) Methods for faster parsimony analysis. Cladistics 12
[8] Moilanen A (1999) Searching for most parsimonious trees with simulated evolutionary optimization. Cladistics 15
[9] Maddison D (1991) The discovery and importance of multiple islands of most parsimonious trees. Syst Zool 40
[10] Goloboff PA (1999) Analyzing large data sets in reasonable times: solutions for composite optima. Cladistics 15: 415-428
[11] Nixon KC (1999) The Parsimony Ratchet, a new method for rapid parsimony analysis. Cladistics 15
[12] Horovitz I (1999) A report on "One Day Symposium on Numerical Cladistics". Cladistics 15
[13] Chase MW, Soltis DE, Olmstead RG, Morgan D et al. (1993) Phylogenetics of seed plants: An analysis of nucleotide sequences from the plastid gene rbcL. Ann Mo Bot Gard 80
[14] Goloboff PA, Farris JS (2001) Methods for quick consensus estimation. Cladistics 17: S26-S34
[15] Farris JS, Albert VA, Källersjö M, Lipscomb D et al. (1996) Parsimony jackknifing outperforms neighbor-joining. Cladistics 12
[16] Rice KA, Donoghue MJ, Olmstead RG (1997) Analyzing large data sets: rbcL 500 revisited. Syst Biol 46
[17] Goloboff PA, Farris JS, Nixon KC (1999) T.N.T.: Tree analysis using New Technology. Program and documentation


More information

Anomaly Detection in Predictive Maintenance

Anomaly Detection in Predictive Maintenance Anomaly Detection in Predictive Maintenance Anomaly Detection with Time Series Analysis Phil Winters Iris Adae Rosaria Silipo Phil.Winters@knime.com Iris.Adae@uni-konstanz.de Rosaria.Silipo@knime.com Copyright

More information

What mathematical optimization can, and cannot, do for biologists. Steven Kelk Department of Knowledge Engineering (DKE) Maastricht University, NL

What mathematical optimization can, and cannot, do for biologists. Steven Kelk Department of Knowledge Engineering (DKE) Maastricht University, NL What mathematical optimization can, and cannot, do for biologists Steven Kelk Department of Knowledge Engineering (DKE) Maastricht University, NL Introduction There is no shortage of literature about the

More information

An approach of detecting structure emergence of regional complex network of entrepreneurs: simulation experiment of college student start-ups

An approach of detecting structure emergence of regional complex network of entrepreneurs: simulation experiment of college student start-ups An approach of detecting structure emergence of regional complex network of entrepreneurs: simulation experiment of college student start-ups Abstract Yan Shen 1, Bao Wu 2* 3 1 Hangzhou Normal University,

More information

R-trees. R-Trees: A Dynamic Index Structure For Spatial Searching. R-Tree. Invariants

R-trees. R-Trees: A Dynamic Index Structure For Spatial Searching. R-Tree. Invariants R-Trees: A Dynamic Index Structure For Spatial Searching A. Guttman R-trees Generalization of B+-trees to higher dimensions Disk-based index structure Occupancy guarantee Multiple search paths Insertions

More information

Resource Allocation Schemes for Gang Scheduling

Resource Allocation Schemes for Gang Scheduling Resource Allocation Schemes for Gang Scheduling B. B. Zhou School of Computing and Mathematics Deakin University Geelong, VIC 327, Australia D. Walsh R. P. Brent Department of Computer Science Australian

More information

8. KNOWLEDGE BASED SYSTEMS IN MANUFACTURING SIMULATION

8. KNOWLEDGE BASED SYSTEMS IN MANUFACTURING SIMULATION - 1-8. KNOWLEDGE BASED SYSTEMS IN MANUFACTURING SIMULATION 8.1 Introduction 8.1.1 Summary introduction The first part of this section gives a brief overview of some of the different uses of expert systems

More information

Snapshots in the Data Warehouse BY W. H. Inmon

Snapshots in the Data Warehouse BY W. H. Inmon Snapshots in the Data Warehouse BY W. H. Inmon There are three types of modes that a data warehouse is loaded in: loads from archival data loads of data from existing systems loads of data into the warehouse

More information

In the IEEE Standard Glossary of Software Engineering Terminology the Software Life Cycle is:

In the IEEE Standard Glossary of Software Engineering Terminology the Software Life Cycle is: In the IEEE Standard Glossary of Software Engineering Terminology the Software Life Cycle is: The period of time that starts when a software product is conceived and ends when the product is no longer

More information

A Review of Data Mining Techniques

A Review of Data Mining Techniques Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 4, April 2014,

More information

Evaluation of a New Method for Measuring the Internet Degree Distribution: Simulation Results

Evaluation of a New Method for Measuring the Internet Degree Distribution: Simulation Results Evaluation of a New Method for Measuring the Internet Distribution: Simulation Results Christophe Crespelle and Fabien Tarissan LIP6 CNRS and Université Pierre et Marie Curie Paris 6 4 avenue du président

More information