Reconstructing phylogenies: how? how well? why?

Transcription

1 Joe Felsenstein Department of Genome Sciences and Department of Biology University of Washington, Seattle p.1/29

2 A review that asks these questions What are some of the strengths and weaknesses of different ways of reconstructing evolutionary trees (phylogenies)? p.2/29

3 A review that asks these questions What are some of the strengths and weaknesses of different ways of reconstructing evolutionary trees (phylogenies)? How can we find out how accurate we may have been in reconstructing the phylogeny? p.2/29

4 A review that asks these questions What are some of the strengths and weaknesses of different ways of reconstructing evolutionary trees (phylogenies)? How can we find out how accurate we may have been in reconstructing the phylogeny? Why do we want to reconstruct it? What are phylogenies used for? p.2/29

5 What does tree space (with branch lengths) look like? an example: three species with a clock A B C trifurcation t 1 t 2 t 1 not possible OK etc. t 2 when we consider all three possible topologies, the space looks like: t 1 t 2 p.3/29

6 For one tree topology The space of trees varying all 2n 3 branch lengths, each a nonegative number, defines an orthant" (open corner) of a 2n 3-dimensional real space: B A v 1 F v 6 v 2 3 C wall 8v 9 v v v 7 v 5 v 4 D floor wall v 9 E p.4/29

7 Through the looking-glass Shrinking one of the n 1 interior branches to 0, we arrive at a trifurcation: B A v 1 F v 6 v 2 3 v v v 8 7 v 9 v 5 v 4 C D A v 1 F v 6 v 7 B v 2 v3 v C 8 v 4 D v 9 v 5 E A v 1 F v 6 B v 2 3 v v v 8 7 E v 5 v 4 D C A v 1 F v 6 v 7 v 9 B v 2 v3 v8 v v 4 5 C E E Here, as we pass through the looking glass" we are also touch the space for two other tree topologies, and we could decide to enter either. D p.5/29

8 The graph of all trees of 5 species The space of all these orthants, one for each topology, connecting ones that share faces (looking glasses): A C D B E A D B E C B C E D A A E B C D B A C D E B D C A E A B C D E A B C E D A B D C E A D B C E A C B D E C A B D E A E B D C A E C B D D A B E C The Schoenberg graph (all 15 trees of size 5 connected by NNI s) p.6/29

9 There are very large numbers of trees For 21 species, the number of possible unrooted tree topologies exceeds Avogadro s Number: it is = 8, 200, 794, 532, 637, 891, 559, and that s not even asking about how hard it is to optimize the 39 branch lengths for each of these trees. What this goes with is that most methods of finding the best tree are NP-hard, and not easy to approximate either. p.7/29

10 Parsimony methods Alpha Delta Gamma Beta Epsilon Sites Species Alpha T A G C A T Beta C A A G C T Gamma T C G G C T Delta T C G C A A Epsilon C A A C A T p.8/29

11 Advantages and disadvantages of parsimony methods Disadvantage: not model-based so people think it makes no assumptions. p.9/29

12 Advantages and disadvantages of parsimony methods Disadvantage: not model-based so people think it makes no assumptions. Advantage: reasonably fast, no search of branch lengths needed and quick to compute the criterion. p.9/29

13 Advantages and disadvantages of parsimony methods Disadvantage: not model-based so people think it makes no assumptions. Advantage: reasonably fast, no search of branch lengths needed and quick to compute the criterion. Advantage: good statistical properties when amounts of change are small. p.9/29

14 Advantages and disadvantages of parsimony methods Disadvantage: not model-based so people think it makes no assumptions. Advantage: reasonably fast, no search of branch lengths needed and quick to compute the criterion. Advantage: good statistical properties when amounts of change are small. Disadvantage: statistical misbehavior (inconsistency) when some nearby branches on the tree are long (Long Branch Attraction). p.9/29

15 Advantages and disadvantages of parsimony methods Disadvantage: not model-based so people think it makes no assumptions. Advantage: reasonably fast, no search of branch lengths needed and quick to compute the criterion. Advantage: good statistical properties when amounts of change are small. Disadvantage: statistical misbehavior (inconsistency) when some nearby branches on the tree are long (Long Branch Attraction). Disadvantage: likely to make you think you have William of Ockham s endorsement. p.9/29

16 Advantages and disadvantages of parsimony methods Disadvantage: not model-based so people think it makes no assumptions. Advantage: reasonably fast, no search of branch lengths needed and quick to compute the criterion. Advantage: good statistical properties when amounts of change are small. Disadvantage: statistical misbehavior (inconsistency) when some nearby branches on the tree are long (Long Branch Attraction). Disadvantage: likely to make you think you have William of Ockham s endorsement. Disadvantage: may lead to the delusion that you know exactly what happened in evolution, in detail. p.9/29

17 The sequences: Distance matrix methods yield distances: A B C D E A B C D E CCTAACCTCTGACCC... CGTAACCTCCGGCCC... CGTAACCTCTGGCCC... CGCAACCTCTGGCTC... CCTAACCTCTGGCCC... A B C D alter tree until predictions match observed distances as closely as possible E A suggested tree: A C D B E predicts: A B C D E compare: A B C D E p.10/29

18 Advantages and disadvantages of distance methods Advantage: model-based so assumptions are clearer. p.11/29

19 Advantages and disadvantages of distance methods Advantage: model-based so assumptions are clearer. Advantage: it s geometry so mathematical scientists love it. p.11/29

20 Advantages and disadvantages of distance methods Advantage: model-based so assumptions are clearer. Advantage: it s geometry so mathematical scientists love it. Advantage: often fast (especially Neighbor-Joining method), can handle large numbers of sequences. p.11/29

21 Advantages and disadvantages of distance methods Advantage: model-based so assumptions are clearer. Advantage: it s geometry so mathematical scientists love it. Advantage: often fast (especially Neighbor-Joining method), can handle large numbers of sequences. Disadvantage: not using data fully statistically efficiently. p.11/29

22 Advantages and disadvantages of distance methods Advantage: model-based so assumptions are clearer. Advantage: it s geometry so mathematical scientists love it. Advantage: often fast (especially Neighbor-Joining method), can handle large numbers of sequences. Disadvantage: not using data fully statistically efficiently. Advantage: when tested by simulation, found to be surprisingly efficient anyway. p.11/29

23 Advantages and disadvantages of distance methods Advantage: model-based so assumptions are clearer. Advantage: it s geometry so mathematical scientists love it. Advantage: often fast (especially Neighbor-Joining method), can handle large numbers of sequences. Disadvantage: not using data fully statistically efficiently. Advantage: when tested by simulation, found to be surprisingly efficient anyway. Disadvantage: cannot easily propagate some information about local features in the sequences from one distance calculation to another. p.11/29

24 Advantages and disadvantages of distance methods Advantage: model-based so assumptions are clearer. Advantage: it s geometry so mathematical scientists love it. Advantage: often fast (especially Neighbor-Joining method), can handle large numbers of sequences. Disadvantage: not using data fully statistically efficiently. Advantage: when tested by simulation, found to be surprisingly efficient anyway. Disadvantage: cannot easily propagate some information about local features in the sequences from one distance calculation to another. Disadvantage: it s geometry so mathematical scientists hang onto it beyond the point of reason. p.11/29

25 Maximum likelihood A C C C G t i are t 1 "branch lengths", x t2 t 7 t 3 z t 4 t 5 y t 6 (rate X time) w t 8 To compute the likelihood for one site, sum over all possible states (bases) at interior nodes: L (i) = x y z w Prob (w) Prob (x w, t 7 ) Prob (A x, t 1 ) Prob (C x, t 2 ) Prob (z w, t 8 ) Prob (C z, t 3 ) Prob (y z, t 6 ) Prob (C y, t 4 ) Prob (G y, t 5 ) p.12/29

26 Advantages and disadvantages of likelihood Advantage: uses a model, so assumptions are clear. p.13/29

27 Advantages and disadvantages of likelihood Advantage: uses a model, so assumptions are clear. Advantage: fully statistically efficient. p.13/29

28 Advantages and disadvantages of likelihood Advantage: uses a model, so assumptions are clear. Advantage: fully statistically efficient. Disadvantage: computationally slower. p.13/29

29 Advantages and disadvantages of likelihood Advantage: uses a model, so assumptions are clear. Advantage: fully statistically efficient. Disadvantage: computationally slower. Advantage: statistical testing by likelihood ratio tests available p.13/29

30 Advantages and disadvantages of likelihood Advantage: uses a model, so assumptions are clear. Advantage: fully statistically efficient. Disadvantage: computationally slower. Advantage: statistical testing by likelihood ratio tests available Disadvantage: can t use the LRT test to test tree topologies. p.13/29

31 Bayesian inference methods Basically uses the likelihood machinery, and adds priors on parameters and on trees. Implemented by Markov chain Monte Carlo methods to sample from the posterior on trees (or parameters, or both). Very popular right now. Advantage: interpretation is straightforward, once the assumptions are met. p.14/29

32 Bayesian inference methods Basically uses the likelihood machinery, and adds priors on parameters and on trees. Implemented by Markov chain Monte Carlo methods to sample from the posterior on trees (or parameters, or both). Very popular right now. Advantage: interpretation is straightforward, once the assumptions are met. Advantage: gives you what you want, the probability of the result. p.14/29

33 Bayesian inference methods Basically uses the likelihood machinery, and adds priors on parameters and on trees. Implemented by Markov chain Monte Carlo methods to sample from the posterior on trees (or parameters, or both). Very popular right now. Advantage: interpretation is straightforward, once the assumptions are met. Advantage: gives you what you want, the probability of the result. Disadvantage: how long is long enough to run the MCMC? p.14/29

34 Bayesian inference methods Basically uses the likelihood machinery, and adds priors on parameters and on trees. Implemented by Markov chain Monte Carlo methods to sample from the posterior on trees (or parameters, or both). Very popular right now. Advantage: interpretation is straightforward, once the assumptions are met. Advantage: gives you what you want, the probability of the result. Disadvantage: how long is long enough to run the MCMC? Disadvantage: where do we get priors from, what effect do they have? p.14/29

35 Bayesian inference methods Basically uses the likelihood machinery, and adds priors on parameters and on trees. Implemented by Markov chain Monte Carlo methods to sample from the posterior on trees (or parameters, or both). Very popular right now. Advantage: interpretation is straightforward, once the assumptions are met. Advantage: gives you what you want, the probability of the result. Disadvantage: how long is long enough to run the MCMC? Disadvantage: where do we get priors from, what effect do they have? Disadvantage: they keep chanting in unison We are the statisticians of Bayes you will be assimilated. p.14/29

36 Aren t these graphical models? x x x x x x x x x x x x x x x v 1 x x x x x v 5 v 6 x x x x x v v 3 x x x x x x x x x x v 8 v 7 x x x x x v 9 x x x x x (You have to imaging it going back 500 layers or so). The problem is to use the data, which is at the tips but not available for the interior nodes, to infer the topology and branch lengths of the tree that is shared by all sites. p.15/29

37 Could we use graphical model machinery here? Like Moliére s character who is delighted to discover that he s been speaking prose all his life, we found we had already been using the relevant Graphical Model machinery since about So alas there was nothing to gain. The same thing is true for statistical genetics, where the graphical model machinery reinvents the standard peeling algorithms for computing likelihoods on pedigrees, in use since p.16/29

38 Bootstrap sampling of phylogenies Original Data sites sequences p.17/29

39 Draw columns randomly with replacement Original Data sites sequences Bootstrap sample #1 sites Estimate of the tree sequences sample same number of sites, with replacement p.18/29

40 Make a tree from that resampled data set Original Data sites sequences Bootstrap sample #1 sites Estimate of the tree sequences sample same number of sites, with replacement Bootstrap estimate of the tree, #1 p.19/29

41 Draw another bootstrap sample Original Data sites sequences Bootstrap sample #1 sites Estimate of the tree sequences sample same number of sites, with replacement Bootstrap sample #2 sequences sites sample same number of sites, with replacement Bootstrap estimate of the tree, #1 (and so on) p.20/29

42 ... and get a tree for it too. And so on. Original Data sites sequences Bootstrap sample #1 sites Estimate of the tree sequences sample same number of sites, with replacement Bootstrap sample #2 sequences sites sample same number of sites, with replacement Bootstrap estimate of the tree, #1 (and so on) Bootstrap estimate of the tree, #2 p.21/29

43 Summarizing the cloud of trees by support for branches Bovine Mouse Squir Monk Chimp Human Gorilla Orang Gibbon Rhesus Mac Jpn Macaq Crab E.Mac BarbMacaq Tarsier Lemur p.22/29

44 Some alternatives to bootstrapping Parametric bootstrapping same, but simulate data sets from our best estimate of the tree instead of sampling sites. p.23/29

45 Some alternatives to bootstrapping Parametric bootstrapping same, but simulate data sets from our best estimate of the tree instead of sampling sites. Bayesian inference of course gets statistical support information from the posterior. p.23/29

46 Some alternatives to bootstrapping Parametric bootstrapping same, but simulate data sets from our best estimate of the tree instead of sampling sites. Bayesian inference of course gets statistical support information from the posterior. The Kishino-Hasegawa-Templeton test (KHT test) which compares prespecified trees to each other by paired sites tests. p.23/29

47 Why want to know the tree? It affects all parts of the genomes it is the essential part of propagating information about the evolution of one part of the genome to inquiries about another part. The standard method for finding functional regions of the genome is now using PhyloHMMs which use Hidden Markov Model machinery together with phylogenies to find regions that have unusually low rates of evolution. p.24/29

48 Another kind of tree: the coalescent Coalescent trees are trees of ancestry of copies of a single gene locus within a species. They are weakly inferrable as most have only a few sites (SNPs) varying among individuals. Since each coalescent tree applies to a very short region of genome, maybe as little as one gene, there is less interest in the tree. But they do illuminate the values of parameters such as population size, migration rates, recombination rates etc. This allows us to accumulate information across different loci (genes). To don this we have to sum over our uncertainty about the tree by using MCMC methods, accumulating the information (as log likelihood or using Bayesian machinery) to make inferences about the parameters. This is the interface between within-species population genetics and between-species work on phylogenies. It is also the statistical foundation of inferences from mitochondrial genealogies ( mitochondrial Eve ) and Y chromosome genealogies, and of the samples from the rest of the genome that are now being added to this. p.25/29

49 A coalescent Time p.26/29

50 Yet another kind of tree: trees of gene families Gene duplications in evolution create new genes. Both the new gene and the original one then evolve. Frog Human Monkey Squirrel a b a b a b species boundary gene duplication tree of genes Some forks are gene duplications, leading to subtrees that are all supposed to have the same phylogeny as they are in the same set of species. Example: Hemoglobin proteins. p.27/29

51 Yet another kind of tree: trees of gene families Gene duplications in evolution create new genes. Both the new gene and the original one then evolve. Frog Human Monkey Squirrel a b a b a b Frog Human Monkey Squirrel Human Monkey Squirrel a a a b b b species boundary gene duplication tree of genes These two trees should be identical Some forks are gene duplications, leading to subtrees that are all supposed to have the same phylogeny as they are in the same set of species. Example: Hemoglobin proteins. p.28/29

52 References Felsenstein, J Inferring Phylogenies. Sinauer Associates, Sunderland, Massachusetts. [Book you and all your friends must rush out and buy] Semple, C. and M. Steel Phylogenetics. Oxford Lecture Series in Mathematics and Its Applications, 24. Oxford University Press. [More rigorous mathematical treatment] Yang, Z Computational Molecular Evolution. Oxford Series in Ecology and Evolution. Oxford University Press, Oxford. [Careful survey of molecular phylogeny methods, from a leader] For a list of 348 phylogeny programs, many available free, see p.29/29