Bayesian Phylogeny and Measures of Branch Support

<carolin.kosiol@vetmeduni.ac.at>

Bayesian Statistics
Imagine we have a bag containing 100 dice, of which we know that 90 are fair and 10 are biased. The unfair dice are strongly biased: for a biased die, Pr(four) = 4/21 and Pr(six) = 6/21. Imagine that you take one die from the bag and throw it twice, obtaining a four and a six. The problem is: what kind of die did you roll?

Bayesian Statistics
The likelihood of the data (a four and a six) under each hypothesis is:
L_u = Pr(data | unbiased die) = 1/6 × 1/6 = 1/36
L_b = Pr(data | biased die) = 4/21 × 6/21 = 24/441
Bayesian inferences are based on the posterior probability of a hypothesis:
Pr(biased | data) = (0.1 × L_b) / (0.1 × L_b + 0.9 × L_u) ≈ 0.179
This means that our belief that the die is biased changed from 0.1 to 0.179 after observing a four and a six.
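The dice calculation can be checked with a few lines of Python. This is only a sketch: the priors (90 fair, 10 biased) and the two likelihoods (1/36 and 24/441) are taken from the slides, while the function name `posterior` and the variable names are chosen here for illustration.

```python
from fractions import Fraction

# Priors from the slide: 90 of the 100 dice are fair, 10 are biased.
prior_fair = Fraction(90, 100)
prior_biased = Fraction(10, 100)

# Likelihoods of observing a four and a six, as given on the slide.
like_fair = Fraction(1, 6) * Fraction(1, 6)      # 1/36
like_biased = Fraction(4, 21) * Fraction(6, 21)  # 24/441

def posterior(prior_h, like_h, prior_alt, like_alt):
    """Posterior probability of hypothesis h against a single alternative."""
    evidence = prior_h * like_h + prior_alt * like_alt
    return prior_h * like_h / evidence

post_biased = posterior(prior_biased, like_biased, prior_fair, like_fair)
print(post_biased, float(post_biased))
```

Using exact fractions avoids rounding questions: the posterior comes out as 32/179, which rounds to the 0.179 quoted on the slide.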

Bayes Theorem

Bayes' Theorem
Bayesian analysis depends on good priors (both a weakness and a strength of the method).
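For reference, the theorem in its standard form can be written as below. The symbols are assumptions of this note rather than notation from the slides: H stands for a hypothesis (for example a tree), D for the observed data, and the sum runs over all competing hypotheses H_i.

```latex
\Pr(H \mid D) = \frac{\Pr(D \mid H)\,\Pr(H)}{\Pr(D)},
\qquad
\Pr(D) = \sum_i \Pr(D \mid H_i)\,\Pr(H_i)
```

Here Pr(H) is the prior, Pr(D | H) the likelihood, and Pr(H | D) the posterior probability.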

Likelihood
- The likelihood is the probability that a hypothesis would have generated the new observed data.
- Ignores pre-existing information.
Bayesian
- The Bayesian posterior probability is the probability that a hypothesis is true, given the new observed data AND existing knowledge.
- Considers pre-existing information (the prior).

How does this relate to phylogenetics?
Likelihood analysis (e.g. PhyML, RAxML)
- Best tree = maximum likelihood tree (ML tree)
- Pool of plausible trees obtained by bootstrapping
Bayesian analysis (e.g. MrBayes)
- Best tree = maximum posterior probability tree (MPP tree)
- Pool of plausible trees obtained by Markov chain Monte Carlo

Non-parametric bootstrap

Likelihood: (Nonparametric) Bootstrapping
- Used to generate the pool of plausible trees in ML
- Resamples CHARACTERS
- Majority-rule consensus tree
- A simple way of ascertaining clade support
- Bootstrap support of 70% or more is considered strong (rough rule of thumb)
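Resampling characters means drawing alignment columns with replacement until the pseudo-replicate is as long as the original alignment. The sketch below illustrates only that step; the toy alignment and the function name `bootstrap_alignment` are hypothetical, and the tree search that would be run on each replicate is not shown.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Return one pseudo-replicate by resampling alignment columns with replacement."""
    n_sites = len(next(iter(alignment.values())))
    # Draw n_sites column indices with replacement (the CHARACTERS are resampled).
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}

# Toy alignment (hypothetical data, for illustration only).
alignment = {
    "taxonA": "ACGTACGTAC",
    "taxonB": "ACGTTCGTAC",
    "taxonC": "ACGAACGTTC",
}

rng = random.Random(42)  # fixed seed so the sketch is reproducible
replicates = [bootstrap_alignment(alignment, rng) for _ in range(100)]
```

Each replicate keeps the taxa and the alignment length but changes which columns are present and how often. A tree is then estimated from every replicate, and the support for a clade is the fraction of those trees that contain it.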

Bayesian: Markov Chain Monte Carlo
- Used to generate the pool of plausible trees in Bayesian methods
- Resamples PARAMETERS (e.g. branch length, transition/transversion bias, base frequencies)

Bayesian Markov Chain Monte Carlo
- Burn-in: initially the likelihoods will increase rapidly (the first random tree will have a low likelihood, which can be improved with random moves).
- Stationarity: eventually the likelihoods will hit a plateau (once the sampled trees are very good, most changes will not lead to improved likelihoods and will be rejected).

Bayesian Markov Chain Monte Carlo
At stationarity, the MCMC method will sample trees in proportion to their posterior probability.
Out of this pool of trees, one SAMPLED tree topology will be most representative of the clades found in the whole sample: the maximum credibility tree.
Often, people instead compute a majority-rule consensus of all sampled trees; this is not the same thing. The distinction is analogous to getting the ML tree versus getting the bootstrap consensus.
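The difference between the two summaries can be made concrete with a small sketch. It assumes each sampled tree is represented as a set of clades, each clade being a frozenset of taxon names. The four sampled trees are hypothetical, and scoring a tree by the product of its clade supports is one common way to pick a maximum credibility tree, used here as an assumption rather than as the definition from the slides.

```python
import math
from collections import Counter

def clade_support(sampled_trees):
    """Fraction of sampled trees that contain each clade."""
    counts = Counter(clade for tree in sampled_trees for clade in tree)
    n = len(sampled_trees)
    return {clade: count / n for clade, count in counts.items()}

def majority_rule_clades(sampled_trees, threshold=0.5):
    """Clades found in more than `threshold` of the sampled trees."""
    return {c for c, s in clade_support(sampled_trees).items() if s > threshold}

def max_credibility_tree(sampled_trees):
    """The SAMPLED tree whose clades have the highest product of supports."""
    support = clade_support(sampled_trees)
    return max(sampled_trees, key=lambda t: math.prod(support[c] for c in t))

# Hypothetical sample of four trees over taxa A, B, C, D.
AB, CD, AC, BD = (frozenset(x) for x in ("AB", "CD", "AC", "BD"))
sampled_trees = [{AB, CD}, {AB, CD}, {AB, CD}, {AC, BD}]

support = clade_support(sampled_trees)
consensus = majority_rule_clades(sampled_trees)
mcc = max_credibility_tree(sampled_trees)
```

The consensus is assembled clade by clade and need not match any single sampled tree, whereas the maximum credibility tree is always one of the trees in the sample. The same clade-frequency calculation applies to a set of bootstrap trees.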

Bayesian: Markov Chain Monte Carlo
- Used to generate the pool of plausible trees in Bayesian methods
- Resamples PARAMETERS (e.g. branch length, transition/transversion bias, base frequencies)
- Markov chain: trees are sampled one after the other; the next tree is determined only by the current tree (not by earlier ones)
- Monte Carlo: the next tree is obtained by a random perturbation of parameters
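A toy Metropolis sampler shows both ingredients at work. This is a sketch under strong simplifying assumptions: the state space is just three labelled topologies with made-up unnormalised posterior weights, the proposal picks a state uniformly at random, and no branch lengths or model parameters are involved. Real programs such as MrBayes use far richer proposals.

```python
import random
from collections import Counter

# Hypothetical unnormalised posterior weights for three candidate topologies.
weights = {"tree1": 6.0, "tree2": 3.0, "tree3": 1.0}
states = list(weights)

def metropolis(n_steps, burn_in, rng):
    """Sample states with a Metropolis chain and discard the burn-in."""
    current = rng.choice(states)
    samples = []
    for step in range(n_steps):
        # Monte Carlo: propose a random move (uniform, hence symmetric, proposal).
        proposal = rng.choice(states)
        # Markov chain: acceptance depends only on the current state.
        if rng.random() < min(1.0, weights[proposal] / weights[current]):
            current = proposal
        if step >= burn_in:
            samples.append(current)
    return samples

samples = metropolis(n_steps=60_000, burn_in=10_000, rng=random.Random(1))
freqs = {s: c / len(samples) for s, c in Counter(samples).items()}
print(freqs)
```

After the burn-in, the sampled frequencies should approach the normalised weights (0.6, 0.3 and 0.1 in this toy setup), which is what sampling in proportion to posterior probability means.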

ML versus Bayesian
Likelihood analysis (e.g. PhyML, RAxML)
- Best tree = maximum likelihood tree (ML tree)
- Pool of plausible trees obtained by bootstrapping (perturbs CHARACTERS)
Bayesian analysis (e.g. MrBayes)
- Best tree = maximum posterior probability tree (MPP tree)
- Pool of plausible trees obtained by Markov chain Monte Carlo (perturbs PARAMETERS)

Discussion session

Process of Phylogenetic Estimation
- Sequence data, MSA
- Algorithm: neighbor joining, parsimony, ML, Bayesian
- Substitution model: HKY +, JTT, WAG+F, mtREV24
- Estimate of phylogeny

Sources of Systematic Error
Alignment: residues are included in the analysis that are not related by substitutions.
Countermeasures:
- Carefully examine and edit the MSA; remove regions from the analysis that are likely to be misaligned.

Sources of Systematic Error
Model: substitutions may occur very differently from those described by the model used in the phylogenetic analysis.
Countermeasures:
- Examine sequences for signs of such model mis-specification, e.g. check that the frequencies of residues are similar in all sequences.
- If possible, exclude sequences/residues that seem to violate the model.
- If not possible, interpret the resulting phylogeny critically.
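A quick composition check along these lines might look like the sketch below. It assumes DNA sequences held in a dictionary; the toy alignment and the function names are hypothetical. It only reports how far each sequence's base frequencies lie from the average, so it is a screening aid rather than a formal test of compositional homogeneity.

```python
from collections import Counter

def base_frequencies(seq, alphabet="ACGT"):
    """Relative frequency of each residue, ignoring gaps and ambiguity codes."""
    counts = Counter(ch for ch in seq.upper() if ch in alphabet)
    total = sum(counts.values())
    return {ch: (counts[ch] / total if total else 0.0) for ch in alphabet}

def max_deviation(alignment, alphabet="ACGT"):
    """Per sequence, the largest absolute gap between its frequencies and the mean."""
    freqs = {name: base_frequencies(seq, alphabet) for name, seq in alignment.items()}
    mean = {ch: sum(f[ch] for f in freqs.values()) / len(freqs) for ch in alphabet}
    return {
        name: max(abs(f[ch] - mean[ch]) for ch in alphabet)
        for name, f in freqs.items()
    }

# Toy alignment (hypothetical data): taxonC is strongly GC-rich.
alignment = {
    "taxonA": "ACGTACGTACGT",
    "taxonB": "ACGTACGAACGT",
    "taxonC": "GGGCGCGGCCGG",
}

deviation = max_deviation(alignment)
outlier = max(deviation, key=deviation.get)
print(outlier, deviation)
```

Sequences with a large deviation are candidates for closer inspection or exclusion; what counts as large is left to the analyst.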

Sources of Systematic Error
Algorithm: incorporates assumptions about sequence evolution that lead to model mis-specification, OR the algorithm fails (e.g. ML gets trapped in local maxima).
Countermeasures:
- Compare results of different algorithms; if they agree, it is less likely that specific algorithms have failed.
- Run algorithms using different starting conditions (e.g. different initial values for the parameters of the likelihood model).
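Why several starting points help can be seen on a one-dimensional toy surface. Everything in this sketch is an assumption made for illustration: `score` is a made-up function with a lower peak near x = -1 and a higher peak near x = 2, standing in for a likelihood surface, and the greedy `hill_climb` is a deliberately naive optimiser, not the search strategy of any particular ML program.

```python
import math

def score(x):
    """Hypothetical bumpy surface: local peak near x = -1, global peak near x = 2."""
    return math.exp(-(x + 1) ** 2) + 2 * math.exp(-(x - 2) ** 2)

def hill_climb(start, step=0.01, max_iter=10_000):
    """Greedy local search: move to a neighbour only if it improves the score."""
    x = start
    for _ in range(max_iter):
        best = max((x - step, x, x + step), key=score)
        if best == x:
            break  # no neighbour is better: a (possibly local) maximum
        x = best
    return x

# Different starting conditions; some lead to the local peak, some to the global one.
starts = [-3.0, -1.5, 0.0, 1.0, 3.0, 4.5]
optima = [hill_climb(s) for s in starts]
best = max(optima, key=score)
for s, o in zip(starts, optima):
    print(f"start {s:5.1f} -> optimum {o:6.2f} (score {score(o):.3f})")
```

Runs started to the left of the valley between the peaks stop at the lower peak, so a single run can report a suboptimal answer. Keeping the best result over several starts reduces that risk but does not remove it.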

Exam Questions:
- What is the difference between local and global alignment?
- What does the following dot plot depict? What are the differences between sequences A and B?
- Draw a dot plot which has an insertion in sequence A in comparison to sequence B.
- Please write down the following tree topology in NEWICK format.
- Please draw the tree that is given by the following NEWICK format.
- What is the difference between orthologs and paralogs?
- What is the difference between the following two DNA models: HKY and FEL?
- Why can codon models be used to detect selection?
- Are the HKY model and the JC model nested? If yes, how many degrees of freedom should be used for a likelihood ratio test?
- Describe the difference between bootstrap and Bayesian branch support values.
- Please name the steps in the hierarchical structure of de novo sequencing.