Bayesian Phylogeny and Measures of Branch Support

Bayesian Phylogeny and Measures of Branch Support <carolin.kosiol@vetmeduni.ac.at>

Bayesian Statistics
Imagine we have a bag containing 100 dice, of which we know that 90 are fair and 10 are biased. The unfair dice are strongly biased (for example, they show a four with probability 4/21 and a six with probability 6/21). Imagine that you take one die from the bag and throw it twice, obtaining a four and a six. The problem is: what kind of die did you roll?

Bayesian Statistics
The likelihood of each hypothesis is the probability of the observed throws (a four and a six) given that kind of die:
L_u = Pr(data | unbiased die) = 1/6 × 1/6 = 1/36
L_b = Pr(data | biased die) = 4/21 × 6/21 = 24/441
Bayesian inferences are based on the posterior probability of a hypothesis:
Pr(biased die | data) = (0.1 × L_b) / (0.1 × L_b + 0.9 × L_u) ≈ 0.179
This means that our opinion that the die is biased changed from 0.1 to 0.179 after observing a four and a six.
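The dice calculation can be checked numerically. The sketch below uses exact fractions; the priors 10/100 and 90/100 come from the composition of the bag, and the function and variable names are illustrative only.

```python
from fractions import Fraction

# Priors: composition of the bag (10 biased, 90 fair dice out of 100).
prior_biased = Fraction(10, 100)
prior_fair = Fraction(90, 100)

# Likelihoods of observing a four and then a six under each hypothesis.
lik_fair = Fraction(1, 6) * Fraction(1, 6)       # 1/36
lik_biased = Fraction(4, 21) * Fraction(6, 21)   # 24/441


def posterior(prior_h, lik_h, prior_other, lik_other):
    """Posterior probability of hypothesis h by Bayes' theorem (two hypotheses)."""
    evidence = prior_h * lik_h + prior_other * lik_other
    return prior_h * lik_h / evidence


post_biased = posterior(prior_biased, lik_biased, prior_fair, lik_fair)
print(post_biased, round(float(post_biased), 3))
```

The exact posterior is 32/179, which rounds to the 0.179 quoted on the slide.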

Bayes' Theorem
Bayesian analysis depends on good priors (both a weakness and a strength of the method).
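To see how strongly the answer depends on the prior, the dice example can be re-run with different prior probabilities for "biased". This is a minimal sketch: the likelihoods 24/441 and 1/36 are taken from the dice example, while the function name and the alternative priors 0.01 and 0.5 are assumptions chosen for illustration.

```python
def posterior_biased(prior_biased, lik_biased=24 / 441, lik_fair=1 / 36):
    """Posterior probability that the die is biased, for a given prior."""
    prior_fair = 1.0 - prior_biased
    evidence = prior_biased * lik_biased + prior_fair * lik_fair
    return prior_biased * lik_biased / evidence


# Same data (a four and a six), three different prior beliefs.
for prior in (0.01, 0.1, 0.5):
    print(f"prior = {prior:.2f} -> posterior = {posterior_biased(prior):.3f}")
```

With the prior of 0.1 from the bag, this reproduces the posterior of about 0.179. A more sceptical or more suspicious prior moves the posterior in the same direction, even though the data are unchanged.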

Likelihood
- Likelihood is the probability of the newly observed data, given that a hypothesis is true.
- Ignores pre-existing information.
Bayesian
- The Bayesian posterior probability is the probability that a hypothesis is true, given the newly observed data AND existing knowledge.
- Considers pre-existing information (the "prior").

How does this relate to phylogenetics?
Likelihood analysis (e.g. PHYML, RAxML)
- Best tree = maximum likelihood tree (ML tree)
- Pool of plausible trees obtained by bootstrapping
Bayesian analysis (e.g. MrBayes)
- Best tree = maximum posterior probability tree (MPP tree)
- Pool of plausible trees obtained by Markov chain Monte Carlo

Non-parametric bootstrap

Likelihood: (Nonparametric) Bootstrapping
- Used to generate the pool of plausible trees in ML
- Resamples CHARACTERS (alignment columns)
- Majority-rule consensus tree
- A simple way of ascertaining clade support
- Bootstrap support of 70% or more is considered strong (rough rule of thumb)
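Resampling characters means drawing alignment columns with replacement until the pseudo-replicate has as many columns as the original alignment. The sketch below shows one replicate. The three-taxon alignment and all names are made up for illustration, and the tree-building step that would follow each replicate is not shown.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap pseudo-replicate of an alignment.

    alignment: dict mapping taxon name -> aligned sequence (equal lengths).
    Columns are drawn with replacement; the same column indices are used
    for every taxon, so each resampled column stays intact.
    """
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {
        name: "".join(seq[i] for i in columns)
        for name, seq in alignment.items()
    }


# Hypothetical toy alignment (not from the lecture).
alignment = {
    "taxonA": "ACGTACGT",
    "taxonB": "ACGTTCGT",
    "taxonC": "ACGAACGA",
}
rng = random.Random(1)  # fixed seed so the replicate is reproducible
replicate = bootstrap_alignment(alignment, rng)
for name, seq in replicate.items():
    print(name, seq)
```

In a real analysis this is repeated many times, a tree is estimated from each replicate, and the support for a clade is the fraction of replicate trees that contain it.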

Bayesian: Markov Chain Monte Carlo
- Used to generate the pool of plausible trees in Bayesian methods
- Resamples PARAMETERS (e.g. branch lengths, transition/transversion bias, base frequencies)

Bayesian: Markov Chain Monte Carlo
- Burn-in: initially the likelihoods will increase rapidly (the first random tree will have a low likelihood, which can be improved with random moves).
- Stationarity: eventually the likelihoods will hit a plateau (once the sampled trees are very good, most changes will not lead to improved likelihoods and will be rejected).
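Burn-in and stationarity can be seen in a toy Metropolis sampler. The sketch below is NOT a tree sampler: it uses a single made-up continuous parameter with an assumed bell-shaped log-posterior peaking at 2, and starts the chain deliberately far away at 20. All names, the target, and the step size are assumptions for illustration.

```python
import math
import random


def log_posterior(x):
    # Hypothetical unnormalised log-posterior: peak at x = 2, width 0.5.
    return -((x - 2.0) ** 2) / (2 * 0.5 ** 2)


def metropolis(n_steps, start, step_size, rng):
    """Metropolis sampler with a symmetric uniform random-walk proposal."""
    x = start
    logp = log_posterior(x)
    samples, trace = [], []
    for _ in range(n_steps):
        proposal = x + rng.uniform(-step_size, step_size)
        logp_new = log_posterior(proposal)
        # Always accept improvements; accept worse states with
        # probability equal to the posterior ratio.
        if logp_new >= logp or rng.random() < math.exp(logp_new - logp):
            x, logp = proposal, logp_new
        samples.append(x)
        trace.append(logp)
    return samples, trace


samples, trace = metropolis(20000, start=20.0, step_size=1.0, rng=random.Random(0))

burn_in = 1000                  # assumed cut-off; in practice chosen from the trace
kept = samples[burn_in:]        # only post-burn-in samples are used for inference
early = sum(trace[:50]) / 50
late = sum(trace[burn_in:]) / len(trace[burn_in:])
print("mean log-posterior, first 50 steps:", early)
print("mean log-posterior, after burn-in:", late)
print("mean of retained samples:", sum(kept) / len(kept))
```

The log-posterior trace climbs steeply during the first steps and then fluctuates around a plateau. Samples from the climbing phase are discarded as burn-in.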

Bayesian: Markov Chain Monte Carlo
- At stationarity, the MCMC method will sample trees in proportion to their posterior probability.
- Out of this pool of trees, one SAMPLED tree topology will be most representative of the clades found in the whole sample: the maximum credibility tree.
- Often, people instead compute a majority-rule consensus of all sampled trees; this is not the same thing.
- The distinction is analogous to getting the ML tree versus getting the bootstrap consensus.
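The difference between a maximum credibility tree and a majority-rule consensus can be illustrated on a made-up sample of trees. In this sketch each rooted tree is represented only by its set of non-trivial clades. The credibility score (product of clade frequencies) is one common definition and is an assumption here, as are the four taxa A-D and the sample counts.

```python
from collections import Counter
from math import prod


def clade_support(trees):
    """Fraction of sampled trees that contain each clade."""
    counts = Counter(clade for tree in trees for clade in tree)
    return {clade: n / len(trees) for clade, n in counts.items()}


def majority_rule_clades(support, threshold=0.5):
    """Clades present in more than `threshold` of the sampled trees."""
    return {clade for clade, s in support.items() if s > threshold}


def max_clade_credibility(trees, support):
    """Sampled tree whose clades have the highest product of supports."""
    return max(trees, key=lambda tree: prod(support[c] for c in tree))


# Hypothetical rooted four-taxon topologies, written as sets of clades.
t_ab_cd = frozenset({frozenset("AB"), frozenset("CD")})    # ((A,B),(C,D))
t_ac_b = frozenset({frozenset("AC"), frozenset("ABC")})    # (((A,C),B),D)
t_bc_a = frozenset({frozenset("BC"), frozenset("ABC")})    # (((B,C),A),D)

# Hypothetical post-burn-in MCMC sample of 10 trees.
sample = [t_ab_cd] * 4 + [t_ac_b] * 4 + [t_bc_a] * 2

support = clade_support(sample)
consensus = majority_rule_clades(support)
mcc = max_clade_credibility(sample, support)
print("consensus clades:", [sorted(c) for c in consensus])
print("MCC tree clades:", sorted(sorted(c) for c in mcc))
```

In this sample only the clade {A, B, C} appears in more than half of the trees, so the majority-rule consensus is only partly resolved. The maximum credibility tree is one of the fully resolved sampled topologies.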

Bayesian: Markov Chain Monte Carlo
- Used to generate the pool of plausible trees in Bayesian methods
- Resamples PARAMETERS (e.g. branch lengths, transition/transversion bias, base frequencies)
- Markov chain: trees are sampled one after the other; the next tree is determined only by the current tree (not by earlier ones)
- Monte Carlo: the next tree is obtained by a random perturbation of parameters

ML versus Bayesian
Likelihood analysis (e.g. PHYML, RAxML)
- Best tree = maximum likelihood tree (ML tree)
- Pool of plausible trees obtained by bootstrapping (perturbs CHARACTERS)
Bayesian analysis (e.g. MrBayes)
- Best tree = maximum posterior probability tree (MPP tree)
- Pool of plausible trees obtained by Markov chain Monte Carlo (perturbs PARAMETERS)

Discussion session

Process of Phylogenetic Estimation
- Sequence data
- MSA
- Algorithm: neighbor joining, parsimony, ML, Bayesian
- Substitution model: e.g. HKY +, JTT, WAG+F, mtREV24
- Estimate of phylogeny

Sources of Systematic Error
Diagram: sequence data, substitution model, algorithm, estimate of phylogeny
Alignment: residues are included in the analysis that are not related by substitutions.
Countermeasures
- Carefully examine and edit the MSA: remove regions from the analysis that are likely to be misaligned

Sources of Systematic Error
Model: substitutions may occur very differently from those described by the model used in the phylogenetic analysis.
Countermeasures
- Examine sequences for signs of such model mis-specification, e.g. check that the frequencies of residues are similar in all sequences
- If possible, exclude sequences/residues that seem to violate the model
- If not possible, interpret the resulting phylogeny critically
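A quick way to check whether residue frequencies are similar in all sequences is to tabulate the composition of each sequence and look at the spread per residue. This is a minimal sketch for DNA. The toy sequences and function names are made up, and what counts as "too different" is left to the analyst (no threshold is implied by the lecture).

```python
from collections import Counter


def residue_frequencies(seq, alphabet="ACGT"):
    """Relative frequency of each residue, ignoring gaps and ambiguity codes."""
    counts = Counter(ch for ch in seq.upper() if ch in alphabet)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no residues from the alphabet")
    return {ch: counts[ch] / total for ch in alphabet}


def frequency_spread(alignment, alphabet="ACGT"):
    """Largest minus smallest frequency of each residue across sequences."""
    freqs = [residue_frequencies(seq, alphabet) for seq in alignment.values()]
    return {
        ch: max(f[ch] for f in freqs) - min(f[ch] for f in freqs)
        for ch in alphabet
    }


# Hypothetical sequences; taxonC is deliberately GC-rich.
alignment = {
    "taxonA": "ACGTACGTAC",
    "taxonB": "ACGTACGTTC",
    "taxonC": "GCGCGCGGCC",
}
for name, seq in alignment.items():
    print(name, residue_frequencies(seq))
print("spread:", frequency_spread(alignment))
```

A sequence whose composition stands out, like the GC-rich one here, is a candidate for exclusion or at least for cautious interpretation of its placement.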

Sources of Systematic Error
Algorithm: the algorithm incorporates assumptions about sequence evolution that lead to model mis-specification, OR the algorithm fails (e.g. ML gets trapped in local maxima).
Countermeasures
- Compare the results of different algorithms: if they agree, it's less likely that specific algorithms have failed
- Run algorithms using different starting conditions (e.g. different initial values for the parameters of the likelihood model)

Exam Questions:
- What is the difference between local and global alignment?
- What does the following dot plot depict? What are the differences between sequences A and B?
- Draw a dot plot which has an insertion in sequence A in comparison to sequence B.
- Please write down the following tree topology in NEWICK format.
- Please draw the tree that is given by the following NEWICK format.
- What is the difference between orthologs and paralogs?
- What is the difference between the following two DNA models: HKY and FEL?
- Why can codon models be used to detect selection?
- Are the HKY model and the JC model nested? If yes, how many degrees of freedom should be used for a likelihood ratio test?
- Describe the difference between bootstrap and Bayesian branch support values.
- Please name the steps in the hierarchical structure of de novo sequencing.