Distance Methods. Distance Methods C G T A. Multiple changes at a single site - hidden changes. Seq 1 AGCGAG Seq 2 GCGGAC

Similar documents
Sequence Analysis 15: lecture 5. Substitution matrices Multiple sequence alignment

Introduction to Phylogenetic Analysis

Arbres formels et Arbre(s) de la Vie

A Step-by-Step Tutorial: Divergence Time Estimation with Approximate Likelihood Calculation Using MCMCTREE in PAML

Molecular Clocks and Tree Dating with r8s and BEAST

Heuristics for the Sorting by Length-Weighted Inversions Problem on Signed Permutations

Amino Acids and Their Properties

An introduction to Value-at-Risk Learning Curve September 2003

Phylogenetic Trees Made Easy

A comparison of methods for estimating the transition:transversion ratio from DNA sequences

PHYLOGENETIC ANALYSIS

Name Class Date. binomial nomenclature. MAIN IDEA: Linnaeus developed the scientific naming system still used today.

Maximum-Likelihood Estimation of Phylogeny from DNA Sequences When Substitution Rates Differ over Sites1

Rapid alignment methods: FASTA and BLAST. p The biological problem p Search strategies p FASTA p BLAST

Section 1. Logarithms

Bayesian Phylogeny and Measures of Branch Support

Missing data and the accuracy of Bayesian phylogenetics

Pairwise Sequence Alignment

Introduction to Bioinformatics AS Laboratory Assignment 6

Protein Sequence Analysis - Overview -

People have thought about, and defined, probability in different ways. important to note the consequences of the definition:

Introduction to Analysis of Variance (ANOVA) Limitations of the t-test

Algorithms in Computational Biology (236522) spring 2007 Lecture #1

Bio-Informatics Lectures. A Short Introduction

Northumberland Knowledge

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model

Factoring Whole Numbers

Mathematical goals. Starting points. Materials required. Time needed

CALCULATIONS & STATISTICS

DESCRIPTIVE STATISTICS. The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses.

Descriptive Statistics and Measurement Scales

II. DISTRIBUTIONS distribution normal distribution. standard scores

Association Between Variables

CHAPTER 14 ORDINAL MEASURES OF CORRELATION: SPEARMAN'S RHO AND GAMMA

PHYML Online: A Web Server for Fast Maximum Likelihood-Based Phylogenetic Inference

X = T + E. Reliability. Reliability. Classical Test Theory 7/18/2012. Refers to the consistency or stability of scores

The Central Dogma of Molecular Biology

Additional sources Compilation of sources:

Evaluating the Performance of a Successive-Approximations Approach to Parameter Optimization in Maximum-Likelihood Phylogeny Estimation

Chapter 11 Number Theory

Gerry Hobbs, Department of Statistics, West Virginia University

Introduction. Appendix D Mathematical Induction D1

Problem of the Month: Fair Games

1.5 Oneway Analysis of Variance

Day One: Least Common Multiple

Introduction to. Hypothesis Testing CHAPTER LEARNING OBJECTIVES. 1 Identify the four steps of hypothesis testing.

Missing Data: Part 1 What to Do? Carol B. Thompson Johns Hopkins Biostatistics Center SON Brown Bag 3/20/13

Types of Error in Surveys

A combinatorial test for significant codivergence between cool-season grasses and their symbiotic fungal endophytes

Finding Clusters in Phylogenetic Trees: A Special Type of Cluster Analysis

STATISTICS AND DATA ANALYSIS IN GEOLOGY, 3rd ed. Clarificationof zonationprocedure described onpp

Principles of Evolution - Origin of Species

Standard Deviation Estimator

Protein Protein Interaction Networks

How To Cluster

Note on growth and growth accounting

PROC. CAIRO INTERNATIONAL BIOMEDICAL ENGINEERING CONFERENCE

Evolution, Natural Selection, and Adaptation

PAML FAQ... 1 Table of Contents Data Files...3. Windows, UNIX, and MAC OS X basics...4 Common mistakes and pitfalls...5. Windows Essentials...

Module 3: Correlation and Covariance

What mathematical optimization can, and cannot, do for biologists. Steven Kelk Department of Knowledge Engineering (DKE) Maastricht University, NL

Simple Regression Theory II 2010 Samuel L. Baker

6.4 Normal Distribution

BLAST. Anders Gorm Pedersen & Rasmus Wernersson

Common Core State Standards for Mathematics Accelerated 7th Grade

DesCartes (Combined) Subject: Mathematics Goal: Data Analysis, Statistics, and Probability

THREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS

A branch-and-bound algorithm for the inference of ancestral. amino-acid sequences when the replacement rate varies among

Matching Investment Strategies in General Insurance Is it Worth It? Aim of Presentation. Background 34TH ANNUAL GIRO CONVENTION

z-scores AND THE NORMAL CURVE MODEL

Challenges in Longitudinal Data Analysis: Baseline Adjustment, Missing Data, and Drop-out

Building a phylogenetic tree

Simple linear regression

DATA INTERPRETATION AND STATISTICS

Phylogenetic systematics turns over a new leaf

CHAPTER 2 Estimating Probabilities

In mathematics, there are four attainment targets: using and applying mathematics; number and algebra; shape, space and measures, and handling data.

Chapter 3 RANDOM VARIATE GENERATION

Big Ideas in Mathematics

jmodeltest (April 2008) David Posada 2008 onwards

DNA Insertions and Deletions in the Human Genome. Philipp W. Messer

MATCH Commun. Math. Comput. Chem. 61 (2009)

Chapter 11 Monte Carlo Simulation

UNDERSTANDING THE TWO-WAY ANOVA

MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS

A Basic Introduction to Missing Data

6 Creating the Animation

Ready, Set, Go! Math Games for Serious Minds

Unit 6 Number and Operations in Base Ten: Decimals

Glencoe. correlated to SOUTH CAROLINA MATH CURRICULUM STANDARDS GRADE 6 3-3, , , 4-9

Betting on Volatility: A Delta Hedging Approach. Liang Zhong

IB Maths SL Sequence and Series Practice Problems Mr. W Name

A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution

Chapter 5: Diffusion. 5.1 Steady-State Diffusion

WRITING PROOFS. Christopher Heil Georgia Institute of Technology

Y Chromosome Markers

Fractions. If the top and bottom numbers of a fraction are the same then you have a whole one.

Asexual Versus Sexual Reproduction in Genetic Algorithms 1

Hierarchical Bayesian Modeling of the HIV Response to Therapy

Transcription:

Distance Methods Distance Methods Distance Estimates attempt to estimate the mean number of changes per site since 2 species (sequences) split from each other Simply counting the number of differences (sometimes called p distance) may underestimate the amount of change - especially if the sequences are very dissimilar - because of multiple hits To try and get better estimates we use a model which includes parameters which reflect how we think sequences may have evolved Some common models of sequence evolution commonly used in distance analysis: A gamma distribution can be used to model site rate heterogeneity Note that distance models are often based upon some of the same assumptions as the models in ML Jukes Cantor model: assumes all changes equally likely General time reversable model (GTR): assigns different probabilities to each type of change LogDet / Paralinear distance model: was devised to deal with unequal base frequencies in different sequences All of these models include a correction for multiple substitutions at the same site All (except Logdet/paralinear distances) can be modified to include a gamma correction for site rate heterogeneity The simplest model - Jukes & Cantor: d xy = -(3/4) ln (1-4/3 D) d xy = distance between sequence x and sequence y expressed as the number of changes per site (note d xy = r/n where r is number of replacements and n is the total number of sites. This assumes all sites can vary and when unvaried sites are present in two sequences it will underestimate the amount of change which has occurred at variable sites) D = is the observed proportion of nucleotides which differ between two sequences (fractional dissimilarity) ln = natural log function to correct for superimposed substitutions The 3/4 and 4/3 terms reflect that there are four types of nucleotides and three ways in which a second nucleotide may not match a first - with all types of change being equally likely (i.e. unrelated sequences should be 25% identical by chance alone) Multiple changes at a single site - hidden changes Seq 1 Seq 2 Seq 1 AGCGAG Seq 2 GCGGAC Number of changes 1 2 3 C G T A C A 1 1

The natural logarithm ln is used to correct for superimposed changes at the same site If two sequences are 95% identical they are different at 5% or 0.05 (D) of sites thus: d xy = -3/4 ln (1-4/3 0.05) = 0.0517 Note that the observed dissimilarity 0.05 increases only slightly to an estimated 0.0517 - this makes sense because in two very similar sequences one would expect very few changes to have been superimposed at the same site in the short time since the sequences diverged apart However, if two sequences are only 50% identical they are different at 50% or 0.50 (D) of sites thus: d xy = -3/4 ln (1-4/3 0.5) = 0.824 For dissimilar sequences, which may diverged apart a long time ago, the use of ln infers that a much larger number of superimposed changes have occurred at the same site A four taxon problem for us and and are thermophiles and mesophiles, respectively No data suggest that and are specifically related to either us or If all four bacteria are included in an analysis the true tree should place and us together The true tree us Comparison of observed (p) distances between sequences and JC distances for the same sequences using PAUP Uncorrected ("p") distance matrix 0.118 2 4 5 6 2-0.067 0.019 4 0.25186-0.099 5 0.18577 0.16866-0.090 6 0.21077 0.18881 0.19231 - Jukes-Cantor distance matrix 2 4 5 60.142 2-4 0.30689-5 0.21346 0.19106-0.071 0.026 0.116 6 0.24745 0.21751 0.22221-0.102 Both distances give the incorrect tree Note that the JC distances are larger The 16S rrna genes of,, us and Exclude characters command in PAUP - exclude constant sites: Character-exclusion status changed: 859 of 1273 characters excluded Total number of characters now excluded = 859 Number of included characters = 414 Base frequencies command in PAUP: Does the JC model fit these data? Taxon A C G T # sites -------------------------------------------------------------- 0.12319 0.38164 0.38164 0.11353 414 0.23188 0.22222 0.27295 0.27295 414 0.13317 0.35835 0.37530 0.13317 413 0.23188 0.22705 0.26570 0.27536 414 -------------------------------------------------------------- Mean 0.18006 0.29728 0.32387 0.19879 413.75 Distance models can be made more parameter rich to increase their realism 1 It is better to use a model which fits the data than to blindly impose a model on data (use Model Test) The most common additional parameters are: A correction for the proportion of sites which are unable to change A correction for variable site rates at those sites which can change A correction to allow different substitution rates for each type of nucleotide change PAUP will estimate the values of these additional parameters for you Estimation of model parameters using maximum likelihood Yang (1995) has shown that parameter estimates are reasonably stable across tree topologies provided trees are not too wrong. Thus one can obtain a tree using parsimony and then estimate model parameters on that tree. These parameters can then be used in a distance analysis (or a ML analysis). 2

50 changes Parameter estimates using the tree scores command in PAUP* Maximum parsimony tree Use PAUP* tree scores to use ML to estimate over this tree: 1) Proportion of invariant sites 2) Gamma shape parameter for variable sites Tree number 1: -Ln likelihood = 4011.82617 Estimated value of proportion of invariable sites = 0.315477 Estimated value of gamma shape parameter = 0.501485 Distance models can be made more parameter rich to increase their realism 2 0.142 0.071 0.102 0.026 0.116 JC 0.234 0.074 0.136 0.063 0.180 0.269 0.073 0.136 JC -invariant sites + gamma correction for variable sites 0.087 0.200 General Time Reversible (GTR) -inv + gamma The logdet/paralinear distances method Lockhardt et al.(1994) Mol. Biol.Evol.11:605-612 Lake (1994) PNAS 91:1455-1459 (paralinear distances) LogDet/paralinear distances was designed to deal with unequal base frequencies in each pairwise sequence comparison - thus it allows base compositions to vary over the tree! This distinguishes it from the GTR distance model which takes the average base composition and applies it to all comparisons The logdet/paralinear distances method 2 LogDet/paralinear distances assume all sites can vary - thus it is important to remove those sites which cannot change - this can be estimated using ML LogDet/Paralinear Distances d xy = -ln (det F xy ) d xy = estimated distance between sequence x and sequence y ln = natural log function to correct for superimposed substitutions F xy = 4 x 4 (there are four bases in DNA) divergence matrix for seq X & Y - this matrix summarises the relative frequencies of bases in a given pairwise comparison det = is the determinant (a unique mathematical value) of the matrix LogDet - a worked example (from Lockhardt et al. 1994) Sequence B a c g t a 224 5 24 8 Sequence A c 3 149 1 16 g 24 5 230 4 t 5 19 8 175 For sequences A and B, over 900 sequence positions, this matrix summarises pairwise site by site comparisons (it uses the data very efficiently) The matrix Fxy expresses this data as the proportions (e.g. 224/900 = 0.249) of sites: a c g t a.249.006.027.009 Fxy = c.003.166.001.018 g.027.006.256.004 t.006.021.009.194 Dxy = -ln [det Fxy] = -ln [.002] = 6.216 (the LogDet distance between sequences A and B) 3

The logdet/paralinear distances method finds the true tree for us + 0.208 0.111 0.076 0.054 0.162 At last! The logdet/paralinear distances method: advantages Very good for situations where base compositions vary between sequences Even when base compositions do not appear to vary the LogDet/Paralinear distances model performs at least as well as other distance methods A drawback is that it assumes rates are equal for all sites However, a correction whereby a proportion of invariable sites are removed prior to analysis appears to work very well as a rate correction in computer simulations Distances: advantages: Fast - suitable for analysing data sets which are too large for ML A large number of models are available with many parameters - improves estimation of distances Use ML to test the fit of model to data Distances: disadvantages: Only through character based analyses can the history of sites be investigated e,g, most informative positions be inferred. Generally outperformed by Maximum likelihood methods in choosing the correct tree in computer simulations (but LogDet can perform better than ML when base compositions vary) Fitting a tree to pairwise distances Numbers of possible trees for N taxa: For 10 taxa there are 2 x 10 6 unrooted trees For 50 taxa there are 3 x 10 74 unrooted trees How can we find the best tree for the distance data we have? 4

Obtaining a tree using pairwise distances Additive distances: If we could determine exactly the true evolutionary distance implied by a given amount of observed sequence change, between each pair of taxa under study, these distances would have the useful property of additivity and would match a single tree B A perfectly additive tree 0.3 A 0.1 0.1 A B C D A - 0.4 0.4 0.8 B 0.4-0.6 1.0 C 0.4 0.6-0.8 D 0.8 1.0 0.8-0.2 C 0.6 The branch lengths in the matrix and the tree path lengths match perfectly - there is a single unique additive tree D Distance estimates may not make an additive tree Some path lengths are longer and others shorter than appear in the matrix > (0.335) Jukes-Cantor distance matrix Proportion of sites assumed to be invariable = 0.56; identical sites removed proportionally to base frequencies estimated from constant sites only 1 2 4 5 6 1-2 0.38745-4 0.22455 0.47540-5 0.13415 0.27313 0.23615-6 0.27111 0.33595 0.28017 0.28846-0.119 0.079 0.017 0.217 0.057 0.056 0.145 > (0.33) > us (0.218) Obtaining a tree using pairwise distances Stochastic errors will cause deviation of the estimated distances from perfect tree additivity even when evolution proceeds exactly according to the distance model used Poor estimates obtained using an inappropriate model will compound the problem How can we identify the tree which best fits the experimental data from the many possible trees Obtaining a tree using pairwise distances We have uncertain data that we want to fit to a tree and find the optimal value for the adjustable parameters (branching pattern and branch lengths) Use statistics to evaluate the fit of tree to the data (goodness of fit measures) Fitch Margoliash method - a least squares method Minimum evolution method - minimises length of tree Note that neighbor joining while fast does not evaluate the fit of the data to the tree Fitch Margoliash Method 1968: Minimises the weighted squared deviation of the tree path length distances from the distance estimates 5

Tree 1 Fitch Margoliash Method 1968: 0.129 0.207 0.051 0.006 0.059 0.077 0.148 0.132 0.040 0.139 0.076 0.204 0.023 0.058 Optimality criterion = weighted least squares Score of best tree(s) found = 0.12243 (average %SD = 11.663) Tree # 1 2 Wtd. S.S. 0.13817 0.12243 APSD 12.391 11.663 Tree 2 - best Minimum Evolution Method: For each possible alternative tree one can estimate the length of each branch from the estimated pairwise distances between taxa and then compute the sum (S) of all branch length estimates. The minimum evolution criterion is to choose the tree with the smallest S value Minimum Evolution Tree 1 - best 0.081 0.217 0.119 0.058 0.152 0.012 0.053 0.119 0.217 0.057 0.017 0.056 0.079 Optimality criterion = minimum evolution Score of best tree(s) found = 0.68998 Tree # 1 2 ME-score 0.68998 0.69163 0.145 Tree 2 6