Substitution = Mutation followed by Fixation

Similar documents
Sequence Analysis 15: lecture 5. Substitution matrices Multiple sequence alignment

Inference of Large Phylogenetic Trees on Parallel Architectures. Michael Ott

Introduction to Phylogenetic Analysis

A comparison of methods for estimating the transition:transversion ratio from DNA sequences

Maximum-Likelihood Estimation of Phylogeny from DNA Sequences When Substitution Rates Differ over Sites1

Molecular Clocks and Tree Dating with r8s and BEAST

morephyml User Guide [Version 1.14] August 2011 by Alexis Criscuolo

Missing data and the accuracy of Bayesian phylogenetics

Phylogenetic systematics turns over a new leaf

jmodeltest (April 2008) David Posada 2008 onwards

PHYLOGENETIC ANALYSIS

Phylogenetic Trees Made Easy

Arbres formels et Arbre(s) de la Vie

A Step-by-Step Tutorial: Divergence Time Estimation with Approximate Likelihood Calculation Using MCMCTREE in PAML

Name: Class: Date: Multiple Choice Identify the choice that best completes the statement or answers the question.

Protein Sequence Analysis - Overview -

Direct Methods for Solving Linear Systems. Matrix Factorization

Estimation of the Number of Nucleotide Substitutions in the Control Region of Mitochondrial DNA in Humans and Chimpanzees

IE 680 Special Topics in Production Systems: Networks, Routing and Logistics*

Computational Reconstruction of Ancestral DNA Sequences

Elementary Matrices and The LU Factorization

Introduction to Bioinformatics AS Laboratory Assignment 6

Bayesian Phylogeny and Measures of Branch Support

PhyML Manual. Version 3.0 September 17,

Circuits 1 M H Miller

Divergence Time Estimation using BEAST v1.7.5

Hidden Markov Models

An Introduction to Phylogenetics

DNA Sequence Alignment Analysis

High Throughput Network Analysis

A branch-and-bound algorithm for the inference of ancestral. amino-acid sequences when the replacement rate varies among

Bio-Informatics Lectures. A Short Introduction

Core Bioinformatics. Degree Type Year Semester Bioinformàtica/Bioinformatics OB 0 1

BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS

Chapter 5 Discrete Probability Distribution. Learning objectives

PHYML Online: A Web Server for Fast Maximum Likelihood-Based Phylogenetic Inference

Name Class Date. binomial nomenclature. MAIN IDEA: Linnaeus developed the scientific naming system still used today.

The Central Dogma of Molecular Biology

Phylogenetic Models of Rate Heterogeneity: A High Performance Computing Perspective

Pairwise Sequence Alignment

Evaluating the Performance of a Successive-Approximations Approach to Parameter Optimization in Maximum-Likelihood Phylogeny Estimation

Keywords: evolution, genomics, software, data mining, sequence alignment, distance, phylogenetics, selection

Comparing Bootstrap and Posterior Probability Values in the Four-Taxon Case

Y Chromosome Markers

6. Cholesky factorization

Network Protocol Analysis using Bioinformatics Algorithms

Hierarchical Bayesian Modeling of the HIV Response to Therapy

Math 312 Homework 1 Solutions

Building a phylogenetic tree

Torgerson s Classical MDS derivation: 1: Determining Coordinates from Euclidean Distances

Amino Acids and Their Properties

Algebra 2 Chapter 1 Vocabulary. identity - A statement that equates two equivalent expressions.

DNA Printer - A Brief Course in sequence Analysis

A Combinatorial Approach for Determining Phylogenetic Invariants for the General Model

PROC LOGISTIC: Traps for the unwary Peter L. Flom, Independent statistical consultant, New York, NY

Solving Systems of Linear Equations

Algorithms in Computational Biology (236522) spring 2007 Lecture #1

Data for phylogenetic analysis

Finding Clusters in Phylogenetic Trees: A Special Type of Cluster Analysis

Lecture 2 Matrix Operations

RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison

Introduction to Matrix Algebra

Inferring Secondary Structure from RNA Alignments and their Trees

MATH 423 Linear Algebra II Lecture 38: Generalized eigenvectors. Jordan canonical form (continued).

Activity IT S ALL RELATIVES The Role of DNA Evidence in Forensic Investigations

Final Project Report

PAML FAQ... 1 Table of Contents Data Files...3. Windows, UNIX, and MAC OS X basics...4 Common mistakes and pitfalls...5. Windows Essentials...

As time goes by: A simple fool s guide to molecular clock approaches in invertebrates*

Networks in phylogenetic analysis: new tools for population biology

A short guide to phylogeny reconstruction

Core Bioinformatics. Degree Type Year Semester

An experimental study comparing linguistic phylogenetic reconstruction methods *

Detecting signatures of selection from. DNA sequences using Datamonkey

Parallel Algorithm for Dense Matrix Multiplication

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS

Row Echelon Form and Reduced Row Echelon Form

Genome Explorer For Comparative Genome Analysis

These axioms must hold for all vectors ū, v, and w in V and all scalars c and d.

Bayesian coalescent inference of population size history

Implementation exercises for the course Heuristic Optimization

Chapter 6. Linear Programming: The Simplex Method. Introduction to the Big M Method. Section 4 Maximization and Minimization with Problem Constraints

Least Squares Estimation

PROC. CAIRO INTERNATIONAL BIOMEDICAL ENGINEERING CONFERENCE

10.2 ITERATIVE METHODS FOR SOLVING LINEAR SYSTEMS. The Jacobi Method

A combinatorial test for significant codivergence between cool-season grasses and their symbiotic fungal endophytes

Statistical Models in Data Mining

Application of Markov chain analysis to trend prediction of stock indices Milan Svoboda 1, Ladislav Lukáš 2

Abstract: We describe the beautiful LU factorization of a square matrix (or how to write Gaussian elimination in terms of matrix multiplication).

Lab 2/Phylogenetics/September 16, PHYLOGENETICS

Chapter 9 Experience rating

South East of Process Main Building / 1F. North East of Process Main Building / 1F. At 14:05 April 16, Sample not collected

Classification/Decision Trees (II)

Lecture 1: Systems of Linear Equations

MAKING AN EVOLUTIONARY TREE

1.2 Solving a System of Linear Equations

Dr. Jinyuan (Stella) Sun Dept. of Electrical Engineering and Computer Science University of Tennessee Fall 2010

Approximating the Coalescent with Recombination. Niall Cardin Corpus Christi College, University of Oxford April 2, 2007

Notes on Determinant

Practice Questions 1: Evolution

Nonlinear Programming Methods.S2 Quadratic Programming

Transcription:

GAGATC 3:G A 6:C T Common Ancestor ACGATC 1:A G 2:C A Substitution = Mutation followed by Fixation 5:T C GAAATT 4:A C 1:G A

Likelihood (Prob. of data given model & parameter values) = Likelihood for Site 1 X Likelihood for Site 2 X... X... X... X Likelihood for Site 6 A G G

Probabilistic models of nucleotide change (independently and identically evolving sites) Let q ij be the instantaneous rate of change at a site from nucleotide type i to j Q is matrix of instantaneous rates (Q will have 4 rows and 4 columns because i and j can each by any of 4 nucleotide types) For nucleotide starting as type i at time 0, probability nucleotide is type j at time t is denoted p ij (t). p ij (t) is referred to as a transition probability. Consider a very very small amount of evolutionary time t. When i 6= j, p ij ( t). = q ij t p ii ( t). =1 X j,j6=i q ij t where p ii ( t). =1+q ii t q ii = X j,j6=i q ij (in preceding equations, =. can be replaced by = when the limit as t approaches 0 is taken)

Jukes-Cantor model is simplest model of nucleotide substitution. It assumes sequence positions evolve independently and it assumes that all possible changes at a position are equally likely. Let j be probability a residue is type j. j is called the equilibrium probability of type j. p ij (1) = j For Jukes-Cantor model, j =1/4 for all 4 nucleotide types j. General Time Reversible Model 3 subst. types (transitions, 2 transversion classes) 2 subst. types (transitions vs. transversions) Single Substitution Type Tamura-Nei HKY85 Felsenstein 84 Felsenstein 81 Equal Base Frequencies SYM K3ST Kimura 2 Param. Jukes-Cantor Equal Base Frequencies 3 subst. types (transitions, 2 transversion classes ) 2 subst. types (transitions vs. transversions) Single Substitution Type based on Figure 11 from Swofford et al. Chapter in Molecular Systematics (Sinauer, 2nd ed. 1996)

Rate Matrix for Jukes-Cantor Model F R To O M A C G T A 3µ µ µ µ C µ 3µ µ µ G µ µ 3µ µ T µ µ µ 3µ Note 1: Diagonal matrix elements are rate away from nucleotide type of that row. Note 2: In later slide on Jukes-Cantor model, we write s/3 rather than µ. Rate Matrix for Kimura 2-Parameter Model F R To O M A C G T A α 2β β α β C β α 2β β α G α β α 2β β T β α β α 2β Changes involving only purines (i.e., A and G) or only pyrimidines (i.e., C and T) are transitions. Changes involving one purine and one pyrimidine are transversions.

Rate Matrix for Felsenstein 1981 Model F R To O M A C G T A µ(π C + π G + π T ) µπ C µπ G µπ T C µπ A µ(π A + π G + π T ) µπ G µπ T G µπ A µπ C µ(π A + π C + π T ) µπ T T µπ A µπ C µπ G µ(π A + π C + π G ) Rate Matrix for Hasegawa-Kishino-Yano (a.k.a. HKY or HKY85) Model F R To O M A C G T A µ(π C + κπ G + π T ) µπ C µκπ G µπ T C µπ A µ(π A + π G + κπ T ) µπ G µκπ T G µκπ A µπ C µ(κπ A + π C + π T ) µπ T T µπ A µκπ C µπ G µ(π A + κπ C + π G )

Rate Matrix for General Time Reversible Model F R To O M A C G T A µ(aπ C + bπ G + cπ T ) µaπ C µbπ G µcπ T C µaπ A µ(aπ A + dπ G + eπ T ) µdπ G µeπ T G µbπ A µdπ C µ(bπ A + dπ C + fπ T ) µfπ T T µcπ A µeπ C µfπ G µ(cπ A + eπ C + fπ G ) Time Reversibility is a common property of models of sequence evolution. Time reversibility means that i p ij (t) = j p ji (t) for all i, j, and t. i q ij = j q ji for all i and j. For phylogeny reconstruction, time reversibility means that we cannot (on the basis of sequence data alone) hope to distinguish which of two sequence is ancestral and which is the descendant.

The practical implication of time reversibility for phylogeny reconstruction is that maximum likelihood cannot infer the position of the root of the tree unless additional information information exists (e.g., which taxa are the outgroups) or additional assumptions are made (e.g., a molecular clock). Q will represent matrix of instantaneous rates of change. For general time reversible model, entries of Q are: To From A C G T A (a C + b G + c T ) a c b G c T C a A (a A + d G + e T ) d G e T G b A d C (b A + d C + f T ) f T T c A e C f G (c A + e C + f G) In above matrix: a, b, c, d, e, and f cannot be negative. With any rate matrix (including above), the transition probabilities P (t) can be determined from the rate matrix Q and the amount of evolution t via P (t) =e Qt = I + (Qt) 1! where I is the identity matrix. + (Qt)2 2! + (Qt)3 3! +...,

Computing p ij (t) for the Jukes-Cantor model The Jukes Cantor model assumes that this is how nucleotide substitution occurs: 0. A = G = C = T = 1 4. 1. For each site in the sequence, an event will occur with probability 4 3s per unit evolutionary time. 2. If no event occurs, the residue at the site does not change. 3. If an event occurs, the probability that a residue is type i after the event is i. What is the probability that no event occurs in t units of evolutionary time? 4 (1 3 s) (1 4 3 s) (1 4 3 s)...(1 4 3 s) = (1 4 3 s)t. When 4 3s is close to 0, 1 4 3 s. = e 4 3 s.

Pr (no event) = (1 4 3 s)t. = e 4 3 st. When s is redefined as an instantaneous rate per unit evolutionary time, the approximation becomes an equality: Pr (no event) = e 4 3 st. Pr (at least one event) = 1 Pr (no event) = 1 e 4 3 st. If there have been no events, then the residue cannot possibly have changed after an amount of evolution t. If there has been at least one event, then the residue is type j with probability j. p ii (t) =Pr(noevents)+Pr(atleastoneevent) i = e 4 3 st +(1 e 4 3 st ) i.

For i 6= j, p ij (t) = Pr (at least one event) j = (1 e 4 3 st ) j. Notice that 4 3 s and t appear only as a product. 4 3 s and t cannot be separately estimated. Only their product can be estimated. Note: A generalization of the Jukes Cantor model, the Felsenstein 1981 model does not require A = G = C = T = 1 4. Jukes-Cantor Transition Probabilities Prob. of G at end of branch given A at beginning 0.0 0.05 0.10 0.15 0.20 0.25 0 1 2 3 4 Expected number of changes per site (branch length)

Prob. of A at end of branch given A at beginning 0.4 0.6 0.8 1.0 Jukes-Cantor Transition Probabilities 0 1 2 3 4 Expected number of changes per site (branch length)