Sequence Analysis 15: lecture 5. Substitution matrices Multiple sequence alignment

Similar documents
Amino Acids and Their Properties

Bio-Informatics Lectures. A Short Introduction

Introduction to Phylogenetic Analysis

Introduction to Bioinformatics AS Laboratory Assignment 6

Network Protocol Analysis using Bioinformatics Algorithms

Pairwise Sequence Alignment

Rapid alignment methods: FASTA and BLAST. p The biological problem p Search strategies p FASTA p BLAST

Protein Sequence Analysis - Overview -

Similarity Searches on Sequence Databases: BLAST, FASTA. Lorenza Bordoli Swiss Institute of Bioinformatics EMBnet Course, Basel, October 2003

Clone Manager. Getting Started

RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison

Core Bioinformatics. Degree Type Year Semester Bioinformàtica/Bioinformatics OB 0 1

PROC. CAIRO INTERNATIONAL BIOMEDICAL ENGINEERING CONFERENCE

BLAST. Anders Gorm Pedersen & Rasmus Wernersson

Phylogenetic Trees Made Easy

THREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS

DNA Insertions and Deletions in the Human Genome. Philipp W. Messer

Bayesian Phylogeny and Measures of Branch Support

Lab 2/Phylogenetics/September 16, PHYLOGENETICS

The Central Dogma of Molecular Biology

Introduction to Bioinformatics 3. DNA editing and contig assembly

Bioinformatics Grid - Enabled Tools For Biologists.

DNA Sequence Alignment Analysis

Database searching with DNA and protein sequences: An introduction Clare Sansom Date received (in revised form): 12th November 1999

Bioinformatics Resources at a Glance

A Step-by-Step Tutorial: Divergence Time Estimation with Approximate Likelihood Calculation Using MCMCTREE in PAML

BIOINFORMATICS TUTORIAL

BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS

PHYLOGENETIC ANALYSIS

Worksheet - COMPARATIVE MAPPING 1

Using MATLAB: Bioinformatics Toolbox for Life Sciences

Protein & DNA Sequence Analysis. Bobbie-Jo Webb-Robertson May 3, 2004

Algorithms in Computational Biology (236522) spring 2007 Lecture #1

Data for phylogenetic analysis

Computational searches of biological sequences

A branch-and-bound algorithm for the inference of ancestral. amino-acid sequences when the replacement rate varies among

Core Bioinformatics. Degree Type Year Semester

Overview Classes Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7)

Probability, statistics and football Franka Miriam Bru ckler Paris, 2015.

Activity IT S ALL RELATIVES The Role of DNA Evidence in Forensic Investigations

BIO 3350: ELEMENTS OF BIOINFORMATICS PARTIALLY ONLINE SYLLABUS

Current Motif Discovery Tools and their Limitations

Bayesian coalescent inference of population size history

PHYML Online: A Web Server for Fast Maximum Likelihood-Based Phylogenetic Inference

Principles of Evolution - Origin of Species

Multiple Sequence Alignment. Hot Topic 5/24/06 Kim Walker

Name: Class: Date: Multiple Choice Identify the choice that best completes the statement or answers the question.

Heuristics for the Sorting by Length-Weighted Inversions Problem on Signed Permutations

Core Bioinformatics. Titulació Tipus Curs Semestre Bioinformàtica/Bioinformatics OB 0 1

morephyml User Guide [Version 1.14] August 2011 by Alexis Criscuolo

Genome Explorer For Comparative Genome Analysis

Hierarchical Bayesian Modeling of the HIV Response to Therapy

MAKING AN EVOLUTIONARY TREE

Hidden Markov Models in Bioinformatics. By Máthé Zoltán Kőrösi Zoltán 2006

Maximum-Likelihood Estimation of Phylogeny from DNA Sequences When Substitution Rates Differ over Sites1

MEGA. Molecular Evolutionary Genetics Analysis VERSION 4. Koichiro Tamura, Joel Dudley Masatoshi Nei, Sudhir Kumar

MORPHEUS. Prediction of Transcription Factors Binding Sites based on Position Weight Matrix.

A comparison of methods for estimating the transition:transversion ratio from DNA sequences

Sequence Formats and Sequence Database Searches. Gloria Rendon SC11 Education June, 2011

MATCH Commun. Math. Comput. Chem. 61 (2009)

CD-HIT User s Guide. Last updated: April 5,

Module 10: Bioinformatics

Guide for Bioinformatics Project Module 3

Molecular Clocks and Tree Dating with r8s and BEAST

Mathematics and Statistics: Apply probability methods in solving problems (91267)

MASCOT Search Results Interpretation

Biological Sequence Data Formats

Systematics - BIO 615

DNA Printer - A Brief Course in sequence Analysis

Some Essential Statistics The Lure of Statistics


To discuss this topic fully, let us define some terms used in this and the following sets of supplemental notes.

Flexible Information Visualization of Multivariate Data from Biological Sequence Similarity Searches

An Extension Model of Financially-balanced Bonus-Malus System

People have thought about, and defined, probability in different ways. important to note the consequences of the definition:

T cell Epitope Prediction

2.3 Identify rrna sequences in DNA

Methods for big data in medical genomics

A Rough Guide to BEAST 1.4

Statistics in Retail Finance. Chapter 6: Behavioural models

Hidden Markov Models

PhyML Manual. Version 3.0 September 17,

Arbres formels et Arbre(s) de la Vie

Algorithms in Bioinformatics I, WS06/07, C.Dieterich 47. This lecture is based on the following, which are all recommended reading:

Summary Genes and Variation Evolution as Genetic Change. Name Class Date

11. Analysis of Case-control Studies Logistic Regression

Umm AL Qura University MUTATIONS. Dr Neda M Bogari

Finding Clusters in Phylogenetic Trees: A Special Type of Cluster Analysis

REWARD System For Even Money Bet in Roulette By Izak Matatya

Production Functions and Cost of Production

Multiple Sequence Alignment and Analysis: Part I An Introduction to the Theory and Application of Multiple Sequence Analysis.

SICKLE CELL ANEMIA & THE HEMOGLOBIN GENE TEACHER S GUIDE

Yew May Martin Maureen Maclachlan Tom Karmel Higher Education Division, Department of Education, Training and Youth Affairs.

Stochastic Gene Expression in Prokaryotes: A Point Process Approach

Multinomial and Ordinal Logistic Regression

A linear combination is a sum of scalars times quantities. Such expressions arise quite frequently and have the form

Gaussian Conjugate Prior Cheat Sheet

Detection of changes in variance using binary segmentation and optimal partitioning

NO CALCULATORS OR CELL PHONES ALLOWED

Transcription:

Sequence Analysis 15: lecture 5 Substitution matrices Multiple sequence alignment

A teacher's dilemma To understand... Multiple sequence alignment Substitution matrices Phylogenetic trees You first need to know... Substitution matrices Phylogenetic trees Multiple sequence alignment We ll start with substitution matrices.

Substitution matrices Used to score aligned positions, usually of amino acids. Expressed as the log-likelihood ratio of mutation (or log-odds ratio) Derived from multiple sequence alignments Two commonly used matrices: PAM and BLOSUM PAM = percent accepted mutations (Dayhoff) BLOSUM = Blocks substitution matrix (Henikoff)

PAM M Dayhoff, 1978 Evolutionary time is measured in Percent Accepted Mutations, or PAMs One PAM of evolution means 1% of the residues/bases have changed, averaged over all 20 amino acids. To get the relative frequency of each type of mutation, we count the times it was observed in a database of multiple sequence alignments. Based on global alignments Assumes a Markov model for evolution. Margaret Oakley Dayhoff

BLOSUM Henikoff & Henikoff, 1992 Based on database of ungapped local alignments (BLOCKS) Alignments have lower similarity than PAM alignments. BLOSUM number indicates the percent identity level of sequences in the alignment. For example, for BLOSUM62 sequences with approximately 62% identity were counted. Some BLOCKS represent functional units, providing validation of the alignment. Steven Henikoff

Multiple Sequence Alignment A multiple sequence alignment is made using many pairwise sequence alignments

Columns in a MSA have a common evolutionary history a phylogenetic tree for one position in the alignment? N S A By aligning the sequences, we are asserting that the aligned residues in each column had a common ancestor.

A tree shows the evolutionary history of a single position Ancestral characters can be inferred by parsimony analysis. W N W W worm clam bird goat fish 8

Counting mutations without knowing ancestral sequences Naíve way: Assume any of the characters could be the ancestral one. Assume equal distance to the ancestor from each taxon. L K F L K F L K F L K F L K F L K F L K F W W N R L S K K P R L S K K P R L T K K P R L S K K P R L S R K P R L T R K P R L ~ K K P W W N If was the ancestor, then it mutated to a W twice, to N once, and stayed three times.

We could have picked W as the ancestor... L K F L K F L K F L K F L K F L K F L K F W W N R L S K K P R L S K K P R L T K K P R L S K K P R L S R K P R L T R K P R L ~ K K P W W N If W was the ancestor, then it mutated to a four times, to N once, and stayed W once. * *FYI: This is how you draw a phylogenetic tree when the branch order is not known.

Subsitution matrices are symmetrical Since we don't know which sequence came first, we don't know whether w or...is correct. So we count this as one mutation of each type. P(-->W) and P(W-->) are the same number. (That's why we only show the upper triangle) w

Summing the substitution counts We assume the ancestor is one of the observed amino acids, but we don't know which, so we try them all. W W N N N W 3 1 2 one column of a MSA W symmetrical matrix

Next possible ancestor, again. We already counted this, so ignore it. W W N N N W 2 1 2 W

W W N N N W 2 1 W 1

W W N N N W 2 1 W 0

N W W W N N 2 0 0 W

Next... again W W N Counting as the ancestor many times as it appears recognizes the increased likelihood that (the most frequent aa at this position) is the true ancestor. N W N W 1 0 0

W W N (no counts for last seq.) N W N W 0 0 0

o to next column. Continue summing. W W N P P I N P P A Continue doing this for every column in every multiple sequence alignment... N W N W 6 4 8 0 2 TOTAL=21 1

Probability ratios are expressed as log odds Substitutions (and many other things in bioinformatics) are expressed as a "likelihood ratio", or "odds ratio" of the observed data over the expected value. Likelihood and odds are synomyms for Probability. So Log Odds is the log (usually base 2) of the odds ratio. log odds ratio = log 2 (observed/expected )

How do you calculate log-odds? P() = 4/7 = 0.57 Observed probability of -> q = P(->)=6/21 = 0.29 Expected probability of ->, e = 0.57*0.57 = 0.33 odds ratio = q /e = 0.29/0.33 log odds ratio = log 2 (q /e ) If the lod is < 0., then the mutation is less likely than expected by chance. If it is > 0., it is more likely.

Same amino acids, different distribution, different outcome. P()=0.50 e = 0.25 q = 9/42 =0.21 lod = log 2 (0.21/0.25) = 0.2 A W W A N A A s spread over many columns P()=0.50 e = 0.25 q = 21/42 =0.5 lod = log 2 (0.50/0.25) = 1 W A W A W A A s concentrated

Different observations, same expectation P()=0.50, P(W)=0.14 e W = 0.07 q W = 7/42 =0.17 lod = log 2 (0.17/0.07) = 1.3 A W A W N A A and W seen together more often than expected. P()=0.50, P(W)=0.14 e W = 0.07 q = 3/42 =0.07 lod = log 2 (0.07/0.07) = 0 W A W A W A A s and W s not seen together.

In class exercise: et the substitution value for P->Q PQPP QQQP QQPP QPPP QQQP P Q P Q P(P)=, P(Q)= e PQ = q PQ = / = lod = log 2 (q PQ /e PQ ) = sequence alignment database. substitution counts expected (e), versus observed (q) for P->Q

PAM assumes Markovian evolution A Markov process is one where the likelihood of the next "state" depends only on the current state. The inference that evolution is Markovian assumes that base changes (or amino acid changes) occur at a constant rate and depend only on the identity of the current base (or amino acid). transition likelihood / MY current aa -> ->A ->V V->V V->.9946.0002.0021.9932.0001 A V V millions of years (MY)

Markovian evolution is an extrapolation Start with one sequence. One position. Say ly. Wait 1 million years. What amino acids are now found at that position? PAM1 = Wait another million years. PAM1 = But,... PAM1 is just PAM1 PAM1

250 million years? PAM1 PAM1 PAM1 = 250 PAM250 = PAM1 The number after PAM denotes the power to which PAM1 was taken.

NOTE OF CLARIFICATION: PAM does not stand for Plus A Million years (or anything like that). It stands for Percent Accepted Mutations. One PAM1 unit does not correspond to 1 million years of evolution. There is no timescale associated with PAM. PAM1 corresponds to 1% mutations. (or 99% identity). The timescale depends on the species. 28

Differences between PAM and BLOSUM PAM PAM matrices are based on global alignments of closely related proteins. The PAM1 is the matrix calculated from comparisons of sequences with no more than 1% divergence. Other PAM matrices are extrapolated from PAM1 using an assumed Markov chain. BLOSUM BLOSUM matrices are based on local alignments. BLOSUM 62 is a matrix calculated from comparisons of sequences with approx 62% identity. All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins. BLOSUM 62 is the default matrix in BLAST (the database search program). It is tailored for comparisons of moderately distant proteins. Alignment of distant relatives may be more accurate with a different matrix.

PAM250

BLOSUM62

In class exercise: Which substitution matrix favors... PAM250 BLOSUM62 conservation of polar residues conservation of non-polar residues conservation of C, Y, or W polar-to-nonpolar mutations polar-to-polar mutations

Protein versus DNA alignments Are protein alignment better? Protein alphabet = 20, DNA alphabet = 4. Protein alignment is more informative Less chance of homoplasy with proteins. Homology detectable at greater edit distance Protein alignment more informative Better old Standard alignments are available for proteins. Better statistics from.s. alignments. On the other hand, DNA alignments are more sensitive to short evolutionary distances. 33

DNA evolutionary models: P-distance What is the relationship between time and the %identity? p = D L evolutionary time 0 p 1 p is a good measure of time only when p is small. 34

DNA evolutionary models: Poisson correction Corrects for multiple mutations at the same site. Unobserved mutations. p = D L 1-p = e -2rt d P = 2rt dp = -ln(1-p) The fraction unchanged decays according to the Poisson function. In the time t since the common ancestor, 2rt mutations have occurred, where r is the mutation rate (r = genetic drift * selection pressure) evolutionary time 0 p 1 Poisson correction assumes p goes to 1 at t=. Where should it really go? 35

DNA evolutionary models: Jukes-Cantor What is the relationship between true evolutionary distance and p-distance? Prob(mutation in one unit of time) = α Prob(no mutation) = 1-3α A C α << 1. T A 1-3α α α α At time t, fraction identical is q(t). Fraction non-identical is p(t). p(t) + q(t) = 1 In time t+1, each of q(t) positions stays same with prob = 1-3α. C T α α α 1-3α α α α 1-3α α α α 1-3α Prob that both sequences do not mutate = (1-3α) 2 =(1-6α+9α 2 ) (1-6α). ( Since α<<1, we can safely neglect α 2. ) Prob that a mismatch mutates back to an identity = 2αp(t) q(t+1) = q(t)(1-6α) + 2α(1-q(t)) d q(t)/dt q(t+1) - q(t) = 2α - 8α q(t) Integrating: q(t) = (1/4)(1 + 3exp(-8αt)) Solving for d JC = 6αt = -(3/4)ln(1 - (4/3)p), where p is the p-distance. 36

DNA evolutionary models evolutionary time 0 p 1 p-distance Poisson correction Jukes-Cantor In Jukes-Cantor, p limits to p=0.75 at infinite evolutionary distance. 37

Transitions/transversions A C In DNA replication, errors can be transitions (purine for purine, pyrimidine for pyrimidine) or transversions (purine for pyrimidine & vice versa) R = transitions/transversions. R would be 1/2 if all mutations were equally likely. In DNA alignments, R is observed to be about 4. Transitions are greatly favored over transversions. T 38

Jukes-Cantor with correction for transitions/ transversions (Kimura 2-parameter model, dk2p) A C A C T 1-2β-α β β α β 1-2β-α α β β α β β α 1-2β-α β 1-2β-α Split changes (D) into the two types, transition (P) and transversion (Q) p-distance = D/L = P + Q P = transitions/l, Q=transversions/L The the corrected evolutionary distance is... d K2P =-(1/2)ln(1-2P-Q)-(1/4)ln(1-2Q) 39

Further corrections are possible, but... Do we need a nucleotide substitution matrix? A C T A C Additional corrections possible for: Sequence position (gamma) Isochores (C-rich, AT-rich regions) Methylated, not methylated.

Review questions What are we asserting about ancestry by aligning the sequences? What kind of data is used to generate a substitution matrix? What makes the PAM matrix Markovian? When should you align amino acid sequence, when DNA? When is the P-distance an accurate measure of time? Why is time versus p-distance Poisson? Why are transitions more common the transversions? What does Kimura do that Jukes-Cantor does not? 41