CONSTRUCTING PAM MATRICES


WINFRIED JUST

In this note we will use several different matrices to describe the same phenomenon, so we must be very careful about distinguishing them by name. The standing assumption will be that we have a fixed alphabet of characters $\{c_1, c_2, \ldots, c_n\}$, and that we are looking at a process (evolution in our case) by which some of these characters mutate into each other. We will assume that for a fixed unit of time $T$ the probability that character $c_i$ mutates into character $c_j$ is a fixed number $m_{ij}$. The matrix $M = [m_{ij}]_{1 \le i,j \le n}$ that lists these probabilities will be referred to as the mutation probability matrix. Note that the mutation probability matrix has two important properties: (1) all entries are nonnegative numbers and (2) the sum of the numbers in each row is 1. Thus $M$ is a stochastic matrix.

There is another underlying assumption here, namely that the probability that character $c_i$ mutates into character $c_j$ over a time interval $T$ does not depend on the prior history of the process. Thus we model evolution as a Markov process. In general, a (discrete time) Markov process or Markov chain traces a system through a sequence of steps. At each step the system can be in one of a fixed number of states. In our case, the states are the characters $c_1, \ldots, c_n$. If the system is in state $i$, then it switches to state $j$ with probability $t_{ij}$, called the transition probability. Note that the matrix of transition probabilities is exactly the same thing that we called the matrix of mutation probabilities.

Here is an interesting observation: If $M$ is a stochastic $n \times n$ matrix, then there is usually a unique vector $[q_1, \ldots, q_n]$ such that

(1)    $[q_1, \ldots, q_n] \cdot M = [q_1, \ldots, q_n]$

and

(2)    $\lim_{k \to \infty} M^k = \begin{pmatrix} q_1 & q_2 & \cdots & q_n \\ q_1 & q_2 & \cdots & q_n \\ \vdots & \vdots & & \vdots \\ q_1 & q_2 & \cdots & q_n \end{pmatrix}.$

The operation $\cdot$ is matrix multiplication; $M^k$ is obtained by multiplying $M$ with itself $k$ times. For our purposes it is not necessary to know the details of how matrix multiplication is performed; it suffices to know that it is a fairly standard operation that can be done on any reasonably powerful graphing calculator or computer algebra system, such as MATLAB.

The vector $[q_1, \ldots, q_n]$ in the above observation is called the steady state or equilibrium vector of the process. It has a very intuitive interpretation: The matrix multiplication equation $[p_1, \ldots, p_n] \cdot M = [r_1, \ldots, r_n]$ says that if we have a long sequence of characters in which $c_1$ is present in proportion $p_1$, $c_2$ in proportion $p_2$, ..., $c_n$ in proportion $p_n$, and we let the sequence evolve for $T$ units of time, then in the resulting sequence $c_1$ will be present in proportion $r_1$, $c_2$ in proportion $r_2$, ..., $c_n$ in proportion $r_n$. Thus equation (1) tells us that evolution will not change the proportions of characters as long as these proportions are in the steady state. Equation (2) tells us that if we start in any state and let the process run for enough steps, it will spend a proportion of about $q_i$ of the time in state $i$. Alternatively, if we start with a long enough sequence of characters and let it evolve long enough, then the proportions of characters in the sequence will get very close to the numbers in the steady state vector. For this reason, we will refer to the numbers $q_i$ in the equilibrium vector as the target frequencies of the process.
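To make this observation concrete, here is a small numerical sketch. The 2-state stochastic matrix below is an illustrative stand-in (it is not taken from the note); the sketch shows that high powers of a stochastic matrix converge to a matrix whose rows all equal the equilibrium vector, and that this vector can also be found directly as a left eigenvector for eigenvalue 1.

```python
import numpy as np

# Illustrative 2-state stochastic matrix (an assumed example, not from the note).
M = np.array([[0.9, 0.1],
              [0.2, 0.8]])

# Equation (2): high powers of M have (nearly) identical rows.
M_high = np.linalg.matrix_power(M, 100)
print(M_high)            # every row is approximately the equilibrium vector

# Equation (1): the equilibrium vector q satisfies q @ M = q.
# It is a left eigenvector of M for eigenvalue 1, normalized to sum to 1.
eigvals, eigvecs = np.linalg.eig(M.T)   # left eigenvectors of M = eigenvectors of M^T
q = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1))])
q = q / q.sum()
print(q)                 # approximately [0.667, 0.333] for this example
print(q @ M)             # equals q, up to rounding
```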

This leads us to an assumption that often underlies the construction of scoring matrices for sequence comparison and alignment and that is seldom clearly spelled out: It is assumed that the observed character frequencies in the data set that is used to construct the scoring matrix are the target frequencies of the mutation probability matrix for the sequences to be scored.

A Markov process is reversible if it looks the same when run forwards or backwards. For example, the process described by matrix (3) below is obviously reversible, while the process described by matrix (4) is not.

(3), (4)    [two example stochastic matrices; their numerical entries are omitted in this transcription]

Theorem 1. Let $M$ be a mutation probability matrix with target frequencies $[q_1, \ldots, q_n]$. Then $M$ describes a reversible Markov process if and only if for all $1 \le i, j \le n$ we have

(5)    $q_i m_{ij} = q_j m_{ji}$.

A priori there seems to be no biological reason to assume that molecular evolution is a reversible Markov process. However, the assumption of time-reversibility simplifies the study of molecular evolution (see e.g. [5], page 69), and it is being made for this very reason. The assumption is almost certainly wrong. However, as long as we always compare two sequences from different extant organisms, we are really looking at evolutionary time having run backwards from one organism to the last common ancestor and then forwards to the other organism (see below), and since our treatment of the two organisms is symmetric, the assumption of time-reversibility should lead to realistic results.

As far as I can see, in the original construction of the PAM matrices in [2], time-reversibility was not assumed. The assumptions about the evolutionary process that underlie the construction in [2] are not clearly spelled out in the paper, and the implicit assumptions that I see there do not seem more plausible to me than the assumption of time-reversibility.
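Condition (5) is easy to test numerically: it says that the "flux" matrix with entries $q_i m_{ij}$ must be symmetric. The sketch below checks detailed balance for two small matrices; both matrices and the target frequencies are hypothetical stand-ins, since the numerical examples (3) and (4) of the note are not reproduced above.

```python
import numpy as np

def is_reversible(M, q, tol=1e-12):
    """Check detailed balance (5): q_i * m_ij == q_j * m_ji for all i, j.
    Here q must be the vector of target frequencies (equilibrium vector) of M."""
    flux = np.diag(q) @ M          # entry (i, j) is q_i * m_ij
    return np.allclose(flux, flux.T, atol=tol)

# Hypothetical stand-ins with uniform target frequencies (assumed values,
# not the matrices (3) and (4) of the note).
q = np.array([1/3, 1/3, 1/3])

M_rev = np.array([[0.8, 0.1, 0.1],      # symmetric fluxes, hence reversible
                  [0.1, 0.8, 0.1],
                  [0.1, 0.1, 0.8]])

M_irrev = np.array([[0.5, 0.5, 0.0],    # probability flows around a cycle,
                    [0.0, 0.5, 0.5],    # so the process looks different when
                    [0.5, 0.0, 0.5]])   # run backwards

print(is_reversible(M_rev, q))      # True
print(is_reversible(M_irrev, q))    # False
```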

In [3], Dan Gusfield gives a description of the PAM matrices that, in his words (page 383), "roughly, but not exactly" reflects Dayhoff's construction. Gusfield's exposition does assume time-reversibility (implicitly), and I will base the following exposition on Gusfield's description.

Now let us define what PAM means. The acronym stands for "percent accepted mutation". Ideally, two sequences $s$ and $t$ are defined as being one PAM unit diverged if a series of accepted point mutations (and no indels) has converted $s$ to $t$ with an average of one accepted point mutation per one hundred sequence positions. The term "accepted" here means a mutation that was incorporated into the molecular sequence and passed on to its progeny. Note that in very long sequences that are one PAM unit diverged we will see slightly less than 1% character substitutions, since some loci may have undergone multiple mutations. Fortunately, if the distance is only a few PAMs, the discrepancy is very small and can be ignored in the construction of our matrices.

In the original construction of PAM matrices, Dayhoff started with pairs of protein sequences that were each at most 15% diverged. Here we will explain the process in a simplified way. While the real families of PAM (and BLOSUM) matrices are used for scoring amino acid sequences, we will construct here a family of baby-PAM matrices for nucleotide sequences as a means of illustrating the construction.

Let us assume we have a collection of perfectly aligned sequences of nucleotides such that in each pair we see character substitutions at about 2% of the loci. We may think of each pair of sequences in this family as being 2 PAMs diverged. Now assume that by counting the observed percentages of character substitutions in this family, we get the following character substitution matrix $A_2 = [a_{ij}]_{1 \le i,j \le 4}$:

(6)    [the 4 × 4 matrix $A_2$, with rows and columns indexed by A, C, G, T; its numerical entries are omitted in this transcription]

The meaning of the entries in the above matrix is the following: Picking randomly a pair $\langle s, t \rangle$ of sequences from our collection, and picking randomly a locus $k$, the probability that $s[k] = \mathrm{A} = t[k]$ is 0.195; the probability that $s[k] = \mathrm{A}$ and $t[k] = \mathrm{C}$ is 0.001; etc. Note that the probabilities for pairs of different characters sum up to 0.02. This is why we say that the sequences are roughly 2 PAM units diverged. Let us also assume that the average C-G content in our collection of sequences is 60%, and that A's are as frequent as T's and C's are as frequent as G's.

The first step in our construction of baby-PAM matrices will be to reconstruct the mutation probability matrix $M_2$ that gave rise to the matrix $A_2$. Note that the letters $s[k], t[k]$ at a given locus in two of our sequences are derived from a letter $r[k]$ of an ancestral sequence $r$ that will have an evolutionary distance of about 1 PAM from each of the observed sequences $s$ and $t$. Now here is a beautiful consequence of the assumption that molecular evolution is a time-reversible process: This detail does not matter! We can think of $s$ evolving into $t$ by going back through the ancestral sequence and then moving forward in time to become $t$, or we can think of $t$ evolving backwards into $r$ and then into $s$, and the derived mutation probability matrix will always be the same.

Note that by our other assumption, the target frequencies for the matrix $M_2$ will be $[q_1, q_2, q_3, q_4] = [0.2, 0.3, 0.3, 0.2]$. Let us now treat $s$ as the ancestral sequence. Then the probability $a_{12}$ of $s[k] = \mathrm{A}$ and $t[k] = \mathrm{C}$ is equal to 0.001. This probability must be equal to $q_1 m_{12}$, where $m_{12}$ is the probability that a given A mutates to C. Thus we can calculate $m_{12} = a_{12}/q_1 = 0.001/0.2 = 0.005$. In general we will have $m_{ij} = a_{ij}/q_i$. These calculations lead to the following matrix $M_2$ of character mutation probabilities:

(7)    [the 4 × 4 matrix $M_2$, with rows and columns indexed by A, C, G, T; its numerical entries are omitted in this transcription]
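The division $m_{ij} = a_{ij}/q_i$ is easy to carry out for the whole matrix at once. In the sketch below the matrix $A_2$ is a stand-in: only $a_{11} = 0.195$ and $a_{12} = 0.001$ are given in the text, so the remaining entries are hypothetical, chosen so that the matrix is symmetric, its off-diagonal entries sum to 0.02, and each row $i$ sums to the target frequency $q_i$ of $[0.2, 0.3, 0.3, 0.2]$.

```python
import numpy as np

# Target frequencies for A, C, G, T (60% C+G, A as frequent as T, C as frequent as G).
q = np.array([0.2, 0.3, 0.3, 0.2])

# Stand-in for the character substitution matrix A_2 of equation (6).
# Only a_11 = 0.195 and a_12 = 0.001 come from the note; the other entries are
# hypothetical, chosen to be symmetric, to have off-diagonal entries summing
# to 0.02 (2 PAMs), and to have row i summing to q_i.
A2 = np.array([[0.195, 0.001, 0.003, 0.001],
               [0.001, 0.295, 0.001, 0.003],
               [0.003, 0.001, 0.295, 0.001],
               [0.001, 0.003, 0.001, 0.195]])

# m_ij = a_ij / q_i: divide row i of A_2 by q_i.
M2 = A2 / q[:, None]

print(M2[0, 1])        # 0.001 / 0.2 = 0.005, as computed in the text
print(M2.sum(axis=1))  # each row sums to 1, so M_2 is a stochastic matrix
```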

Now here comes the big question: How do we derive scoring matrices for distantly related sequences from data about closely related sequences? In the PAM model, this problem is solved as follows: Evolutionary changes over a long period happen one generation at a time. Thus if we know the matrix $M_2$ of character mutation probabilities for sequences that are 2 PAMs apart, then we can construct the matrix $M_4$ of character mutation probabilities for sequences that are 4 PAMs apart by taking the product $M_2 \cdot M_2$. In general, the matrix $M_{2k}$ of character mutation probabilities for sequences that are $2k$ PAMs apart can be obtained by taking the $k$-th power of $M_2$ (that is, multiplying $M_2$ by itself $k$ times). Thus $M_{2k} = (M_2)^k$.

For example, the character mutation probability matrix for constructing our baby-PAM120 will be the matrix $M_{120} = (M_2)^{60}$, which looks as follows:

(8)    [the 4 × 4 matrix $M_{120}$, with rows and columns indexed by A, C, G, T; its numerical entries are omitted in this transcription]

The character mutation probability matrix for constructing our baby-PAM250 will be the matrix $M_{250} = (M_2)^{125}$, which looks as follows:

(9)    [the 4 × 4 matrix $M_{250}$, with rows and columns indexed by A, C, G, T; its numerical entries are omitted in this transcription]
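Computing these powers is a one-liner once $M_2$ is available. Continuing with the hypothetical stand-in values for $A_2$ and $q$ from the previous sketch:

```python
import numpy as np

# Hypothetical stand-in values carried over from the previous sketch.
q = np.array([0.2, 0.3, 0.3, 0.2])
A2 = np.array([[0.195, 0.001, 0.003, 0.001],
               [0.001, 0.295, 0.001, 0.003],
               [0.003, 0.001, 0.295, 0.001],
               [0.001, 0.003, 0.001, 0.195]])
M2 = A2 / q[:, None]                      # m_ij = a_ij / q_i

# M_{2k} = (M_2)^k: mutation probability matrices at 120 and 250 PAMs.
M120 = np.linalg.matrix_power(M2, 60)     # 120 PAMs = 60 steps of 2 PAMs each
M250 = np.linalg.matrix_power(M2, 125)    # 250 PAMs = 125 steps of 2 PAMs each

np.set_printoptions(precision=3, suppress=True)
print(M120)
print(M250)  # as the power grows, every row drifts toward the target frequencies
```

Note that any rounding of the entries of $M_2$ is compounded by the repeated multiplications, which is the source of the slight inaccuracies mentioned below for matrices (8)-(10).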

As a next step in the construction of baby-PAM120 we must reconstruct the character substitution matrix $A_{120}$ from the character mutation probability matrix $M_{120}$. Entry $a_{ij}$ in $A_{120}$ will be the probability that in two sequences $s$ and $t$ that have an evolutionary distance of 120 PAMs, at a randomly chosen locus $k$, we will have $s[k] = c_i$ and $t[k] = c_j$. If we treat $s$ as the ancestral sequence (as the assumption of time-reversibility allows us to do), then we can get $a_{ij}$ by multiplying $q_i$ (the probability of finding $c_i$ in position $s[k]$) by $m_{ij}$ (the probability that $c_i$ mutates to $c_j$). In other words, $A_{120}$ can be obtained by multiplying the C and G rows of $M_{120}$ by 0.3 and the A and T rows of $M_{120}$ by 0.2. We get the following matrix $A_{120}$:

(10)    [the 4 × 4 matrix $A_{120}$, with rows and columns indexed by A, C, G, T; its numerical entries are omitted in this transcription]

Note that the above matrix is slightly inaccurate; in particular, the probabilities for A-G pairs and C-T pairs should be exactly the same. The discrepancy is due to rounding errors in the process of matrix multiplication. (The matrices (8) and (9) suffer from the same defect.)

Finally, we are ready to construct the baby-PAM scoring matrix $S_{120}$ itself. Recall that an entry $s_{ij}$ of the latter matrix should be the log-odds score comparing the probability of finding $(c_i, c_j)$ in a correctly aligned column with the probability of finding $(c_i, c_j)$ in randomly aligned sequences. In other words, we should have

$s_{ij} = \log_2\!\left(\frac{a_{ij}}{p_i p_j}\right).$

For example, in our case, $s_{12} = \log_2\!\left(\frac{a_{12}}{(0.2)(0.3)}\right)$. Thus we get the matrix:

(11)    [the 4 × 4 scoring matrix $S_{120}$, with rows and columns indexed by A, C, G, T; its numerical entries are omitted in this transcription]

Let us remark that in the real PAM matrices, the scores are multiplied by two and rounded to the nearest integer.

Let us look at two very important characteristics of the matrix (11). First let us ask ourselves: If we score an ungapped alignment of two random strings of length $m$ each with matrix (11), what is the expected value of the alignment score? This expected value can be calculated as $m$ times the expected score for a pair of random loci. The formula for the latter is

$E = \sum_{1 \le i,j \le n} s_{ij}\, q_i q_j.$

In our case, this means we have to add up the sixteen terms $s_{ij} q_i q_j$ with the scores $s_{ij}$ taken from matrix (11), starting with $(0.73)(0.2)^2$ for the A-A pair; the total is a negative number. It is not hard to see why the average score for a column in a random alignment should be negative: If the average score for a random character pair were positive, then the scores for random extensions of a local alignment by random letters would tend to rise rather than fall, which would result in a lot of long, completely spurious local alignments.

Now let us consider the related quantity

$H = \sum_{1 \le i,j \le n} s_{ij}\, a_{ij}.$

This is the average score of a correctly aligned character pair. Since our matrix (11) gives the scores in bits, we can think of $H$ as the average amount of information supplied by a correctly aligned character pair. Accordingly, $H$ is known as the (relative) entropy of the scoring matrix. For scoring matrix (11) the entropy is equal to $H = (0.73)(0.06646) + (0.66)(0.03792) + \cdots + (0.73)(0.06646) = 0.134$.
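Putting the last few steps together, the sketch below (still using the hypothetical stand-in for $A_2$, so the numbers will differ from those in the note) builds $A_{120}$, the log-odds scoring matrix $S_{120}$ in bits, the expected random-pair score $E$, and the relative entropy $H$. For any log-odds matrix constructed this way, $E$ comes out negative and $H$ positive.

```python
import numpy as np

# Hypothetical stand-in data carried over from the earlier sketches.
q = np.array([0.2, 0.3, 0.3, 0.2])                 # target frequencies of A, C, G, T
A2 = np.array([[0.195, 0.001, 0.003, 0.001],
               [0.001, 0.295, 0.001, 0.003],
               [0.003, 0.001, 0.295, 0.001],
               [0.001, 0.003, 0.001, 0.195]])
M2 = A2 / q[:, None]                                # m_ij = a_ij / q_i
M120 = np.linalg.matrix_power(M2, 60)               # 120 PAMs

# Character substitution matrix: a_ij = q_i * m_ij (multiply row i of M120 by q_i).
A120 = q[:, None] * M120

# Log-odds scores in bits: s_ij = log2(a_ij / (q_i * q_j)).
S120 = np.log2(A120 / np.outer(q, q))

# Expected score of a randomly aligned character pair, and relative entropy
# (average score of a correctly aligned pair).
E = np.sum(S120 * np.outer(q, q))
H = np.sum(S120 * A120)

np.set_printoptions(precision=2, suppress=True)
print(S120)
print(E, H)   # E < 0 < H for a log-odds matrix of this kind
```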

The entropy of a scoring matrix allows us to answer the following question: Given a query sequence of length $m$ and a database of length $l$, how many letters must a local alignment contain, on average, to reach a statistically significant score? One can show that the score of a significant local alignment should be at least $\log_2(m) + \log_2(l)$ bits. If one correctly aligned character pair contributes on average $H$ units of information, then a local alignment with a statistically significant score should be at least

$\frac{\log_2(m) + \log_2(l)}{H}$

character pairs long. For example, if we search a nucleotide sequence of length 1,000 bp against the whole human genome using the scoring matrix (11), then a statistically significant local alignment would usually have to be at least $(\log_2(1000) + \log_2(l))/0.134 = 242$ bp long, where $l$ is the length of the human genome in bp.

Homework 1: Find the mutation probability matrix $M_{40}$ that can be derived from the matrix $M_2$ given above and that corresponds to an evolutionary distance of 40 PAMs.

Homework 2: (1) Construct the character substitution matrix $A_{250}$ that corresponds to the character mutation probability matrix $M_{250}$ given above. (2) Use $A_{250}$ to construct the corresponding baby-PAM250 scoring matrix $S_{250}$ with scores given in bits. (3) Find the expected score for random character mismatches and the entropy of $S_{250}$.

References

[1] S. Altschul. Amino acid substitution matrices from an information-theoretic perspective. J. Mol. Biol., 219:555-565, 1991.
[2] M. O. Dayhoff, R. M. Schwartz and B. C. Orcutt. A model of evolutionary change in proteins. Atlas of Protein Sequence and Structure, 5 (suppl. 3):345-352, 1978.
[3] D. Gusfield. Algorithms on Strings, Trees, and Sequences. Cambridge University Press, 1997.
[4] S. Henikoff and J. G. Henikoff. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA, 89:10915-10919, 1992.
[5] W.-H. Li. Molecular Evolution. Sinauer Associates, 1997.

Department of Mathematics, Ohio University, Athens, Ohio 45701, U.S.A.
E-mail address: just@math.ohiou.edu
