Sequence comparison, Part I: Substitution and Scores

1 Sequence comparison, Part I: Substitution and Scores David H. Ardell Docent of Bioinformatics

2 Outline of the lecture: Convergence and Divergence; Similarity and Homology; Percent Difference as Evolutionary Distance; Mutations and Substitutions; Hidden change in sequences; Poisson Correction; Substitution Matrices; Odds, Likelihood Ratios, Log-Likelihoods, Scores; Sequence similarity scores; Score Matrices: PAM and BLOSUM; DNA Matrices

4 HOMOLOGY: common descent (Darwin, 1859). Original definition: "the same organ in different animals under every variety of form and function" (Owen, 1843). [Portrait: Richard Owen] But: homology need not imply similarity of form or function, because of divergence. Similarity need not imply homology, because of convergence.

5 [Figure labels: Most Recent Common Ancestor (twice); Divergence; Convergence]

6 [Figure labels: Most Recent Common Ancestor (twice); Earlier Common Ancestor; Divergence; Convergence]

7 Morphology vs. Sequences [Figure: example DNA sequences GCCACTTT CGCGATCA, GAAACGTT CGTGATCG, GGCAGTTT CGCGATTT]

8 Morphology: convergence is more common. DNA sequences: convergence is very rare! [Figure: example DNA sequences GCCACTTT CGCGATCA, GGCAGATT CAGGATTT]

9 Why sequence convergence is rare: many genotypes code for the same phenotype. [Figure labels: Development; Evolution; Convergent Phenotype; Divergent Genotype. Example genotypes: GCCACTTT CGCGATCA, GAAACGTT CGTGATCG, GGCAGATT CAGGATTT]

10 The enormity of sequence space: DNA (a = 4). L = 1 (A, G, T, C): $N = a^L = 4^1 = 4$, $K = NL(a-1) = 12$.

11 The enormity of sequence space: DNA (a = 4). L = 1 (A, G, T, C): $N = a^L = 4^1 = 4$, $K = NL(a-1) = 12$. L = 2 (AA, AT, AG, AC, GA, GT, GG, GC, TA, TT, TG, TC, CA, CT, CG, CC): $N = a^L = 4^2 = 16$.

12 The enormity of sequence space: DNA (a = 4). L = 1 (A, G, T, C): $N = a^L = 4^1 = 4$, $K = NL(a-1) = 12$. L = 2 (AA, AT, AG, AC, GA, GT, GG, GC, TA, TT, TG, TC, CA, CT, CG, CC): $N = a^L = 4^2 = 16$, $K = NL(a-1) = 96$.

13 The enormity of sequence space: DNA (a = 4). L = 3 (all 64 triplets, AAA through CCC): $N = a^L = 4^3 = 64$, $K = NL(a-1) = 576$.

14 The enormity of sequence space. DNA (a = 4), L = 300: $N = a^L = 4^{300} \approx 4.15 \times 10^{180}$, $K = NL(a-1) \approx 3.74 \times 10^{183}$.

15 The enormity of sequence space. DNA (a = 4), L = 300: $N = a^L = 4^{300} \approx 4.15 \times 10^{180}$, $K = NL(a-1) \approx 3.74 \times 10^{183}$. The probability of two independent, randomly evolving sequences converging over any but very small lengths is infinitesimally small.

16 Similarity implies homology. DNA (a = 4), L = 300: $N = a^L = 4^{300} \approx 4.15 \times 10^{180}$, $K = NL(a-1) \approx 3.74 \times 10^{183}$. The probability of two independent, randomly evolving sequences converging over any but very small lengths is infinitesimally small. Sequences more similar than expected at random are therefore inferred to have evolved from a common ancestor.
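The counts on the sequence-space slides can be checked with a few lines of Python. This is only a sketch: the function name `sequence_space` is made up for illustration, and reading K as "single-site changes summed over all N sequences" is an interpretation of the slide formula, not something the slides state.

```python
def sequence_space(alphabet_size: int, length: int) -> tuple[int, int]:
    """Return (N, K) for sequences of a given length over a given alphabet.

    N = a**L is the number of distinct sequences.
    K = N * L * (a - 1) counts single-site changes summed over all sequences
    (an assumed reading of the slide formula).
    """
    n = alphabet_size ** length
    k = n * length * (alphabet_size - 1)
    return n, k


# DNA (a = 4) at the lengths used on the slides.
for L in (1, 2, 3, 300):
    n, k = sequence_space(4, L)
    print(f"L = {L}: N = {n:.2g}, K = {k:.2g}")
```

Python integers are exact at any size, so the L = 300 case needs no special handling; only the printing rounds the values.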

17 Similarity implies homology for sequences Similar morphologies need not imply homology because of convergence. Similar sequences do imply homology because convergence is improbable. GCCACGTTCGCGATCG GGCAGTCTCGCGATTT

19 Homologous DNA sequences GCCACGTTCGCGATCG GGCAGTCTCGCGATTT

20 Homologous DNA sequences Ancestral sequence GCCACTTTCGCGATCA Significantly similar sequences (such as from a BLAST search) are inferred to have come from a common ancestor GCCACGTTCGCGATCG GCCACGTTCGCGATCG GGCAGTCTCGCGATTT GGCAGTTTCGCGATTT Homologous sequences

21 Homologous DNA sequences: all the differences we see between homologs must have evolved since they diverged. [Figure: ancestral sequence GCCACTTTCGCGATCA at T0; homologous sequences GCCACGTTCGCGATCG, GCCACGTTCGCGATCG, GGCAGTCTCGCGATTT, GGCAGTTTCGCGATTT at T now]

22 Homologous DNA sequences Ancestral sequence GCCACTTTCGCGATCA T 0 GCCACTTTCGCGATCG GCCACTTTCGCGATCA T 1 GCCACGTTCGCGATCG GGCAGTCTCGCGATTT Homologous sequences

23 Homologous DNA sequences Ancestral sequence GCCACTTTCGCGATCA T 0 GCCACTTTCGCGATCG GCCACTTTCGCGATCG GCCACTTTCGCGATTA T 1 T 2 GCCACGTTCGCGATCG GGCAGTCTCGCGATTT Homologous sequences

24 Homologous DNA sequences Ancestral sequence GCCAGTTTCGCGATCT T 0 GCCAGTTTCGCGATCG T 1 GCCAGTTTCGCGATTA T 2 GCCAGGTTCGTGATCG T 3 GCCACGTTCGCGATCG GCCAGTCTCGCGATTA GGCAGTCTCGCGATTT T 4 T 5 T 6 GCCACGTTCGCGATCG GCCACGTTCGCGATCG GGCAGTCTCGCGATTT GGCAGTCTCGCGATTT T now Homologous sequences

25 Homologous DNA sequences Ancestral sequence GCCAGTTTCGCGATCT T 0 GCCAGTTTCGCGATCG T 1 GCCAGTTTCGCGATTA T 2 GCCAGGTTCGTGATCG T 3 GCCACGTTCGCGATCG GCCAGTCTCGCGATTA GGCAGTCTCGCGATTT T 4 T 5 T 6 GCCACGTTCGCGATCG GCCACGTTCGCGATCG GGCAGTCTCGCGATTT Homologous sequences GGCAGTCTCGCGATTT T now Homologous bases at a site

26 Rate of Evolution: changes per time (or per generation), per sequence and per site. [Figure: ancestral sequence GCCACTTTCGCGATCA at T0; after time t, homologous sequences GCCACGTTCGCGATCG, GCCACGTTCGCGATCG, GGCAGTCTCGCGATTT, GGCAGTTTCGCGATTT at T now] 6 differences per 16 sites per 2 sequences: $(6/16)/2 = 18.75\%$ per time t.
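The arithmetic on this slide can be reproduced directly from the two homologous sequences that differ at 6 sites. A minimal sketch, assuming the sequences are already aligned without gaps; `count_differences` is a name chosen here for illustration.

```python
def count_differences(seq1: str, seq2: str) -> int:
    """Count aligned positions at which two equal-length sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(seq1, seq2))


seq1 = "GCCACGTTCGCGATCG"
seq2 = "GGCAGTCTCGCGATTT"

differences = count_differences(seq1, seq2)
per_site = differences / len(seq1)
# Both lineages have been evolving since the common ancestor,
# so halve the pairwise value to get the change along one lineage.
per_lineage = per_site / 2

print(differences, per_site, per_lineage)
```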

27 Why divide by two? To estimate how one sequence changes over time. [Figure: ancestral sequence GCCAGTTTCGCGATCT at T0; GCCAGTTTCGCGATTA, GCCAGTCTCGCGATTA, GGCAGTCTCGCGATTT over time t; GGCAGTCTCGCGATTT at T now] 3 differences per 16 sites: $3/16 = 18.75\%$ per time t.

28 We usually don't know ancestral sequences, so we compare sequences to infer evolutionary changes. [Figure: unknown ancestor (?) at T0; after time t, homologous sequences GCCACGTTCGCGATCG, GCCACGTTCGCGATCG, GGCAGTCTCGCGATTT, GGCAGTTTCGCGATTT at T now] 6 differences per 16 sites per 2 sequences: $(6/16)/2 = 18.75\%$ per time t.

29 We usually don't know how much time has passed, so we calculate evolutionary distance as rate × time. [Figure: unknown ancestor (?) at T0; unknown time (?); homologous sequences GCCACGTTCGCGATCG, GCCACGTTCGCGATCG, GGCAGTCTCGCGATTT, GGCAGTTTCGCGATTT at T now] 6 differences per 16 sites per 2 sequences: $(6/16)/2 = 18.75\%$ divergence.

30 There may thus exist a Molecular Evolutionary Clock (Zuckerkandl & Pauling, 1965). [Figure: % amino acid differences vs. approximate duplication dates (mya) from vertebrate fossil records; labels: "Divergence between α and β or γ", "Divergence between β and γ"]

31 Different protein clocks tick at different rates:

33 A given large divergence can be attained from a fast rate and short time or a slow rate and a long time

35 Q: What is a substitution? A: A substitution is the fixation of a mutation in a population. It has been accepted by natural selection. Population of 5 individuals at generation t = 1

36 Q: What is a substitution? A: A substitution is the fixation of a mutation in a population. It has been accepted by natural selection. Population of 5 individuals at generation t = 1 t = 2

38 Q: What is a substitution? A: A substitution is the fixation of a mutation in a population. It has been accepted by natural selection. Population of 5 individuals at generation t = 1 t = 2: 2 mutations

39 Q: What is a substitution? A: A substitution is the fixation of a mutation in a population. It has been accepted by natural selection. Population of 5 individuals at generation t = 1 t = 2: 2 mutations t = 3

40 Q: What is a substitution? A: A substitution is the fixation of a mutation in a population. It has been accepted by natural selection. Population of 5 individuals at generation t = 1 t = 2: 2 mutations t = 3 t = 4: 1 substitution

41 Sequence differences between species are often assumed to be substitutions (fixed differences). Ancestor Species 1 Species 2

43 % identity (100 − % differences) underestimates evolutionary divergence! [Figure: % amino acid differences vs. approximate duplication dates (mya) from vertebrate fossil records]

44 Why Percent Identity (%ID) underestimates evolution: the more sequences evolve, the more changes we miss. [Figure label: ANCESTOR]

45 Why Percent Identity (%ID) underestimates divergence: the more sequences evolve, the more changes we miss. [Figure label: ANCESTOR] Multiple changes can hit the same site.

46 Why Percent Identity (%ID) underestimates divergence: the more sequences evolve, the more changes we miss. [Figure label: ANCESTOR] Multiple changes can hit the same site: 3 changes, 2 differences.

47 Why Percent Identity (%ID) underestimates divergence: the more sequences evolve, the more changes we miss. [Figure label: ANCESTOR] Multiple changes can hit the same site: 3 changes, 2 differences. Back changes can undo earlier changes.

48 Why Percent Identity (%ID) underestimates divergence: the more sequences evolve, the more changes we miss. [Figure label: ANCESTOR] Multiple changes can hit the same site: 3 changes, 2 differences. Back changes can undo earlier changes: 4 changes, 1 difference.

49 Why Percent Identity (%ID) underestimates divergence: the more sequences evolve, the more changes we miss. [Figure label: ANCESTOR] Multiple changes can hit the same site: 3 changes, 2 differences. Back changes can undo earlier changes: 4 changes, 1 difference. Parallel changes hide evolution: 6 changes, 1 difference.
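A small simulation makes the "hidden change" effect concrete. The sketch below follows a single lineage only, so it shows multiple hits and back changes but not parallel changes. It assumes that every substitution lands on a uniformly chosen site and picks a uniformly chosen different base; that model and the helper name `evolve` are illustrative assumptions, not part of the lecture.

```python
import random


def evolve(seq: str, n_changes: int, rng: random.Random, alphabet: str = "ACGT") -> str:
    """Apply n_changes substitutions, each at a uniformly chosen site."""
    sites = list(seq)
    for _ in range(n_changes):
        i = rng.randrange(len(sites))
        # Always switch to a different base, so every event is a real change.
        sites[i] = rng.choice([b for b in alphabet if b != sites[i]])
    return "".join(sites)


rng = random.Random(1)  # fixed seed so that reruns give the same sequences
ancestor = "GCCACTTTCGCGATCA"
for n_changes in (2, 8, 32):
    descendant = evolve(ancestor, n_changes, rng)
    observed = sum(a != b for a, b in zip(ancestor, descendant))
    print(f"{n_changes} changes -> {observed} observed differences")
```

The number of observed differences can never exceed the number of changes, and the gap tends to widen as more changes accumulate on the same 16 sites.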

51 The Poisson Correction Imagine substitutions raining down on sequences:

55 The Poisson Correction. Imagine substitutions raining down on sequences: 1. We want to estimate the average evolutionary distance λ (number of substitutions per site) from %ID = 100 × (p/N).

56 The Poisson Correction. Imagine substitutions raining down on sequences: 1. We want to estimate the average evolutionary distance λ (number of substitutions per site) from %ID = 100 × (p/N). 2. Assume substitutions occur independently by site and time.

57 The Poisson Correction. Imagine substitutions raining down on sequences: 1. We want to estimate the average evolutionary distance λ (number of substitutions per site) from %ID = 100 × (p/N). 2. Assume substitutions occur independently by site and time. 3. Each site has probability λ/N of mutating at distance λ, where N is assumed to be large. The average fraction of sites not mutated (p/N) is then $(1 - \lambda/N)^N \approx e^{-\lambda}$ (for large N).

58 The Poisson Correction. Imagine substitutions raining down on sequences: 1. We want to estimate the average evolutionary distance λ (number of substitutions per site) from %ID = 100 × (p/N). 2. Assume substitutions occur independently by site and time. 3. Each site has probability λ/N of mutating at distance λ, where N is assumed to be large. The average fraction of sites not mutated (p/N) is then $(1 - \lambda/N)^N \approx e^{-\lambda}$ (for large N). 4. Therefore, if we see p out of N sites not mutated and assume no back or parallel substitutions, we can estimate $\lambda = -\ln(p/N)$.

59 The Poisson Correction. Imagine substitutions raining down on sequences: 1. We want to estimate the average evolutionary distance λ (number of substitutions per site) from %ID = 100 × (p/N). 2. Assume substitutions occur independently by site and time. 3. Each site has probability λ/N of mutating at distance λ, where N is assumed to be large. The average fraction of sites not mutated (p/N) is then $(1 - \lambda/N)^N \approx e^{-\lambda}$ (for large N). 4. Therefore, if we see p out of N sites not mutated and assume no back or parallel substitutions, we can estimate $\lambda = -\ln(p/N)$. 5. Ex: a %ID of 38% implies $\lambda = -\ln(0.38) \approx 1$: about as many substitutions have occurred as the length of the sequence.
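The estimate in step 4 is a one-line function. The sketch below is illustrative: `poisson_distance` is a name chosen here, and apart from the 38% example from the slide, the percent identities in the loop are arbitrary.

```python
import math


def poisson_distance(fraction_identical: float) -> float:
    """Estimate substitutions per site as -ln(p/N) under the Poisson correction."""
    if not 0.0 < fraction_identical <= 1.0:
        raise ValueError("fraction_identical must be in (0, 1]")
    return -math.log(fraction_identical)


# 38 is the slide's example; the other values are arbitrary illustrations.
for percent_id in (90, 60, 38, 20):
    distance = poisson_distance(percent_id / 100)
    print(f"{percent_id}% ID -> {distance:.2f} substitutions per site")
```

At high %ID the corrected distance is close to the raw fraction of differences; the two diverge as %ID falls, which is the underestimate the preceding slides describe.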

60 Poisson-Corrected Evolutionary Distance vs. %ID [Figure: substitutions per site plotted against %ID, with the 38% ID example marked]

61 The effect of alphabet size. DNA (a = 4), L = 1 (A, G, T, C): $N = a^L = 4^1 = 4$, $K = NL(a-1) = 12$. Protein (a = 20), L = 1: $N = a^L = 20^1 = 20$, $K = NL(a-1) = 380$. At a given position, randomly evolving proteins are less likely than DNA to mutate back ("revert") to an earlier state.

62 When should you use the Poisson Correction? DNA (a = 4), L = 1 (A, G, T, C): $N = a^L = 4^1 = 4$, $K = NL(a-1) = 12$. Protein (a = 20), L = 1: $N = a^L = 20^1 = 20$, $K = NL(a-1) = 380$. The Poisson correction assumes no back or parallel substitutions, so it is most appropriate for proteins at short evolutionary distances.

64 Improving the Poisson correction: PAM Amino Acid Substitution Matrices (Margaret Dayhoff)

65 Improving the Poisson correction: PAM Amino Acid Substitution Matrices (Margaret Dayhoff). Basic idea: 1. Collect a big dataset of alignments of closely related proteins.

66 Improving the Poisson correction: PAM Amino Acid Substitution Matrices (Margaret Dayhoff). Basic idea: 1. Collect a big dataset of alignments of closely related proteins. 2. Count amino acid changes and the total composition of amino acids in the dataset.

67 Improving the Poisson correction: PAM Amino Acid Substitution Matrices (Margaret Dayhoff). Basic idea: 1. Collect a big dataset of closely related proteins. 2. Count amino acid changes and the total composition of amino acids in the dataset. 3. Calculate the transition probabilities for any amino acid to substitute to another amino acid after 1% sequence divergence.

68 Improving the Poisson correction: PAM Amino Acid Substitution Matrices (Margaret Dayhoff). Basic idea: 1. Collect a big dataset of closely related proteins. 2. Count amino acid changes and the total composition of amino acids in the dataset. 3. Calculate from this the transition probabilities for any amino acid to substitute into any other amino acid after 1% sequence divergence. 4. This defines the PAM1 substitution matrix ("Point Accepted Mutation", where "accepted" implies "by natural selection").

69 Improving the Poisson correction: PAM Amino Acid Substitution Matrices (Margaret Dayhoff). Basic idea: 1. Collect a big dataset of closely related proteins. 2. Count amino acid changes and the total composition of amino acids in the dataset. 3. Calculate from this the transition probabilities for any amino acid to substitute into any other amino acid after 1% sequence divergence. 4. This defines the PAM1 matrix ("Point Accepted Mutation", where "accepted" implies "by natural selection"). 5. Assume that the transition probabilities after N% sequence divergence are given by the N-th power of the PAM1 matrix. Ex: $\mathrm{PAM250} = (\mathrm{PAM1})^{250}$.

70 Example: part of the PAM15 matrix of Jones, Taylor and Thornton (1998) [Table: rows and columns indexed by amino acids A R N D C Q E G H I L K ...]

71 Assumptions of PAM Substitution Matrices 1. Site Independence: Probability of substitution at a site is independent of amino acids in all other sites in the sequence.

72 Assumptions of PAM Substitution Matrices. 1. Site Independence: Probability of substitution at a site is independent of amino acids in all other sites. 2. Markov Property: Probability of substitution at a site depends only on the site's present state, not on its history. The probability of A becoming B at 2% divergence is $\mathrm{PAM2}(B \mid A) = \sum_x \mathrm{PAM1}(B \mid x)\,\mathrm{PAM1}(x \mid A)$, summing over every intermediate amino acid x on the path A → x → B.

73 Assumptions of PAM Substitution Matrices. 1. Site Independence: Probability of substitution at a site is independent of amino acids in all other sites. 2. Markov Property: Probability of substitution at a site depends only on the site's present state, not on its history. $\mathrm{PAM2} = \mathrm{PAM1} \cdot \mathrm{PAM1} = (\mathrm{PAM1})^2$; $\mathrm{PAM3} = \mathrm{PAM2} \cdot \mathrm{PAM1} = (\mathrm{PAM1})^3$; $\mathrm{PAM}n = \mathrm{PAM}(n-1) \cdot \mathrm{PAM1} = (\mathrm{PAM1})^n$.

74 Assumptions of PAM Substitution Matrices. 1. Site Independence: Probability of substitution at a site is independent of amino acids in all other sites. 2. Markov Property: Probability of substitution at a site depends only on the site's present state, not on its history. 3. Sufficient Sample Size: Sequence composition is the same as in the alignments used to make the matrix.

75 Assumptions of PAM Substitution Matrices. 1. Site Independence: Probability of substitution at a site is independent of amino acids in all other sites. 2. Markov Property: Probability of substitution at a site depends only on the site's present state, not on its history. 3. Sufficient Sample Size: Sequence composition is the same as in the alignments used to make the matrix. 4. Stationarity: The probabilities of substitutions do not change with time.
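Under the Markov and stationarity assumptions, larger PAM matrices are just repeated matrix products of PAM1. The sketch below uses a made-up three-letter alphabet with invented probabilities (not Dayhoff's values), and the helper names `matmul` and `matrix_power` are chosen here for illustration. Entry `[b][a]` holds P(b | a), matching the sum on slide 72, so each column sums to 1.

```python
def matmul(p, q):
    """Multiply two square matrices stored as lists of rows."""
    n = len(p)
    return [[sum(p[i][x] * q[x][j] for x in range(n)) for j in range(n)]
            for i in range(n)]


def matrix_power(m, n):
    """Return the n-th power of m (n >= 1) by repeated multiplication."""
    result = m
    for _ in range(n - 1):
        result = matmul(result, m)
    return result


# Hypothetical 3-letter "PAM1": entry [b][a] is P(b | a); columns sum to 1.
# These numbers are invented for illustration only.
toy_pam1 = [
    [0.990, 0.004, 0.002],
    [0.006, 0.990, 0.008],
    [0.004, 0.006, 0.990],
]

toy_pam2 = matrix_power(toy_pam1, 2)      # PAM2 = PAM1 * PAM1
toy_pam250 = matrix_power(toy_pam1, 250)  # PAM250 = (PAM1)^250

for row in toy_pam250:
    print([round(value, 3) for value in row])
```

Repeated multiplication keeps the code short; for large powers, exponentiation by squaring or a linear algebra library would be the usual choice.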

76 Q: What does PAM % change to a protein mean?

77 Q: What does PAM % change to a protein mean? A: a little less than 82% divergence, i.e. just over 18% ID

78 Part of the PAM250 matrix of Jones, Taylor and Thornton (1998) [Table: rows and columns indexed by amino acids A R N D C Q E G H I L K ...]

79 Part of the PAM1000 matrix of Jones, Taylor and Thornton (1998) [Table: rows and columns indexed by amino acids A R N D C Q E G H I L K ...] PAM matrix transition probabilities converge to the composition of the database used to make them.

81 Odds are ratios of probabilities. Example 1: odds of rolling 7 or 11 versus rolling doubles.

82 Odds are ratios of probabilities. Example 1: odds of rolling 7 or 11: $2\,[p(1,6) + p(2,5) + p(3,4) + p(5,6)]$

83 Odds are ratios of probabilities. Example 1: odds of rolling 7 or 11 versus rolling doubles: $\frac{2\,[p(1,6) + p(2,5) + p(3,4) + p(5,6)]}{p(1,1) + p(2,2) + p(3,3) + p(4,4) + p(5,5) + p(6,6)}$

84 Odds are ratios of probabilities. Example 1: odds of rolling 7 or 11 versus rolling doubles: $\frac{2\,[p(1,6) + p(2,5) + p(3,4) + p(5,6)]}{p(1,1) + p(2,2) + p(3,3) + p(4,4) + p(5,5) + p(6,6)} = \frac{8/36}{6/36}$ = 4 : 3 odds.

85 Odds are ratios of probabilities. Example 1: odds of rolling 7 or 11 versus rolling doubles: $\frac{2\,[p(1,6) + p(2,5) + p(3,4) + p(5,6)]}{p(1,1) + p(2,2) + p(3,3) + p(4,4) + p(5,5) + p(6,6)} = \frac{8/36}{6/36}$ = 4 : 3 odds. Example 2: odds of rolling doubles versus a poker flush: $\frac{p(1,1) + p(2,2) + p(3,3) + p(4,4) + p(5,5) + p(6,6)}{p(5\spadesuit) + p(5\heartsuit) + p(5\diamondsuit) + p(5\clubsuit)} = \frac{6/36}{4 \times (13/52 \times 12/51 \times 11/50 \times 10/49 \times 9/48)} \approx 84 : 1$ odds.
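Both odds can be checked by enumerating the dice outcomes and multiplying out the flush probability with exact fractions. This is a sketch of the arithmetic only; the variable names are chosen here, and the flush probability follows the slide's formula (five cards of one suit, times four suits).

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two dice.
rolls = list(product(range(1, 7), repeat=2))

p_seven_or_eleven = Fraction(sum(a + b in (7, 11) for a, b in rolls), len(rolls))
p_doubles = Fraction(sum(a == b for a, b in rolls), len(rolls))

# Five cards of one given suit: 13/52 * 12/51 * 11/50 * 10/49 * 9/48; four suits.
p_flush = 4 * Fraction(13 * 12 * 11 * 10 * 9, 52 * 51 * 50 * 49 * 48)

print("7 or 11 versus doubles:", p_seven_or_eleven / p_doubles)
print("doubles versus flush:", float(p_doubles / p_flush))
```

Using `Fraction` keeps the first ratio exact, which makes the 4 : 3 result easy to read off.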

86 Odds versus Likelihood Ratios. Odds can be made of any probabilities, even over different event spaces: $\frac{p(1,1) + p(2,2) + p(3,3) + p(4,4) + p(5,5) + p(6,6)}{p(5\spadesuit) + p(5\heartsuit) + p(5\diamondsuit) + p(5\clubsuit)} \approx 84 : 1$ odds.

87 Odds versus Likelihood Ratios. Odds can be made of any probabilities, even over different event spaces: $\frac{p(1,1) + p(2,2) + p(3,3) + p(4,4) + p(5,5) + p(6,6)}{p(5\spadesuit) + p(5\heartsuit) + p(5\diamondsuit) + p(5\clubsuit)} \approx 84 : 1$ odds. Likelihood ratios must be made over the same events. Example: the likelihood ratio of the word HELLO in a random sequence of letters with English frequencies, versus uniform frequencies: $\frac{p(H \mid \text{Eng.})\,p(E \mid \text{Eng.})\,p(L \mid \text{Eng.})\,p(L \mid \text{Eng.})\,p(O \mid \text{Eng.})}{p(H \mid \text{Unif.})\,p(E \mid \text{Unif.})\,p(L \mid \text{Unif.})\,p(L \mid \text{Unif.})\,p(O \mid \text{Unif.})}$

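88 Likelihood Ratios. Model 1: English. [Figure: probabilities of letters in English, in the order E T I O A N S H R L D U C Y G W M B F P V K X Q J Z] $p(H \mid \text{Eng.})\,p(E \mid \text{Eng.})\,p(L \mid \text{Eng.})\,p(L \mid \text{Eng.})\,p(O \mid \text{Eng.})$

89 Likelihood Ratios compare the likelihoods of the same event in two different models. Model 1: English. [Figure: probabilities of letters in English, in the order E T I O A N S H R L D U C Y G W M B F P V K X Q J Z] Model 2: Uniform. [Figure: uniform probabilities (= 1/26) for the same letters] $\frac{p(H \mid \text{Eng.})\,p(E \mid \text{Eng.})\,p(L \mid \text{Eng.})\,p(L \mid \text{Eng.})\,p(O \mid \text{Eng.})}{p(H \mid \text{Unif.})\,p(E \mid \text{Unif.})\,p(L \mid \text{Unif.})\,p(L \mid \text{Unif.})\,p(O \mid \text{Unif.})}$

90 Likelihood Ratios compare the likelihoods of the same event in two different models. [Figure: letter probabilities, in the order E T I O A N S H R L D U C Y G W M B F P V K X Q J Z] $\frac{p(H \mid \text{Eng.})\,p(E \mid \text{Eng.})\,p(L \mid \text{Eng.})\,p(L \mid \text{Eng.})\,p(O \mid \text{Eng.})}{p(H \mid \text{Unif.})\,p(E \mid \text{Unif.})\,p(L \mid \text{Unif.})\,p(L \mid \text{Unif.})\,p(O \mid \text{Unif.})}$

91 Likelihood Ratios compare the likelihoods of the same event in two different models. [Figure: letter probabilities, in the order E T I O A N S H R L D U C Y G W M B F P V K X Q J Z] $\frac{p(H \mid \text{Eng.})\,p(E \mid \text{Eng.})\,p(L \mid \text{Eng.})\,p(L \mid \text{Eng.})\,p(O \mid \text{Eng.})}{p(H \mid \text{Unif.})\,p(E \mid \text{Unif.})\,p(L \mid \text{Unif.})\,p(L \mid \text{Unif.})\,p(O \mid \text{Unif.})} = \frac{p(H \mid \text{Eng.})\,p(E \mid \text{Eng.})\,p(L \mid \text{Eng.})\,p(L \mid \text{Eng.})\,p(O \mid \text{Eng.})}{(1/26)^5} \approx 12.8$

92 Likelihood Ratios compare the likelihoods of the same event in two different models. [Figure: letter probabilities, in the order E T I O A N S H R L D U C Y G W M B F P V K X Q J Z] $\frac{p(H \mid \text{Eng.})\,p(E \mid \text{Eng.})\,p(L \mid \text{Eng.})\,p(L \mid \text{Eng.})\,p(O \mid \text{Eng.})}{p(H \mid \text{Unif.})\,p(E \mid \text{Unif.})\,p(L \mid \text{Unif.})\,p(L \mid \text{Unif.})\,p(O \mid \text{Unif.})} = \frac{p(H \mid \text{Eng.})\,p(E \mid \text{Eng.})\,p(L \mid \text{Eng.})\,p(L \mid \text{Eng.})\,p(O \mid \text{Eng.})}{(1/26)^5} \approx 12.8$. HELLO is about 13 times more likely in a sequence with English letter frequencies than in a random one.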

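93 Independence of elementary events makes calculating compound event likelihoods easy: $\frac{p(\text{HELLO} \mid \text{Eng.})}{p(\text{HELLO} \mid \text{Unif.})} = \frac{p(H \mid \text{Eng.})\,p(E \mid \text{Eng.})\,p(L \mid \text{Eng.})\,p(L \mid \text{Eng.})\,p(O \mid \text{Eng.})}{p(H \mid \text{Unif.})\,p(E \mid \text{Unif.})\,p(L \mid \text{Unif.})\,p(L \mid \text{Unif.})\,p(O \mid \text{Unif.})} \approx 12.8$

94 Independence of elementary events makes calculating compound event likelihoods easy: $\frac{p(\text{HELLO} \mid \text{Eng.})}{p(\text{HELLO} \mid \text{Unif.})} = \frac{p(H \mid \text{Eng.})\,p(E \mid \text{Eng.})\,p(L \mid \text{Eng.})\,p(L \mid \text{Eng.})\,p(O \mid \text{Eng.})}{p(H \mid \text{Unif.})\,p(E \mid \text{Unif.})\,p(L \mid \text{Unif.})\,p(L \mid \text{Unif.})\,p(O \mid \text{Unif.})} = \prod_{x \in \text{HELLO}} \frac{p(x \mid \text{Eng.})}{p(x \mid \text{Unif.})} \approx 12.8$

95 Log-Likelihood Ratios let you add instead of multiply (avoiding overflow, etc.): $\frac{p(\text{HELLO} \mid \text{Eng.})}{p(\text{HELLO} \mid \text{Unif.})} \approx 1.4 \times 3.2 \times 1.2 \times 1.2 \times 2.0 \approx 12.8$, so $\log_2(1.4) + \log_2(3.2) + 2\log_2(1.2) + \log_2(2.0) \approx \log_2(12.8)$

96 Log-Likelihood Ratios of symbols are called LOD Scores ("LOD" stands for "Log-Odds"): $\frac{p(\text{HELLO} \mid \text{Eng.})}{p(\text{HELLO} \mid \text{Unif.})} \approx 1.4 \times 3.2 \times 1.2 \times 1.2 \times 2.0 \approx 12.8$, so $\log_2(1.4) + \log_2(3.2) + 2\log_2(1.2) + \log_2(2.0) \approx \log_2(12.8)$, that is, $S(H) + S(E) + 2\,S(L) + S(O) \approx \log_2(12.8)$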
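The additive form is easy to try out. The sketch below takes the per-letter likelihood ratios as the rounded values that appear inside the log2 terms on the slide (1.4, 3.2, 1.2, 2.0); the dictionary `ratios` and the function `lod_score` are names chosen here. Because the inputs are rounded, the product comes out close to, not exactly at, the slide's 12.8.

```python
import math

# Per-letter likelihood ratios p(x | English) / p(x | uniform),
# using the rounded values shown on the slide.
ratios = {"H": 1.4, "E": 3.2, "L": 1.2, "O": 2.0}


def lod_score(word: str) -> float:
    """Sum per-letter log2 likelihood ratios (assumes independent letters)."""
    return sum(math.log2(ratios[letter]) for letter in word)


score = lod_score("HELLO")
print(f"LOD score: {score:.2f} bits")
print(f"equivalent likelihood ratio: {2 ** score:.1f}")
```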

97 Scores and Likelihoods: a positive score means the symbol is more likely in model 1; a negative score means it is more likely in model 2. [Figure: scores for the letters E T I O A N S H R L D U C Y G W M B F P V K X Q J Z]

98 Scores and Likelihoods. We use log2 for scores, so +1 means an event is twice as likely in model 1 as in model 2, and −1 means an event is half as likely in model 1 as in model 2. O is twice as likely in English as at random; M is half as likely in English as at random. [Figure: scores for the letters E T I O A N S H R L D U C Y G W M B F P V K X Q J Z]

100 To score pairwise alignments, elementary events are pairs of amino acids or nucleotides in a column: $S(\text{alignment}) = S(\text{column 1}) + S(\text{column 2}) + S(\text{column 3}) + S(\text{column 4}) + S(\text{column 5})$

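102 For a column pairing residues a and b: $S(a, b) = \log_2 \frac{p(a, b \mid \text{evolution})}{p(a, b \mid \text{chance})}$

103 For a column pairing residues a and b: $S(a, b) = \log_2 \frac{p(a, b \mid \text{evolution})}{p(a, b \mid \text{chance})} = \log_2 \frac{p(a, b)}{p(a)\,p(b)}$

104 Model 1: Evolution, from substitution matrices. For a column pairing residues a and b: $S(a, b) = \log_2 \frac{p(a, b \mid \text{evolution})}{p(a, b \mid \text{chance})} = \log_2 \frac{p(a, b)}{p(a)\,p(b)}$

105 Model 1: Evolution, from substitution matrices. For a column pairing residues a and b: $S(a, b) = \log_2 \frac{p(a, b \mid \text{evolution})}{p(a, b \mid \text{chance})} = \log_2 \frac{p(a, b)}{p(a)\,p(b)}$. Model 2: Chance, two picks from a random urn with database composition.

106 Model 1: Evolution, from substitution matrices. For a column pairing residues a and b: $S(a, b) = \log_2 \frac{p(a, b \mid \text{evolution})}{p(a, b \mid \text{chance})} = \log_2 \frac{p(a, b)}{p(a)\,p(b)}$. Model 2: Chance, two picks from a random urn with database composition: the probability of pairs when sliding unrelated sequences past each other.

108 Unlike substitution matrices, score matrices are symmetric. Model 1: Evolution. Model 2: Chance. $S(a, b) = \log_2 \frac{p(a, b \mid \text{evolution})}{p(a, b \mid \text{chance})} = \log_2 \frac{p(a, b)}{p(a)\,p(b)} = \log_2 \frac{p(b, a)}{p(b)\,p(a)} = S(b, a)$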
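The column score and its symmetry can be sketched in a few lines. Everything numeric below is invented to exercise the formula: the two-residue background and the pair probabilities are hypothetical and do not form a consistent probability distribution. Keying `pair_prob` by an unordered pair is a design choice made here so that S(a, b) = S(b, a) holds by construction.

```python
import math


def pair_score(p_pair: float, p_a: float, p_b: float) -> float:
    """log2 odds of residues a and b being aligned by evolution versus by chance."""
    return math.log2(p_pair / (p_a * p_b))


def alignment_score(columns, pair_prob, background):
    """Sum the column scores of an ungapped pairwise alignment."""
    return sum(
        pair_score(pair_prob[frozenset((a, b))], background[a], background[b])
        for a, b in columns
    )


# Hypothetical background frequencies and pair probabilities (illustration only).
background = {"L": 0.10, "I": 0.05}
pair_prob = {
    frozenset(("L", "L")): 0.04,
    frozenset(("I", "I")): 0.02,
    frozenset(("L", "I")): 0.01,
}

columns = [("L", "L"), ("L", "I"), ("I", "I")]
print(round(alignment_score(columns, pair_prob, background), 2))
```

Real score matrices such as PAM or BLOSUM tabulate these log-odds once, so scoring an alignment reduces to table lookups and a sum.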

109 BLOSUM62 Score Matrix

110 BLOSUM62 Score Matrix [Figure: Bedell et al., Figure 4-3, amino acid chemical relationships; labels: Isoleucine, Leucine, Phenylalanine]

111 PAM vs BLOSUM. PAM: starts from alignments of closely related proteins; builds trees to avoid overcounting related sequences; inferred ancestral states are used to estimate transition probabilities; transition probabilities at larger evolutionary distances are extrapolated from those at short distances; larger PAM numbers model bigger distances (Ex: PAM250 models a bigger distance than PAM100). BLOSUM: starts from alignments of both closely and distantly related proteins; clusters sequences by single linkage to avoid overcounting; all pairs in a clustered alignment are used to calculate pair probabilities; transition probabilities at different evolutionary distances are estimated empirically from clusters made at different minimal percent identities; larger BLOSUM numbers model shorter distances, from higher %ID clusters (Ex: BLOSUM80 models a shorter distance than BLOSUM62).

112 Other Amino Acid Substitution/Score Matrices. Some matrices are updates of the original Dayhoff method with more data or some technical refinements (Ex: JTT, by Jones, Taylor and Thornton; Gonnet, Benner and Cohen). Some matrices are for specialized kinds or parts of proteins (Ex: the JTT transmembrane protein matrix; the Goldstein secondary structure matrices). Some matrices have different assumptions (Ex: BLOSUM does not assume the Markov property; its matrices are computed independently from alignments at different %IDs). BLOSUM matrices are labeled by expected %ID, so BLOSUM30 models a bigger distance than BLOSUM62, whereas PAM100 models a smaller distance than PAM250!

114 Matrix models of DNA evolution: the Jukes-Cantor model. Rate matrix (rows and columns in the order A, G, C, T):

      A  G  C  T
   A  *  α  α  α
   G  α  *  α  α
   C  α  α  *  α
   T  α  α  α  *

115 Matrix models of DNA evolution: the Jukes-Cantor model. [Figure: pools A, G, C, T]

116 Matrix models of DNA evolution: the Jukes-Cantor model. [Figure: flows out of pools A, C, G, T, shown with the rate matrix of slide 114]

117 Matrix models of DNA evolution: the Jukes-Cantor model. [Figure: flows into pools A, C, G, T, shown with the rate matrix of slide 114]

118 Matrix models of DNA evolution: the Jukes-Cantor model. Because of symmetry, sequences evolve to the uniform base composition (25% A, 25% G, 25% C, 25% T).
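The pull toward uniform composition can be seen by iterating the pool-and-flow picture numerically. The sketch below is a discrete-time approximation, which is an assumption made here (the slides do not specify one): in each step every base changes to each of the other three with probability α. The starting composition, the value of α and the function name `jukes_cantor_step` are arbitrary choices for illustration.

```python
def jukes_cantor_step(freqs, alpha):
    """One time step: each base flows to each other base with probability alpha."""
    total = sum(freqs.values())
    return {
        # Keep what does not flow out, add what flows in from the other pools.
        base: f * (1 - 3 * alpha) + alpha * (total - f)
        for base, f in freqs.items()
    }


freqs = {"A": 0.7, "G": 0.1, "C": 0.1, "T": 0.1}  # arbitrary skewed start
for _ in range(200):
    freqs = jukes_cantor_step(freqs, alpha=0.01)

print({base: round(f, 3) for base, f in freqs.items()})
```

Each step shrinks every pool's deviation from 25% by the factor (1 − 4α), so any starting composition decays toward uniform.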

119 Matrix models of DNA evolution: the Kimura model. Rate matrix (rows and columns in the order A, G, C, T):

      A  G  C  T
   A  *  β  α  α
   G  β  *  α  α
   C  α  α  *  β
   T  α  α  β  *

120 Matrix models of DNA evolution: the Kimura model. [Figure: pools A, G, C, T]

121 Matrix models of DNA evolution: the Kimura model. [Figure: pools A, G, C, T]

122 Matrix models of DNA evolution: the Kimura model. [Figure: pools A, C, G, T, shown with the rate matrix of slide 119]