[Figure: flow of genetic information. Labels: replication (DNA pol), transcription (RNA pol; DNA to mRNA), translation (ribosome, tRNA, ARS; mRNA to protein).]
[Figure, panels A and B: tRNA structure, showing the acceptor stem, D-loop, TψC loop, anticodon loop and variable loop.]
[Figure: relative tRNA gene copy number (0.0 to 0.8) against relative codon frequency (0.0 to 0.7), with codons grouped into 2-box, 3-box, 4-box and 6-box families.]
i j c a o ac c a c c =[o 1,...,o 64 ] o o C a C a k a L A A 1 ˆr C A c a C a k a o ac L F a f ac r ac a a c a a c a c a
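$$ g_c = \frac{o_c}{\sum_{c \in C} o_c}, \qquad g_{ac} = \frac{o_{ac}}{\sum_{a \in A} \sum_{c \in C_a} o_{ac}} $$

$$ f_{ac} = \frac{o_{ac}}{\sum_{c \in C_a} o_{ac}} $$

$$ r_{ac} = \frac{o_{ac}}{\frac{1}{k_a} \sum_{c \in C_a} o_{ac}} = \frac{o_{ac}}{o_a} k_a $$

$$ w_{ac} = \frac{o_{ac}}{\max_{c \in C_a} o_{ac}} $$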
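The per-family quantities above (relative frequency f_ac, relative synonymous codon usage r_ac, and relative adaptiveness w_ac) can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function name `codon_stats` and the alanine counts are made up for the example, and the family is assumed to have at least one observed codon.

```python
def codon_stats(counts, family):
    """Return {codon: (f_ac, r_ac, w_ac)} for one synonymous codon family.

    counts: dict mapping codon -> observed count (o_ac)
    family: list of the synonymous codons C_a of one amino acid
    """
    observed = [counts.get(c, 0) for c in family]
    o_a = sum(observed)          # total count for the amino acid
    k_a = len(family)            # number of synonymous codons
    o_max = max(observed)        # count of the most frequent codon
    if o_a == 0:
        raise ValueError("amino acid not observed")
    stats = {}
    for codon, o_ac in zip(family, observed):
        f_ac = o_ac / o_a            # relative frequency within the family
        r_ac = o_ac / (o_a / k_a)    # observed over uniform expectation
        w_ac = o_ac / o_max          # relative to the most used codon
        stats[codon] = (f_ac, r_ac, w_ac)
    return stats


# Illustrative (hypothetical) counts for alanine.
ala = codon_stats({"GCT": 40, "GCC": 30, "GCA": 20, "GCG": 10},
                  ["GCT", "GCC", "GCA", "GCG"])
```

Under uniform usage every r_ac equals 1, and the r_ac values of a family always sum to k_a.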
= o o o = c C o c C C o = o e o e o o e e e = a A n a o a k a
o a a o a a k a f f = a A F a ( a, a ) F a a a a a ( ) a ( a, a )= ac c C a f ac,f B(z z all ) = ( ) ( )
= { c C } = (, ) E c = f c e c f c e c = b 1 b 2 b 3 c
L L L L w ac w ac = f ac e ac w f ac e ac = b 1 b 2 b 3 b i i L =( wc(i)) 1 1 L = ( L i=1 L wc(i)) i=1
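$$ \mathrm{CAI} = \Big( \prod_{i=1}^{L} w_{c(i)} \Big)^{1/L} = \exp\Big( \frac{1}{L} \sum_{i=1}^{L} \ln w_{c(i)} \Big) = \frac{\exp\big( \frac{1}{L} \sum_{i=1}^{L} \ln o_{ac(i)} \big)}{\exp\big( \frac{1}{L} \sum_{i=1}^{L} \ln o_{a,\max}(i) \big)} = \exp\Big( \frac{1}{o} \sum_{c \in C} o_c \ln w_c \Big) $$

$$ w_c = \frac{o_c}{E[o_c]} $$

E[o c ] o c c E[o c ] (b 1 b 2 b 3 )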
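A geometric mean of per-codon weights, as in the expression above, is usually computed in log space to avoid underflow on long sequences. The sketch below assumes the weights are strictly positive and already available in a dictionary; the name `cai` and the two example weights are illustrative only.

```python
import math


def cai(codons, w):
    """Geometric mean of the weights w[c] over the codons of a sequence.

    codons: list of codon strings (the sequence c(1), ..., c(L))
    w: dict mapping codon -> weight in (0, 1]; codons without a weight
       (for example Met, Trp or stop codons) are skipped
    """
    logs = [math.log(w[c]) for c in codons if c in w]
    if not logs:
        raise ValueError("no codon with a defined weight")
    # exp of the mean log weight equals the L-th root of the product
    return math.exp(sum(logs) / len(logs))


# Hypothetical weights: the geometric mean of 1.0 and 0.25 is 0.5.
example = cai(["GCT", "GCC"], {"GCT": 1.0, "GCC": 0.25})
```

Skipping codons without a weight is a design choice of this sketch; the document may treat single-codon amino acids differently.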
= ( 1 o c C w c ) 1 wc w c = w 0 c w +1 c w +2 c
c w c = f c fc fc fc c = 1 wc o c C w W c c t c W c = t (1 s ct )T ct s ct T ct t c W c W ac wac = W ac c C a W c w n = ( 1 ) o c wc o c C
s ct = U U σ U U U = M( ) M( ) = o! (o c!) c C c C f oc c
o c f c o c o = a A B a o a o o a a o B a B a = (o c e c )2 e c c C a χ 2 o c e c c χ 2 χ 2
χ 2 = 1 o a A o ac k 1 a k c C a 1 a o ac c a k a a χ 2 Z a f ac Z a = o a c C a fac 2 1 o a 1 N a = Z 1 a N a k a N a Z a k K = k K n k N a=k N a=k = 1 N a n k a K k
> Z Z k=3 = 1 ( 2 ( 1) 1 2 +( 1 2 3 Z k=2 3Z k=4 3 ) 1 +( 3 5Z k=6 5 ) 1) = a A N a χ 2
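The homozygosity Z_a = (o_a Σ f_ac² − 1)/(o_a − 1) and the per-amino-acid effective number of codons N_a = 1/Z_a that appear in the expressions above can be sketched as follows. This covers only a single amino acid; how the N_a values are combined across degeneracy classes is not reproduced here. The function names are hypothetical, and the sketch assumes at least two observed codons for the amino acid.

```python
def homozygosity(counts):
    """Z_a = (o_a * sum(f_ac^2) - 1) / (o_a - 1) for one amino acid.

    counts: list of observed counts o_ac of the synonymous codons
    """
    o_a = sum(counts)
    if o_a < 2:
        raise ValueError("need at least two observed codons")
    sum_f2 = sum((o / o_a) ** 2 for o in counts)
    return (o_a * sum_f2 - 1) / (o_a - 1)


def effective_codons(counts):
    """N_a = 1 / Z_a; undefined when the estimate of Z_a is not positive."""
    z = homozygosity(counts)
    if z <= 0:
        raise ValueError("homozygosity estimate is not positive")
    return 1.0 / z


# Uniform usage of four codons gives N_a slightly above 4 (finite-sample
# correction); exclusive use of one codon gives N_a = 1.
n_uniform = effective_codons([25, 25, 25, 25])
n_extreme = effective_codons([100, 0, 0, 0])
```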
X x 1,x 2,...x k X = I(o c ) c C a o c c C a a I( ) 1 o c 1 0 o =[5, 4, 0, 1] x a p x 2 k 1 [p 1,p 2,p 1 +p 2,p 3,...,p 1 +p 2 +...+p k ] 1 T i+1 = T i n 1 n n n = T n 1 1 T = QΛQ T Λ n = QΛ (n 1) Q T 1
p 1 p 2 p 1 p 1 D 0 p 3 0 0 0 T Compute new state vector s p 2 p 1 +p 2 p T 1 +p 2 c 1 p 3 p 1 +p 3 p 2 +p 3 p 1 +p 2 +p 3 Repeat until end of sequence p 2 p 3 p 1 +p 3 p 2 +p 3 p 1 +p 2 +p 3 Sum the state vector c 2 c 3 s 1 s i+1 s n D 2 k 1 k = D n p z (z) =p x (x) p y (y) = i {k y,...,k} p x (i) p y (k i + 1) k p z p x + p y 1 k 1 x + y 1 x y
n = i o a > 0 i=1 x = a A x a x n nk =1 P (X x),
= a A F a E a F a E a H a (H a )= 2 k a E a = H a (H a ) = H a 2 k a H a a H a = f ac 2 f ac c C a a n H a = p a (c)p a (c c ) ka p a (c c ), i=2 p a (c) c p a (c c ) c c a
(H a )= o 1 k a. E a = (H a) H a (H a ) = 2 k a H a 2 k a = a A F a E a
p c c p c f c = c C f c p c = o + o o + o
=1 2p
= o o o = 1 M a K L a A L M a K M a M a =2 o ac o ac e ac c C a e ac K K = 1 (k a 1) 1 L 2 a A 1 /2 =
S k k S a = 1 k a (k a 1) c C a (r ac 1) 2 r ac k a a = a A F a S a F a 1/18
v(c) c 9 v(c) = (A(c i ),A(c)) i=1 A(c) d
β i i(g) g i E i (c) i c β 1 β 3
w ac w ac = o ac o ac o ac o ac
= 2 G(G 1) G i,j {1 ( (i), (j) )}
[Figure: normalized mean, (x − x̄)/s_x, and coefficient of variation, s_x/x̄, of CAI, Fop, CBI and Nc as a function of GC content (0 to 100).]
[Figure: normalized mean and coefficient of variation of CAI, Fop, CBI and Nc as a function of sequence length (0 to 500).]
[Figure: normalized mean and log CV of CAI, Fop, CBI and Nc as a function of the fraction of 4- and 6-fold degenerate codons (0.0 to 1.0).]

With a ratio d = d_i / d_{i-1} = 1/2 between successive values, the set {(1/2)^0, (1/2)^1, (1/2)^2, (1/2)^3} normalizes to {8/15, 4/15, 2/15, 1/15}.
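The normalization of the geometric series with ratio 1/2 to {8/15, 4/15, 2/15, 1/15} can be checked with exact rational arithmetic. The helper name `geometric_weights` is an assumption made for this sketch.

```python
from fractions import Fraction


def geometric_weights(n, ratio=Fraction(1, 2)):
    """Return n weights proportional to ratio**0, ..., ratio**(n-1), summing to 1."""
    raw = [ratio ** i for i in range(n)]   # 1, 1/2, 1/4, ...
    total = sum(raw)                       # 15/8 for n = 4
    return [x / total for x in raw]


weights = geometric_weights(4)
```

For n = 4 the raw values sum to 15/8, so dividing by the sum gives 8/15, 4/15, 2/15 and 1/15.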
[Figure: normalized mean and log CV of CAI, Fop, CBI and Nc as a function of the degree of codon discrepancy (0.0 to 1.0).]

$$ Y = \sum_{a \in A} F_a Y_a $$
[Figure: normalized mean and log CV of CAI, Fop, CBI and Nc as a function of the degree of amino acid discrepancy (0.0 to 1.0).]

= a A φ F a A φ F a
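$$ \frac{d[\mathrm{Protein}]}{dt} = k_s [\mathrm{mRNA}] - k_d [\mathrm{Protein}] $$

At steady state,

$$ k_s = k_d \frac{[\mathrm{Protein}]}{[\mathrm{mRNA}]}, \qquad k_d = \frac{\ln 2}{t_{1/2}} $$

[Figure: transcription produces [mRNA] and translation produces [Protein]; mRNA decay and protein turnover remove them. Synthesis rates are labelled k_s and degradation rates k_d.]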
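The synthesis and degradation model above has a closed-form solution when [mRNA] is held constant, which is the simplifying assumption of the following sketch. The function names and the numbers in the example are illustrative, not taken from the document.

```python
import math


def degradation_rate(half_life):
    """k_d = ln(2) / t_half, for first-order decay."""
    return math.log(2) / half_life


def steady_state_protein(k_s, k_d, mrna):
    """Protein level at which synthesis k_s*[mRNA] balances decay k_d*[Protein]."""
    return k_s * mrna / k_d


def protein_at(t, k_s, k_d, mrna, p0=0.0):
    """Solution of d[P]/dt = k_s*[mRNA] - k_d*[P] with constant [mRNA] and [P](0) = p0."""
    p_inf = steady_state_protein(k_s, k_d, mrna)
    return p_inf + (p0 - p_inf) * math.exp(-k_d * t)


# Hypothetical values: after one half-life, a protein starting from zero
# has reached half of its steady-state level.
k_d = degradation_rate(2.0)
halfway = protein_at(2.0, k_s=1.0, k_d=k_d, mrna=3.0)
```

Rearranging the steady-state expression gives k_s = k_d [Protein]/[mRNA], so a translation rate can be estimated from measured abundances and a protein half-life.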
[Figure, panels A to D: bias towards reuse (standard deviations) against distance between codons (number of intervening amino acids).]
[Figure, panels A to C: frequency histograms, with the x-axis in standard deviations.]
[Figure labels: slow translation (GFP1) and rapid translation (GFP2).]
[Figure: percent deviation from expected against distance between codons (number of intervening amino acids, 0 to 50), one panel each for alanine, arginine, glycine, isoleucine, leucine, proline, serine, threonine, valine and all amino acids combined. Series: normal autocorrelation, shuffled within gene, shuffled within genome.]
[Figure: percent deviation from expected against distance between codons (number of intervening amino acids, 0 to 30), one panel each for S. cerevisiae, C. glabrata, D. melanogaster, A. gossypii, A. thaliana, H. sapiens, S. pombe and C. elegans.]
[Figure: the genetic code as two mappings. Anticodon-codon mapping links 61 mRNA codons to 23-45 tRNAs; AA-tRNA charging links the tRNAs to 20 amino acids.]
[Figure, panels A and B: anticodon-codon base pairing, including an A(I) anticodon position.]
π t e E j e ij =1 E T j t ij =1 T α β
π = [π 1,π 2 ] λ = {E,T,π} O P (O λ) P = P (O i λ), i P (O λ) P (λ O) P (M) P (O)
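[Figure: two-state hidden Markov model for alanine. The hidden states are tRNA AGC and tRNA UGC, with transition probabilities t_11, t_12, t_21 and t_22; each state emits the codons GCU, GCC, GCA and GCG with emission probabilities e_11 to e_14 and e_21 to e_24.]

t π e x i i x i = c i n c c,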
n c i n c = r c t r, t r c i c n c =( 1 /4)/( 1 /4 + 1 /3) = 3 /7 x i i i γx 2 i + ɛ i γ ɛ i Z =(X E[X])/σ X
tRNA gene copy numbers: tRNA-Ala AGC 11, tRNA-Ala UGC 5.

Codon   Frequency   Read by AGC   Read by UGC
GCU     58952       1             0
GCC     35580       1             0
GCA     47988       0             1
GCG     18336       0             1

Codons read by tRNA-Ala UGC: 47988 + 18336 = 66324. Codons read by tRNA-Ala AGC: 58952 + 35580 = 94532. Regression: R² = 0.9995, p = 0.0102.

γx 2 + e s X s = i C ij X ij, i j, C ij +1 1 j
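The alanine totals above follow from summing codon frequencies over the codons each tRNA reads. The sketch below reproduces that arithmetic with the numbers given in the example; the variable and function names are assumptions made for illustration.

```python
# Codon frequencies and tRNA gene copy numbers from the alanine example.
codon_frequency = {"GCU": 58952, "GCC": 35580, "GCA": 47988, "GCG": 18336}
trna_copies = {"AGC": 11, "UGC": 5}

# Codon-to-tRNA reading matrix from the example, written as a mapping.
reads = {"AGC": ["GCU", "GCC"], "UGC": ["GCA", "GCG"]}


def frequency_read_by(anticodon):
    """Total frequency of the codons read by the tRNA with this anticodon."""
    return sum(codon_frequency[c] for c in reads[anticodon])


totals = {anticodon: frequency_read_by(anticodon) for anticodon in reads}
# totals == {"AGC": 94532, "UGC": 66324}

# Pairs of (codon frequency read, tRNA gene copy number): the points that
# enter the regression of copy number on codon frequency.
points = [(totals[a], trna_copies[a]) for a in sorted(totals)]
```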
Leading codon (rows) against consecutive codon (columns), alanine:

        GCU     GCC     GCA     GCG
GCU     11.0    1.3     -8.7    -6.9
GCC     0.8     6.8     -6.4    -1.4
GCA     -8.2    -6.2    11.5    5.4
GCG     -7.1    -0.8    4.8     5.7

s n =(s s)/( s s)
$$ \hat{p} = \frac{r + 1}{n + 1} $$
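An estimate of the form (r + 1)/(n + 1) is commonly used for empirical p-values from randomization tests. The sketch below assumes that n is the number of randomized samples and r the number of those whose score is at least as large as the observed one; this reading of n and r is an assumption, since their definitions are not recoverable here. The function name `empirical_p` is hypothetical.

```python
def empirical_p(observed, null_scores):
    """Empirical p-value (r + 1) / (n + 1).

    observed: score of the real data
    null_scores: scores of n randomized data sets
    r counts the randomized scores that are >= observed (one-sided test).
    """
    n = len(null_scores)
    r = sum(1 for s in null_scores if s >= observed)
    # Adding 1 to numerator and denominator keeps the estimate above zero.
    return (r + 1) / (n + 1)


p_hat = empirical_p(3, [1, 2, 3, 4])   # r = 2, n = 4, so p_hat = 3/5
```

With this estimator the smallest attainable value is 1/(n + 1), so the number of randomizations bounds the resolution of the test.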
[Figure: normalized score (0.0 to 1.0) for the CC, REG and HMM methods.]
Number of predictions Diffr. to random +/- HMM 428-115 REG 419 +132 205 +26 412 +125 119-168 CC
[Figure: genetic code table (1st, 2nd and 3rd codon position: T, C, A, G) annotated with the tRNA anticodons that read each codon box, including modified nucleosides such as I, Gm, Cm, m5C, ncm5U, mcm5U and mcm5s2U.]
[Figure: anticodon first-position bases (A, G, U, C) against pyrimidine- and purine-ending codons, with Ile, 4-box and Gly cases marked.]
[Figure: pipeline. All Pairs -> All x All Comparison -> Candidate Pairs -> Formation of Stable Pairs -> Stable Pairs -> Verification of Stable Pairs -> Verified Pairs (with Broken Pairs set aside) -> Clustering of Orthologs -> Group Pairs and Orthologous Groups.]
l ( a 1, a 2 ) >l ( s 1, s 2 ) a 1 a 2 s 1 s 2 l d d + d
[Figure: effect of the length tolerance l (0.5 to 1) on the number of orthologous relations (in units of 10^6), the fraction of genes with the same number of domains (domain test, 90 to 100 %) and the fraction of genes that pass the triangle test (99.80 to 100 %).]
No Tolerance Tolerance Score BBH RBH Distance RSD SP i, i j, j d j d >k σ 2 (d j d ) d i d >k σ 2 (d i d ) d k σ 2 (d j d )=σ 2 (d j )+σ 2 (d ) (d j,d ) k k
[Figure, panels A to D: relationships among sequences x, y1, y2 and z; panel D is annotated with d > 0.]
d d = d + d + d + d d d > 0 k k 1 2 d d d d
[Figure, panel A: fraction of SP passing the test (88 to 90 %) against SP tolerance (1.4 to 2.4), with curves for l = 0.50, 0.55, 0.60, 0.65 and 0.70. Panels B and C: configurations of sequences x1, x2, y1, y2, z1 and z2 with the distances d(x1, z1), d(x1, z2), d(y2, z1) and d(y2, z2).]
[Figure: fraction of VP passing the test (95.8 to 97.2 %) against VP tolerance (0.5 to 2.5), with curves for (l = 0.61, k_SP = 1.81), (l = 0.72, k_SP = 1.67) and (l = 0.58, k_SP = 1.96).]
[Figure, panels A and B: graph on the sequences w1, x1, y2, z1 and z2 with edge weights between 200 and 1000.]
( n 2)
[Figure: relative amount (%) of paralogs and orthologs for each type of connection (CP, SP, VP, GP). Abbreviations: AP, CP, SP, VP, GP; BP = SP minus VP.]
i,j j i
[Figure: number of members (log scale, 10^2 to 10^5) against group size.]

Class         Genomes   Orthologs   Ave. group size
All           550       302596      5.52
Bacteria      444       145255      7.20
Firmicutes    116       28109       7.67
Eukaryota     72        157302      4.11
Archaea       51        15622       4.32
Vertebrates   32        80123       5.46
Mammalia      25        58982       5.75
[Figure: genetic code table indexed by 1st, 2nd and 3rd codon position (T, C, A, G), with amino acids shaded by codon family size: 6-box, 4-box, 3-box, 2-box and 1-box. Met also serves as the initiation codon.]
[Figure: circular representation of the genetic code (bases A, U, C, G read from the centre outwards) with the amino acids and stop signals on the outer ring.]
[Figure: codon table by 1st and 2nd position showing the one-letter amino acid assignments ($ = stop) that differ among numbered genetic codes (1, 2, 3, 4, 5, 6, 9, 10, 12, 13, 14, 15, 16, 21, 22, 23).]
[Figure: chemical structures of the bases with nucleotide ambiguity codes: R purine (A adenine, G guanine), Y pyrimidine (C cytosine, T thymine), S strong, W weak, M amino, K keto, H not-G, V not-T, D not-C, B not-A, N any.]
Electrically charged side chains
  Positive: Arginine, Histidine, Lysine
  Negative: Aspartic acid, Glutamic acid
Polar uncharged side chains: Serine, Threonine, Asparagine, Glutamine
Special cases: Cysteine, Selenocysteine, Glycine, Proline, Pyrrolysine
Hydrophobic side chains: Alanine, Valine, Leucine, Isoleucine, Methionine, Phenylalanine, Tyrosine, Tryptophan
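[Figure: computing the TPI for an example with 4 valine, 4 arginine and 5 alanine codons. Count the number of changes, calculate the distribution of the number of changes, and take TPI = L - R, where L and R are the probabilities on either side of the observed count (L + R = 1).]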
[Figure, panels A to C: GFP constructs (GFP1, GFP1GFP2, GFP2GFP2, GFP2GFP1) by TPI construct; gel intensity (arbitrary units) against position on gel for the GFP and 2GFP products; velocity ratio of correlated versus anti-correlated constructs (0.5 to 1.5).]
[Figure: construction of the observation sequence.
  Amino acid sequence: MGCANLVSRLENNSRLLNRDLIAVTIGAIVYKDPHAGALRS...
  Subsequence of consecutive synonymous codons: GCA GCA GCT GCG GCC...
  Observable output sequence: 1, 1, 4, 3, 2,...
  Count matrix of consecutive codons.]
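The steps in this figure (extract the codons of one amino acid from a coding sequence, then count which codon follows which) can be sketched as below. The function names and the short coding sequence are assumptions made for illustration; the codon list in the second example is the one shown in the figure.

```python
from collections import Counter

ALA = {"GCT", "GCC", "GCA", "GCG"}   # synonymous codons of alanine


def synonymous_subsequence(cds, family):
    """Codons of one amino acid, in order of appearance in a coding sequence."""
    usable = len(cds) - len(cds) % 3          # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    return [c for c in codons if c in family]


def consecutive_counts(codons):
    """Count matrix of consecutive codons as a Counter keyed by (leading, next)."""
    return Counter(zip(codons, codons[1:]))


# Hypothetical coding sequence: ATG GCA AAA GCA GCT
sub = synonymous_subsequence("ATGGCAAAAGCAGCT", ALA)    # ['GCA', 'GCA', 'GCT']

# Codon subsequence from the figure.
counts = consecutive_counts(["GCA", "GCA", "GCT", "GCG", "GCC"])
```

A Counter keyed by codon pairs is a sparse form of the count matrix; pairs that never occur simply return 0.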
[Figure: tRNA gene copy number against codon frequency, one regression per amino acid. Labelled codons are listed as they appear in each panel.]

Amino acid      R squared   p-value   Labelled codons
Alanine         0.9993      0.0118    GCA, GCT
Arginine        0.9303      0.0052    CGT, CGG, AGG, AGA
Glutamine       0.9755      0.0706    CAG, CAA
Glutamic acid   0.9962      0.0277    GAG, GAA
Glycine         0.9238      0.0257    GGA, GGG, GGC
Isoleucine      1           0.0016    ATA, ATT
Leucine         0.9179      0.0066    CTA, CTC, TTG, TTA
Lysine          0.7774      0.2165    AAG, AAA
Proline         0.9997      0.0073    CCT, CCA
Serine          0.9581      0.0024    TCA, AGC, TCG, TCT
Threonine       0.9963      0.0013    ACA, ACG, ACT
Valine          0.9978      7e-04     GTG, GTA, GTT
Alanine
        GCT     GCC     GCA     GCG
GCT     11      0.8     -8.2    -7
GCC     1.3     6.8     -6.4    -1.4
GCA     -8.7    -6.3    11.6    5.4
GCG     -6.9    -0.9    4.8     5.8

Arginine
        CGT     CGC     CGA     CGG     AGA     AGG
CGT     13.4    2.5     -2.5    -3.7    -1.7    -6.8
CGC     3.4     8.5     2.1     9       -8.4    -0.1
CGA     -3.3    4.6     9.3     7.5     -7.1    2
CGG     -0.7    5.3     4.9     8.7     -7.1    1.5
AGA     -3.5    -8.8    -7      -8      12      -0.9
AGG     -5.5    1.4     3.6     1.8     -3      5.1

Glycine
        GGT     GGC     GGA     GGG
GGT     26.8    -11.8   -17.6   -9.2
GGC     -11.1   7.8     5.2     3.9
GGA     -17.2   6.2     14      5.2
GGG     -10.5   3.8     7.3     5.3

Proline
        CCT     CCC     CCA     CCG
CCT     3.7     0.2     -2.8    -1
CCC     1.3     6.7     -7      2.8
CCA     -4.5    -7.1    11      -3.9
CCG     0.5     4.7     -6.4    5.3

Leucine
        CTT     CTC     CTA     CTG     TTA     TTG
CTT     7.9     4.6     -1.6    -4.1    0.6     -4.5
CTC     4.3     10.6    -1.4    4.4     -5.1    -4.6
CTA     -0.2    -0.3    4       0.8     0.9     -4
CTG     1       3.1     1.8     9.7     -6      -3.6
TTA     -0.3    -3.1    1.9     -3.5    7.4     -4.8
TTG     -7.7    -6.9    -4.3    -2.5    -2.1    15.5

Serine
        TCT     TCC     TCA     TCG     AGT     AGC
TCT     12.6    6       -0.5    -2.8    -9.7    -11
TCC     5.4     7.3     -4      -1.7    -3.7    -5.2
TCA     -4.3    -3.5    9.4     2.3     -0.3    -4
TCG     -4.8    0.4     1.2     6.9     -2.9    2.3
AGT     -7.1    -5.8    -1.8    -2.4    10.5    9.7
AGC     -6.3    -6.3    -6.4    -0.5    9.7     14.5

Threonine
        ACT     ACC     ACA     ACG
ACT     6.1     4.9     -6.7    -5.6
ACC     3.6     7.1     -6.5    -4.6
ACA     -5.5    -8      8.6     5.6
ACG     -5.8    -4.4    5.6     6.1

Valine
        GTT     GTC     GTA     GTG
GTT     9       2.1     -6.5    -7.5
GTC     2       7.5     -6.8    -3.2
GTA     -6.5    -6.6    10.9    4.1
GTG     -7.3    -3.6    4.3     9.2
[Figure: scores (0.0 to 1.0) of the HMM, REG and CC methods, one panel each for alanine, arginine, glutamine, glutamic acid, glycine, isoleucine, leucine, lysine, proline, serine, threonine and valine.]