ANALYSIS OF A CIRCULAR CODE MODEL




Jérôme Lacan and Christian J. Michel *

Laboratoire d'Informatique de Franche-Comté, UNIVERSITE DE FRANCHE-COMTE, IUT de Belfort-Montbéliard, 4 Place Tharradin - BP 71427, 25211 Montbéliard Cedex, France
Email: jerome.lacan@pu-pm.univ-fcomte.fr

* Equipe de Biologie Théorique, UNIVERSITE LOUIS PASTEUR STRASBOURG, Pôle API, Boulevard Sébastien Brant, 67400 ILLKIRCH - STRASBOURG, FRANCE
Email: michel@dpt-info.u-strasbg.fr

* Corresponding author

ABSTRACT

A circular code has been identified in the protein (coding) genes of both eukaryotes and prokaryotes by using a statistical method called the Trinucleotide Frequency method (TF method) [Arquès & Michel (1996) J. Theor. Biol. 182, 45-58]. Recently, a probabilistic model based on the nucleotide frequencies, with a hypothesis of absence of correlation between successive bases on a DNA strand, has been proposed by Koch & Lehmann [(1997) J. Theor. Biol. 189, 171-174] for constructing some particular circular codes. Their interesting method, which we call here the Nucleotide Frequency method (NF method), reveals several limits for constructing the circular code observed with protein genes.

1 INTRODUCTION

This section is divided into 2 parts. The first part summarizes the results on the circular code $X_0$ identified in the protein genes of both eukaryotes and prokaryotes. The second part recalls the probabilistic model of Koch & Lehmann (1997) based on the Nucleotide Frequency method (NF method).

1.1. The circular code $X_0$

The concept of a code "without commas", introduced by Crick et al. (1957) for the protein (coding) genes, is a code readable in only one out of three frames. Such a theoretical code without commas, called a circular code in the theory of codes (e.g. Béal, 1993; Berstel & Perrin, 1985), is a particular set X of trinucleotides such that a concatenation (a series) of trinucleotides of X leads to sequences which cannot be decomposed in another frame into a concatenation of trinucleotides of X.

For example, suppose that X is the following set of 20 trinucleotides:

X = {AAC, AAT, ACC, ATC, ATT, CAG, CTC, CTG, GAA, GAC, GAG, GAT, GCC, GGC, GGT, GTA, GTC, GTT, TAC, TTC}.

Some trinucleotides of X are randomly concatenated, for example as follows:

CAG,GCC,TTC,AAT,ACC,ACC,CAG,GAA,GAG,GTA,ATT,ACC,AAT,GTA,AAC,TAC,TTC,ACC,ATC

The commas between the trinucleotides show the frame of construction (the reading frame in biology). Suppose now that the commas are lost, leading to the sequence:

CAGGCCTTCAATACCACCCAGGAAGAGGTAATTACCAATGTAAACTACTTCACCATC

The problem is to retrieve the original frame of construction. There are 3 obvious possibilities:

C,AGG,CCT,TCA,ATA,CCA,CCC,AGG,AAG,AGG,TAA,TTA,CCA,ATG,TAA,ACT,ACT,TCA,CCA,TC
CA,GGC,CTT,CAA,TAC,CAC,CCA,GGA,AGA,GGT,AAT,TAC,CAA,TGT,AAA,CTA,CTT,CAC,CAT,C
CAG,GCC,TTC,AAT,ACC,ACC,CAG,GAA,GAG,GTA,ATT,ACC,AAT,GTA,AAC,TAC,TTC,ACC,ATC

If the set X of trinucleotides is a circular code, then there is a unique solution:

CAG,GCC,TTC,AAT,ACC,ACC,CAG,GAA,GAG,GTA,ATT,ACC,AAT,GTA,AAC,TAC,TTC,ACC,ATC

This unique solution is obtained by choosing a (sufficiently large) window at any position in the sequence and then verifying whether the trinucleotides of the window belong to X. In the three lines below, the window starts after the space:

CAGGCCTTCAATACCACCCAGGAAG AGG,TAATTACCAATGTAAACTACTTCACCATC
CAGGCCTTCAATACCACCCAGGAAG A,GGT,AAT,TAC,CAA,TGTAAACTACTTCACCATC
CAGGCCTTCAATACCACCCAGGAAG AG,GTA,ATT,ACC,AAT,GTA,AAC,TAC,TTC,ACC,ATC

The first decomposition is rejected immediately, as the first trinucleotide AGG in the window does not belong to X. The second decomposition is rejected with a window of 13 nucleotides. Indeed, the first nucleotide A in the window may belong to several trinucleotides of X, e.g. GTA. The trinucleotides GGT, AAT and TAC following A belong to X. The next trinucleotide CAA does not belong to X, as the 13th nucleotide A (from the beginning of the window) differs from the unique possibility G of CAG belonging to X. The third decomposition is the original one, as all the trinucleotides in the window belong to X. The original decomposition of the sequence is then deduced automatically.

Such a code was proposed by Crick et al. (1957) in order to explain how the reading of a series of nucleotides in the protein genes could code for the amino acids constituting the proteins. The problems stressed were: why are there more trinucleotides than amino acids, and how is the reading frame chosen? Crick et al. (1957) then proposed that only 20 among the 64 trinucleotides code for the 20 amino acids.
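The frame retrieval described above can be sketched in a few lines of Python. This is only an illustrative check on the example sequence, not the windowing procedure of the original paper: the function name `frames_in_code` is hypothetical, and it tests whole-sequence decompositions rather than a minimal window.

```python
# The 20 trinucleotides of the set X given in the text.
X0 = {
    "AAC", "AAT", "ACC", "ATC", "ATT", "CAG", "CTC", "CTG", "GAA", "GAC",
    "GAG", "GAT", "GCC", "GGC", "GGT", "GTA", "GTC", "GTT", "TAC", "TTC",
}

# The example concatenation from the text, with its commas removed.
EXAMPLE = "".join(
    "CAG,GCC,TTC,AAT,ACC,ACC,CAG,GAA,GAG,GTA,"
    "ATT,ACC,AAT,GTA,AAC,TAC,TTC,ACC,ATC".split(",")
)


def frames_in_code(seq, code):
    """Return the offsets (0, 1, 2) for which every complete
    trinucleotide of `seq`, read from that offset, belongs to `code`."""
    valid = []
    for offset in range(3):
        triplets = [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]
        if all(t in code for t in triplets):
            valid.append(offset)
    return valid


print(frames_in_code(EXAMPLE, X0))
```

Only the construction frame (offset 0) survives the membership test; the two shifted readings contain trinucleotides such as AGG or CTT that are not in the set.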

However, the determination of a set of 20 trinucleotides forming a circular code X depends on a great number of constraints:

(i) A trinucleotide with identical nucleotides (AAA, CCC, GGG or TTT) must be excluded from such a code. Indeed, the concatenation of AAA with itself does not allow the retrieval of the reading (original) frame, as there are 3 possible decompositions: ...AAA,AAA,AAA,..., ...A,AAA,AAA,AA... and ...AA,AAA,AAA,A...

(ii) Two trinucleotides related by circular permutation, e.g. ATC and TCA, must be excluded from such a code. Indeed, the concatenation of ATC with itself does not allow the retrieval of the reading (original) frame, as there are 2 possible decompositions: ...ATC,ATC,ATC,... and ...A,TCA,TCA,TC...

Therefore, by excluding AAA, CCC, GGG and TTT and by gathering the 60 remaining trinucleotides in 20 classes of 3 trinucleotides such that, in each class, the 3 trinucleotides are deduced from each other by circular permutations, e.g. ATC, TCA and CAT, a circular code has only one trinucleotide per class and therefore contains at most 20 trinucleotides (maximal circular code). This trinucleotide number is identical to the amino acid number, leading to a circular code assigning one trinucleotide per amino acid.

No set of 20 trinucleotides leading to a circular code has been found at this time. Furthermore, the discoveries that the trinucleotide TTT, an "excluded" trinucleotide in the concept of circular code, codes for phenylalanine (Nirenberg & Matthaei, 1961) and that the protein genes are placed in the reading frame with a particular trinucleotide, namely the start trinucleotide ATG, have led to the concept of circular code on the alphabet {A,C,G,T} being given up. For several biological reasons, in particular the interaction between mRNA and tRNA, the concept of circular code was resumed later on the alphabet {R,Y} (R = purine = A or G, Y = pyrimidine = C or T) with 2 trinucleotide models for the primitive protein genes: RRY (Crick et al., 1976) and RNY (N = R or Y) (Eigen & Schuster, 1978).
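The counting argument above (60 trinucleotides gathered into 20 classes of 3, hence at most 20 trinucleotides per circular code) can be checked with a short enumeration. The helper name `rotation_class` is an illustrative choice, not a term from the paper.

```python
from itertools import product


def rotation_class(triplet):
    """Return the set of circular permutations of a trinucleotide."""
    return frozenset(
        {triplet, triplet[1:] + triplet[0], triplet[2:] + triplet[:2]}
    )


all_triplets = ["".join(p) for p in product("ACGT", repeat=3)]

# Constraint (i): drop AAA, CCC, GGG and TTT.
remaining = [t for t in all_triplets if len(set(t)) > 1]

# Constraint (ii): group the rest by circular permutation.
classes = {rotation_class(t) for t in remaining}

print(len(all_triplets), len(remaining), len(classes))
# One trinucleotide per class: number of candidate maximal sets.
print(3 ** len(classes))
```

The last value is only the number of ways of picking one trinucleotide per class; most of these selections are not circular codes.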
Unexpectedly, a maximal circular code has recently been identified in the protein genes of both eukaryotes and prokaryotes on the alphabet {A,C,G,T} (Arquès & Michel, 1996). This circular code has been obtained by 2 methods:

(i) by computing the occurrence frequencies of the 64 trinucleotides AAA,...,TTT in the 3 frames of protein genes and then by assigning each trinucleotide to the frame associated with its highest frequency (Arquès & Michel, 1996). This Trinucleotide Frequency method is called the TF method.

(ii) by computing the 12288 ($3 \times 64^2$) autocorrelation functions analysing the probability that a trinucleotide in any frame occurs $i$ bases N (N being any base) after a trinucleotide in a given frame of protein genes and then by classifying these autocorrelation functions according to their modulo 3 periodicity in order to deduce a frame for each trinucleotide (Arquès & Michel, 1997a).

The maximal circular code identified is the set $X_0$ = {AAC, AAT, ACC, ATC, ATT, CAG, CTC, CTG, GAA, GAC, GAG, GAT, GCC, GGC, GGT, GTA, GTC, GTT, TAC, TTC} of 20 trinucleotides in frame 0 of protein genes (reading frame). Furthermore, the sets $X_1$ and $X_2$ of 20 trinucleotides identified in the frames 1 and 2 respectively (frames 1 and 2 being the frame 0 shifted by 1 and 2 nucleotides respectively in the 5'-3' direction) by these 2 methods are also maximal circular codes (Table 1a).

These 3 circular codes have several important properties:

(i) circularity: $X_0$ generates $X_1$ by one circular permutation and $X_2$ by another circular permutation (1 and 2 circular permutations of each trinucleotide of $X_0$ lead to the trinucleotides of $X_1$ and $X_2$ respectively) (Table 1b).

(ii) complementarity: $X_0$ is self-complementary (10 trinucleotides of $X_0$ are complementary to the 10 other trinucleotides of $X_0$) and $X_1$ and $X_2$ are complementary to each other (the 20 trinucleotides of $X_1$ are complementary to the 20 trinucleotides of $X_2$) (Table 1c). Note that this property is also verified with $T_0 = X_0 \cup \{AAA, TTT\}$, $T_1 = X_1 \cup \{CCC\}$ and $T_2 = X_2 \cup \{GGG\}$ (Table 1c).

(iii) rarity: the occurrence probability of $X_0$ is equal to $6 \times 10^{-8}$. As there are 20 classes of 3 trinucleotides (see above), the number of potential circular codes is $3^{20} = 3486784401$. The computed number of complementary circular codes with 2 shifted circular codes (called $C^3$ codes), such as $X_0$, is 216. Therefore, its probability is $216/3^{20} = 6 \times 10^{-8}$.

(iv) flexibility:

- the lengths of the minimal windows to retrieve automatically the frames 0, 1 and 2 with the 3 circular codes $X_0$, $X_1$ and $X_2$ respectively are all equal to 13 nucleotides and represent the largest window length among the 216 $C^3$ codes.

- the frequency of misplaced trinucleotides in the shifted frames is equal to 24.6%. If the trinucleotides of $X_0$ are randomly concatenated, for example as follows:

GAA,GAG,GTA,GTA,ACC,AAT,GTA,CTC,TAC,TTC,ACC,ATC

then the trinucleotides in frame 1:

G,AAG,AGG,TAG,TAA,CCA,ATG,TAC,TCT,ACT,TCA,CCA,TC

and the trinucleotides in frame 2:

GA,AGA,GGT,AGT,AAC,CAA,TGT,ACT,CTA,CTT,CAC,CAT,C

mainly belong to $X_1$ and $X_2$ respectively. A few trinucleotides are misplaced in the shifted frames. With this example, in frame 1, 9 trinucleotides belong to $X_1$, 1 trinucleotide (TAC) to $X_0$ and 1 trinucleotide (TAA) to $X_2$. In frame 2, 8 trinucleotides belong to $X_2$, 2 trinucleotides (GGT, AAC) to $X_0$ and 1 trinucleotide (ACT) to $X_1$. By computing exactly, the average frequencies of misplaced trinucleotides in frame 1 are 11.9% for $X_0$ and 12.7% for $X_2$. In frame 2, the average frequencies of misplaced trinucleotides are 11.9% for $X_0$ and 12.7% for $X_1$. The complementarity property explains, on the one hand, the frequency equality of $X_0$ in frames 1 and 2 and, on the other hand, the frequency equality of $X_2$ in frame 1 and $X_1$ in frame 2. The sum of percentages of misplaced trinucleotides in frame 1 ($X_0$ and $X_2$) is equal to the sum of percentages of misplaced trinucleotides in frame 2 ($X_0$ and $X_1$) and is equal to 24.6%. This value is close to the highest frequency (27.9%) of misplaced trinucleotides among the 216 $C^3$ codes.

- the 4 types of nucleotides occur in the 3 trinucleotide sites with the 3 circular codes $X_0$, $X_1$ and $X_2$ (Table 1a).

(v) evolutionary: an evolutionary analytical model with 3 parameters (p, q, t), based on an independent mixing of the 20 trinucleotides of $X_0$ with equiprobability (1/20) followed by $t \approx 4$ substitutions per trinucleotide according to the proportions $p \approx 0.1$, $q \approx 0.1$ and $r = 1-p-q \approx 0.8$ in the 3 trinucleotide sites respectively, retrieves the frequencies of $X_0$, $X_1$ and $X_2$ observed in the 3 frames of protein genes.

The proof that $X_0$, $X_1$ and $X_2$ are circular codes, the detailed explanation of the properties (i-iv) and the different biological consequences, in particular on the 2-letter genetic alphabets, the genetic code and the amino acid frequencies in proteins, are given in Arquès & Michel (1996, 1997a). The property (v) is described in Arquès et al. (1998, 1999).

Note: a non-complementary circular code has recently been identified in the mitochondrial protein genes (Arquès & Michel, 1997b).

1.2. The Nucleotide Frequency method (NF method)

Koch & Lehmann (1997) have recently suggested that the self-complementary circular code $X_0$ observed in protein genes could be explained by a method for generating circular codes from nucleotide frequencies. This method, called here the Nucleotide Frequency method (NF method), is briefly recalled by keeping the same notations.

Let $p_i(\theta)$ be the occurrence probability of a given base $\theta \in \{A,C,G,T\}$ at position $i \in \{1,2,3\}$ in a trinucleotide (triplet) observed in a DNA strand read in frame 0. By supposing that there is no correlation between successive bases on a DNA strand, the probability of finding the triplet $\alpha\beta\gamma$ in the frame 0 is given by the product $p_1(\alpha)p_2(\beta)p_3(\gamma)$ (independent probabilities). The belonging of the triplet $\alpha\beta\gamma$ to a preferential set $Y_0$ of triplets in frame 0 is then equivalent to the following 2 probability inequalities:

$$p_1(\alpha)p_2(\beta)p_3(\gamma) > p_1(\gamma)p_2(\alpha)p_3(\beta) \qquad (1)$$

and

$$p_1(\alpha)p_2(\beta)p_3(\gamma) > p_1(\beta)p_2(\gamma)p_3(\alpha) \qquad (2)$$

Similar probability inequalities imply that the triplet $\beta\gamma\alpha$ (resp. $\gamma\alpha\beta$) belongs to the preferential set $Y_1$ (resp. $Y_2$) of triplets in frame 1 (resp. 2). Koch & Lehmann (1997) prove that a preferential set generated from any set of probabilities $p_i(\theta)$ with this method is a circular code.

Koch & Lehmann (1997) also show that, if the probabilities $p_i(\theta)$ verify the relation

$$p_1(\theta) = p_3(C(\theta)) \quad \text{and} \quad p_2(\theta) = p_2(C(\theta)) \qquad (3)$$

where $C(\theta)$ denotes the complementary base of $\theta$, then the circular code $Y_0$ is necessarily self-complementary and the 2 permuted circular codes $Y_1$ and $Y_2$ are complementary (called $C^3$ codes in Arquès & Michel, 1996).

The table in Koch & Lehmann (1997) gives the observed nucleotide frequencies $p_i(\theta)$ of a base $\theta \in \{A,C,G,T\}$ at position $i \in \{1,2,3\}$ of the reading frame for the prokaryotes. These data have been obtained from the 44th release of the prokaryotic EMBL database. This table is recalled in this paper as Table 2a.
These probabilities with the NF method lead to a new circular code $Y_0$ = {AAT, AAC, ATT, ATC, ACT, CAC, CTT, CTC, GAA, GAT, GAC, GAG, GTA, GTT, GTC, GTG, GCA, GCT, GCC, GCG}. This code $Y_0$ contains 13 trinucleotides of the code $X_0$ (Table 1a).
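A minimal sketch of the NF construction may help here: a triplet is kept when its product of position-wise probabilities strictly exceeds the products obtained for its two circular permutations. The probability values below are hypothetical placeholders chosen only for illustration (they are not the observed frequencies of Table 2a), and the names `nf_preferential_set` and `P_EXAMPLE` are assumptions of this sketch.

```python
from itertools import product

BASES = "ACGT"

# Hypothetical position-wise probabilities (NOT the observed data);
# index 0, 1, 2 stands for positions 1, 2, 3 of the triplet.
P_EXAMPLE = [
    {"A": 0.28, "C": 0.20, "G": 0.35, "T": 0.17},
    {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},
    {"A": 0.17, "C": 0.35, "G": 0.20, "T": 0.28},
]


def nf_preferential_set(p):
    """Triplets abc whose product p1(a)p2(b)p3(c) is strictly larger
    than the products obtained for the permuted triplets cab and bca."""

    def weight(a, b, c):
        return p[0][a] * p[1][b] * p[2][c]

    code = set()
    for a, b, c in product(BASES, repeat=3):
        w = weight(a, b, c)
        if w > weight(c, a, b) and w > weight(b, c, a):
            code.add(a + b + c)
    return code


Y_EXAMPLE = nf_preferential_set(P_EXAMPLE)
print(sorted(Y_EXAMPLE))
```

Because both comparisons are strict, the resulting set can never contain two circular permutations of the same trinucleotide, nor a trinucleotide with 3 identical bases.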

2 METHOD AND RESULTS

2.1. The Nucleotide Frequency method (NF method) cannot generate the circular code $X_0$

2.1.1. The NF method does not generate a unique self-complementary circular code from the observed probabilities

The approach of Koch & Lehmann (1997) tries to link the self-complementary code $X_0$ and the NF method. However, the code $Y_0$ obtained by the NF method from the observed probabilities $p_i(\theta)$ of a base $\theta \in \{A,C,G,T\}$ at position $i \in \{1,2,3\}$ of the reading frame for the prokaryotes is not self-complementary as, for example, ACT $\in Y_0$ but C(ACT) = AGT $\notin Y_0$. So, this section is devoted to obtaining a self-complementary circular code with the NF method from probabilities close to the observed ones.

If the probabilities $p_i(\theta)$ verify the relation (3), then the circular code computed by the NF method is a self-complementary code. However, the relation (3), which contains 6 probability equalities, namely

$$p_1(A) = p_3(T),\quad p_1(C) = p_3(G),\quad p_1(G) = p_3(C),\quad p_1(T) = p_3(A),\quad p_2(A) = p_2(T),\quad p_2(C) = p_2(G),$$

cannot easily be used with observed probabilities. Koch & Lehmann (1997) have mentioned that the probabilities $p_i(\theta)$ in Table 2a do not precisely verify the relation (3), and then no self-complementary circular code has been proposed.

Furthermore, the NF method generates several self-complementary circular codes if the probabilities of Table 2a are slightly modified in order to verify the relation (3). Three examples of such self-complementary circular codes are presented in Table 2b:

- The first circular code is obtained by imposing the 6 equalities above with observed frequencies from the first and second columns of Table 2a.

- The second circular code is obtained by imposing the 6 equalities above with observed frequencies from the second and third columns of Table 2a.

- The third circular code is obtained with average frequencies from Table 2a: each of the 6 common values is the mean of the two corresponding values used for the first and second codes.

In summary, the NF method is not well adapted to reveal a unique self-complementary circular code. Furthermore, we shall prove in the next section that the NF method cannot generate the self-complementary circular code $X_0$ which has been identified in the protein genes of both eukaryotes and prokaryotes (Arquès & Michel, 1996).
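The complementarity statements of this section can be verified directly on the two sets quoted in the text. In the sketch below, the complement of a trinucleotide is taken as the reversed sequence of complementary bases, which matches the example C(ACT) = AGT; the function names are illustrative.

```python
X0 = {
    "AAC", "AAT", "ACC", "ATC", "ATT", "CAG", "CTC", "CTG", "GAA", "GAC",
    "GAG", "GAT", "GCC", "GGC", "GGT", "GTA", "GTC", "GTT", "TAC", "TTC",
}
Y0 = {
    "AAT", "AAC", "ATT", "ATC", "ACT", "CAC", "CTT", "CTC", "GAA", "GAT",
    "GAC", "GAG", "GTA", "GTT", "GTC", "GTG", "GCA", "GCT", "GCC", "GCG",
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(triplet):
    """Complementary trinucleotide read on the opposite strand,
    e.g. complement("ACT") == "AGT"."""
    return triplet.translate(_COMPLEMENT)[::-1]


def is_self_complementary(code):
    """True when the complement of every trinucleotide is in the code."""
    return all(complement(t) in code for t in code)


print(is_self_complementary(X0))   # X0 is closed under complementation
print(is_self_complementary(Y0))   # Y0 is not: AGT is missing
print(len(X0 & Y0))                # trinucleotides shared by X0 and Y0
```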

2.1.2. Proof that the NF method cannot generate the circular code $X_0$

This section presents a mathematical proof that the NF method cannot generate the circular code $X_0$. The idea of this proof is the following. We take the hypothesis that a circular code $X_0$ containing the 3 triplets $\alpha\beta\gamma$, $\delta\delta\beta$ and $\gamma\alpha\delta$, where $\alpha, \beta, \gamma, \delta \in \{A,C,G,T\}$, is generated by the NF method from the occurrence probabilities $p_i(\theta)$ of a base $\theta \in \{A,C,G,T\}$ at the position $i \in \{1,2,3\}$. Then, this hypothesis is refuted by considering several probability inequalities associated with the 3 triplets considered. As the circular code $X_0$ contains such triplets (ATC, GGT, CAG), $X_0$ cannot be generated by the NF method.

The existence of probabilities $p_i(\theta)$ generating $X_0$ by the NF method is taken as hypothesis. According to the inequality (1) of the NF method, the triplet $\alpha\beta\gamma$ belonging to $X_0$ leads to the following probability inequality:

$$p_1(\alpha)p_2(\beta)p_3(\gamma) > p_1(\gamma)p_2(\alpha)p_3(\beta) \qquad (4)$$

According to the inequality (2) of the NF method, the triplet $\delta\delta\beta$ belonging to $X_0$ leads to the following probability inequality:

$$p_1(\delta)p_2(\delta)p_3(\beta) > p_1(\delta)p_2(\beta)p_3(\delta) \qquad (5)$$

Clearly, $p_1(\delta) > 0$, otherwise the inequality (5) cannot be verified. Therefore, by simplifying (5):

$$p_2(\delta)p_3(\beta) > p_2(\beta)p_3(\delta) \qquad (6)$$

According to the inequality (2) of the NF method, the triplet $\gamma\alpha\delta$ belonging to $X_0$ leads to the following probability inequality:

$$p_1(\gamma)p_2(\alpha)p_3(\delta) > p_1(\alpha)p_2(\delta)p_3(\gamma) \qquad (7)$$

Clearly, $p_3(\delta) > 0$, otherwise the inequality (7) cannot be verified. By rewriting (4) as follows:

$$p_1(\alpha)p_2(\beta)p_3(\gamma) > p_1(\gamma)p_2(\alpha)p_3(\delta)\,\frac{p_3(\beta)}{p_3(\delta)} \qquad (8)$$

and by using (7) with the second member of (8), we obtain:

$$p_1(\alpha)p_2(\beta)p_3(\gamma) > p_1(\alpha)p_2(\delta)p_3(\gamma)\,\frac{p_3(\beta)}{p_3(\delta)} \qquad (9)$$

As $p_1(\alpha) > 0$ and $p_3(\gamma) > 0$ (otherwise the inequality (4) cannot be verified), the inequality (9) can be simplified as follows:

$$p_2(\beta) > p_2(\delta)\,\frac{p_3(\beta)}{p_3(\delta)} \quad \text{i.e.} \quad p_2(\beta)p_3(\delta) > p_2(\delta)p_3(\beta) \qquad (10)$$

The inequality (10) is in contradiction with the inequality (6). Therefore, the hypothesis of existence of probabilities $p_i(\theta)$ generating $X_0$ is refuted.

This proof can be applied to the circular code $X_0$ containing the triplets ATC, GGT and CAG, which follow the pattern $\alpha\beta\gamma$, $\delta\delta\beta$ and $\gamma\alpha\delta$. Therefore, the circular code $X_0$ cannot be generated by the NF method.
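As a numerical illustration of this result (not a substitute for the proof), one can draw many random probability sets and confirm that ATC, GGT and CAG are never preferred simultaneously. The sampling scheme, the seed and the names below are assumptions of this sketch.

```python
import random


def weight(p, t):
    """Product p1(t[0]) * p2(t[1]) * p3(t[2]) for a triplet t."""
    return p[0][t[0]] * p[1][t[1]] * p[2][t[2]]


def is_preferred(p, t):
    """NF criterion: t strictly beats both of its circular permutations."""
    rot_right = t[2] + t[:2]   # gamma alpha beta
    rot_left = t[1:] + t[0]    # beta gamma alpha
    w = weight(p, t)
    return w > weight(p, rot_right) and w > weight(p, rot_left)


def random_distribution(rng):
    """A random probability distribution over the 4 bases."""
    xs = [rng.random() for _ in range(4)]
    total = sum(xs)
    return dict(zip("ACGT", (x / total for x in xs)))


rng = random.Random(0)
hits = 0
for _ in range(20000):
    p = [random_distribution(rng) for _ in range(3)]
    if all(is_preferred(p, t) for t in ("ATC", "GGT", "CAG")):
        hits += 1

print(hits)  # no sampled probability set prefers all three triplets
```

The count stays at zero because multiplying the three required inequalities term by term gives the same product on both sides, which is the contradiction exploited in the proof.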

2.1.3. Development of two algorithms in complement of the proof

The previous section (2.1.2) has proved that the self-complementary circular code $X_0$ cannot be generated by the NF method. This section consists in determining all the self-complementary circular codes which can be generated by this NF method.

The first algorithm A1 developed allows the determination of a set $S_{Y_0}$ of self-complementary circular codes $Y_0$ based on the NF method. The NF method implies the following property for each code $Y_0$ of $S_{Y_0}$: the 2 sets of words obtained by circular permutations of a code $Y_0$ are complementary circular codes (Koch & Lehmann, 1997). Such codes $Y_0$ are called $C^3$ codes (Arquès & Michel, 1996).

The principle of the algorithm A1 consists in varying the probabilities $p_i(\theta)$ of the 4 bases at the 3 positions in the range [0,1] according to the relation (3). For each probability variation step, the algorithm A1 computes a $C^3$ code by using the NF method and tests whether this $C^3$ code has been previously generated. Indeed, several sets of probabilities $p_i(\theta)$ can lead to the same $C^3$ code. By varying the probabilities $p_i(\theta)$ with steps becoming smaller and smaller, the number of $C^3$ codes $Y_0$ in $S_{Y_0}$ remains constant and equal to 88. These 88 codes $Y_0$ are listed in Table 3.

The algorithm A1 generates 88 $C^3$ codes $Y_0$. However, the flower automaton method identifies 216 $C^3$ codes (Arquès & Michel, 1996). In order to explain the 216-88=128 remaining $C^3$ codes, we extend the proof (2.1.2) based on the pattern $P_0 = \{\alpha\beta\gamma, \delta\delta\beta, \gamma\alpha\delta\}$ to its 2 circularly permuted patterns $P_1 = \{\beta\gamma\alpha, \delta\beta\delta, \alpha\delta\gamma\}$ and $P_2 = \{\gamma\alpha\beta, \beta\delta\delta, \delta\gamma\alpha\}$. Any circular code containing the pattern $P_0$ cannot be generated by the NF method (proof (2.1.2)). Similarly, the proof (2.1.2) also shows that any circular code containing a circularly permuted pattern $P_1$ or $P_2$ cannot be generated by the NF method.

The algorithm A2 developed determines the codes among the 216 $C^3$ ones which contain at least one of the 3 previous patterns. There are exactly 128 such $C^3$ codes. Therefore, the algorithm A2 confirms the number 88 of $C^3$ codes $Y_0$ determined by the algorithm A1, whatever the probability variation step used.

In summary, the number of $C^3$ codes which can be generated by the Nucleotide Frequency method is exactly 88. It is important to stress that the 128 other $C^3$ codes cannot be generated from any set of probabilities, even probabilities which do not verify the relation (3), as the proof (2.1.2) does not make any hypothesis on the probabilities.
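The pattern test at the heart of the second algorithm can be sketched as follows. This is only a plausible reading of the description above, since the paper does not give the implementation; the function names are hypothetical, and the list of the 216 codes is not reproduced here, so the sketch is applied to $X_0$ alone.

```python
from itertools import product

X0 = {
    "AAC", "AAT", "ACC", "ATC", "ATT", "CAG", "CTC", "CTG", "GAA", "GAC",
    "GAG", "GAT", "GCC", "GGC", "GGT", "GTA", "GTC", "GTT", "TAC", "TTC",
}


def pattern_triplets(a, b, c, d, shift):
    """Triplets of pattern P0 (shift 0), P1 (shift 1) or P2 (shift 2)
    for the bases alpha=a, beta=b, gamma=c, delta=d."""
    p0 = (a + b + c, d + d + b, c + a + d)
    return tuple(t[shift:] + t[:shift] for t in p0)


def contains_excluded_pattern(code):
    """True when the code contains all 3 triplets of P0, P1 or P2
    for some choice of the 4 bases."""
    for a, b, c, d in product("ACGT", repeat=4):
        for shift in range(3):
            if all(t in code for t in pattern_triplets(a, b, c, d, shift)):
                return True
    return False


print(pattern_triplets("A", "T", "C", "G", 0))  # ('ATC', 'GGT', 'CAG')
print(contains_excluded_pattern(X0))
```

Applied to each of the 216 codes, a test of this kind would separate the codes excluded by the proof from the remaining candidates.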

2.2. Remarks on the hypothesis of no correlation between successive bases used in the Nucleotide Frequency method (NF method)

The hypothesis of no correlation between successive bases has been justified by the entropy approach (Koch & Lehmann, 1997). We briefly recall the elementary principles of the entropy.

2.2.1. Method

Let X be a discrete random variable taking the value $a_i \in \{A,C,G,T\}$ with the probability $P(a_i) = \Pr(X = a_i)$. The entropy H(X) of the discrete random variable X can be defined, in a simple approach, as the measure of the average information quantity associated with this variable X, i.e.

$$H(X) = -\sum_{i=1}^{4} P(a_i)\log_2 P(a_i)$$

The entropy H(X) defined for the words of length 1 (nucleotides) is extended to words $w_i = a_1 \ldots a_n$, $i \in \{1,\ldots,4^n\}$, of a given length n as follows:

$$H_n = -\sum_{i=1}^{4^n} P(w_i)\log_2 P(w_i)$$

where $P(w_i)$ is the occurrence probability of the word $w_i$. Note that $H_1 = H(X)$.

As the protein genes are read in the reading frame, the entropy $H_n$ defined for the words of length n is extended to the entropy $H_{n,f}$, $f \in \{0,1,2\}$, computed from the occurrence probabilities $P_f(w_i)$ of the word $w_i = a_1 \ldots a_n$ in the frame f, as follows:

$$H_{n,f} = -\sum_{i=1}^{4^n} P_f(w_i)\log_2 P_f(w_i)$$

Notes:

(i) For a word $w_i$ of length 1 (n=1), there is the obvious relation

$$P_f(w_i) = p_{f+1}(w_i) \qquad (11)$$

where $p_{f+1}(w_i)$ is the probability of a base at the position $(f+1) \in \{1,2,3\}$ in the NF method.

(ii) $H_{3,0}$ can be considered as a classical entropy H(Y) for the discrete random variable Y taking the 64 values in {AAA,...,TTT} in the reading frame.

When the probabilities follow a discrete uniform law, i.e. all the probabilities are equal, then the maxima of the entropy functions $H_n$ and $H_{n,f}$ are reached and are equal to

$$-\sum_{i=1}^{4^n} \frac{1}{4^n}\log_2 \frac{1}{4^n} = \log_2 4^n = 2n \text{ bits}$$

(Cover & Thomas, 1991). Classically, an entropy function is expressed in bits per nucleotide with a maximal value equal to 2 corresponding to a uniform random distribution (Loewenstern & Yianilos, 1999). Then, the introduced functions are normalised as follows (the normalised functions are denoted with an overbar):

$$\bar{H}_n = \frac{1}{n} H_n \qquad (12)$$

$$\bar{H}_{n,f} = \frac{1}{n} H_{n,f} \qquad (13)$$

The two statistical methods presented in Section 1, the Trinucleotide Frequency (TF) method (Arquès & Michel, 1996) and the Nucleotide Frequency (NF) method (Koch & Lehmann, 1997), allow the construction of circular codes from data observed in the coding genes. The circular codes constructed by both methods are sets of trinucleotides in frame 0. The construction of these different codes is based on the occurrence probabilities of the triplets in frame 0. The TF method directly uses these probabilities. In contrast, the NF method assumes the independence between successive bases in order to use the occurrence probabilities of the bases at the different positions in a trinucleotide (triplet) observed in frame 0. The computation of the entropies associated with the 2 models of probabilities will measure the real influence of the hypothesis of non-correlation between successive bases.

The NF method is based on the occurrence probability $p_i(\theta)$ of a given base $\theta \in \{A,C,G,T\}$ at position $i \in \{1,2,3\}$ in a trinucleotide (triplet) observed in frame 0. By assuming the non-correlation between successive bases, the occurrence probability $P_0(\alpha\beta\gamma)$ of the trinucleotide $\alpha\beta\gamma$ in frame 0 is then deduced as the product of individual probabilities, which is equal, by using the relation (11), to

$$P_0(\alpha\beta\gamma) = P_0(\alpha)P_1(\beta)P_2(\gamma) = p_1(\alpha)p_2(\beta)p_3(\gamma)$$

Then, the entropy $H_{NF}$ associated with these probabilities is

$$H_{NF} = -\sum_{\alpha,\beta,\gamma \in \{A,C,G,T\}} p_1(\alpha)p_2(\beta)p_3(\gamma)\log_2\big(p_1(\alpha)p_2(\beta)p_3(\gamma)\big)$$

By assuming the non-correlation between successive bases and by using the relation (11), basic results (Cover & Thomas, 1991) lead to

$$H_{NF} = -\sum_{i=1}^{3}\sum_{\theta \in \{A,C,G,T\}} p_i(\theta)\log_2 p_i(\theta) = -\sum_{f=0}^{2}\sum_{j=1}^{4} P_f(w_j)\log_2 P_f(w_j) = \sum_{f=0}^{2} H_{1,f}$$

The TF method is based on the observed occurrence probabilities of the trinucleotides in the frame 0. Therefore, its entropy $H_{TF}$ is equal to

$$H_{TF} = -\sum_{\alpha,\beta,\gamma \in \{A,C,G,T\}} P_0(\alpha\beta\gamma)\log_2 P_0(\alpha\beta\gamma) = H_{3,0}$$

In order to express the entropies $H_{NF}$ and $H_{TF}$ in bits per nucleotide, the functions are normalized according to (12) and (13):

$$\bar{H}_{NF} = \frac{1}{3} H_{NF}, \qquad \bar{H}_{TF} = \frac{1}{3} H_{TF}$$

Remark: With gene populations containing several millions of nucleotides (e.g. Arquès & Michel, 1996; Koch & Lehmann, 1997), the computed probabilities are stable (law of large numbers). Therefore, the values obtained here from such probabilities lead to a precise approximation of the entropy functions.

2.2.2. Results

The values of these entropies in the prokaryotic protein genes are presented in Table 4. The values of $\bar{H}_1$ (resp. $\bar{H}_3$) are associated with the nucleotides (resp. the trinucleotides) without considering the existence of the reading frame in the prokaryotic protein genes. As expected, these values are close to 2, representing the random situation. The value of $\bar{H}_3$ (1.984 bit per nucleotide) is slightly less than the value of $\bar{H}_1$ (1.998 bit per nucleotide), showing that the basic element of information in the protein genes is the trinucleotide and not the nucleotide.

The value of $\bar{H}_{TF}$ associated with the TF method (Table 4) is significantly lower than the value of $\bar{H}_{NF}$ (1.965 bit per nucleotide) associated with the NF method. The $\bar{H}_{TF}$ value can be compared with the classical estimate of the entropy of coding genes, which is about 1.9 (Loewenstern & Yianilos, 1999). This value of 1.9 can be improved by considering particular sequences or by using specific algorithms, as shown in Table 4 of Loewenstern & Yianilos (1999) for a non-redundant collection of 49 human genes. The improvement of the estimate of the entropy is not the aim of this paper. But the fact that the value of $\bar{H}_{TF}$ corresponds to the classical estimate implies that the probability model used in the TF method can be considered as a correct representation of the structure of the coding genes. In contrast, the value of $\bar{H}_{NF}$ differs significantly from the classical estimate. The hypothesis of independence between successive bases has then a strong effect on the values of the entropies. Therefore, the probability model used in the NF method reveals neither the internal structure of the coding genes nor the occurrence probabilities of the triplets in frame 0.
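The two entropy estimates compared above can be sketched on any sequence read in frame 0. The code below is an illustrative computation on toy data, not a reproduction of the values of Table 4; the function names are assumptions, and the example sequence is the short concatenation of Section 1.1, which is far too small for a meaningful estimate.

```python
from collections import Counter
from itertools import product
from math import log2


def entropy(probabilities):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * log2(p) for p in probabilities if p > 0)


def frame0_triplets(seq):
    """Complete trinucleotides of seq read in frame 0."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


def h_tf(seq):
    """TF-style estimate: entropy of the observed triplet distribution,
    in bits per nucleotide."""
    counts = Counter(frame0_triplets(seq))
    n = sum(counts.values())
    return entropy(c / n for c in counts.values()) / 3


def h_nf(seq):
    """NF-style estimate: sum of the 3 position-wise base entropies
    (independence assumption), in bits per nucleotide."""
    triplets = frame0_triplets(seq)
    total = 0.0
    for position in range(3):
        counts = Counter(t[position] for t in triplets)
        total += entropy(c / len(triplets) for c in counts.values())
    return total / 3


toy = "CAGGCCTTCAATACCACCCAGGAAGAGGTAATTACCAATGTAAACTACTTCACCATC"
uniform = "".join("".join(p) for p in product("ACGT", repeat=3))

print(h_tf(toy), h_nf(toy))
print(h_tf(uniform), h_nf(uniform))
```

On any data the NF-style value is at least as large as the TF-style value, since the entropy of a joint distribution never exceeds the sum of the entropies of its marginals; the gap between the two measures the correlation between the three positions.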
DISCUSSION Koch & Lehmann (997) have proposed a probablstc model for constructng the crcular code observed n the proten genes. Ther method (called here Nucleotde Frequency (NF) method) s based on the nucleotde frequences wth a hypothess of absence of correlaton between successve bases on a DNA strand for deducng a crcular code from the product of the occurrence probabltes of nucleotdes n the postons of trnucleotde read n frame. It allows a smple constructon of some partcular crcular codes but reveals several lmts for constructng the crcular code assocated wth proten genes: () Several self-complementary crcular codes, but not an unque one, are generated by the NF method from the observed probabltes (Secton..). () The self-complementary crcular code X observed n the n the proten genes of both eukaryotes and prokaryotes cannot be generated by the NF method (Secton..). - -

(iii) 88 among 216 self-complementary circular codes can be generated by the NF method (Section..). They are listed in Table 3.

(iv) The hypothesis used in the NF method, namely no correlation between successive bases in the protein genes, is not verified (Section..). This hypothesis was justified by computing the entropy with occurrence probabilities of words of length up to 6 (Koch & Lehmann, 1997). However, any probability model can produce a value of entropy. The choice of the function for revealing the genetic information, in the sense of the information theory defined by Shannon (1949), is very important, as the value of the entropy varies strongly among the functions used. Several examples of different functions estimating the value of the entropy are presented in Chatzidimitriou-Dreismann et al. (1996), Liò et al. (1996), Loewenstern & Yianilos (1999), etc.

In order to evaluate the hypothesis of non-correlation between successive bases, two estimates of the entropy are computed here. The first estimate, associated with the TF method, is based on the 64 occurrence probabilities of triplets in frame. The entropy value H_TF associated with these probabilities is equal to .98 bit per nucleotide and is similar to the classical estimate (.9) of the entropy of coding genes (Loewenstern & Yianilos, 1999). The second estimate, associated with the NF method, is based on the occurrence probabilities of nucleotides in the triplet sites. Together with the hypothesis of non-correlation between successive bases, these nucleotide probabilities allow the occurrence probabilities of triplets in frame to be deduced more simply (with fewer values than the 64 triplet probabilities, but at the cost of a probability hypothesis). However, the resulting entropy value H_NF is equal to .965 bit per nucleotide and differs significantly from H_TF. Therefore, the hypothesis of non-correlation between successive bases is not verified.

4 CONCLUSION

The method introduced by Koch & Lehmann (1997) is a new approach for constructing circular codes.
This NF method constructs in a simple way a subset of circular codes which is included in the set of circular codes generated by the flower automaton method. The NF method has an obvious interest in the field of the theory of codes. In this paper, some new results are presented in this respect, in particular the number of codes generated by the NF method and some patterns of code words excluded by it.

However, the main purpose of the NF method was to explain the circular code X_0 identified in the protein genes of both eukaryotes and prokaryotes (Arquès & Michel, 1996). Several results were presented here concerning the relations between the NF method and the code X_0. The NF method does not generate a unique self-complementary circular code. Furthermore, it cannot generate the code X_0. Finally, the hypothesis of non-correlation between successive bases, which is the basis of the NF method, is rejected, as the different computations of the entropy clearly show that the probabilities used by the NF method do not respect the internal structure of the coding genes. In conclusion, the NF method is not an appropriate model for explaining the circular code X_0.

REFERENCES

Arquès, D.G., Fallot, J.-P., Marsan, L. & Michel, C.J. (1999). An evolutionary analytical model of a complementary circular code. J. Biosystems 49, 83-103.
Arquès, D.G., Fallot, J.-P. & Michel, C.J. (1998). An evolutionary analytical model of a complementary circular code simulating the protein coding genes, the 5' and 3' regions. Bull. Math. Biol. 60, 163-194.
Arquès, D.G. & Michel, C.J. (1996). A complementary circular code in the protein coding genes. J. Theor. Biol. 182, 45-58.
Arquès, D.G. & Michel, C.J. (1997a). A code in the protein coding genes. J. Biosystems 44, 107-134.
Arquès, D.G. & Michel, C.J. (1997b). A circular code in the protein coding genes of mitochondria. J. Theor. Biol. 189, 273-290.
Béal, M.-P. (1993). Codage symbolique. Masson.
Berstel, J. & Perrin, D. (1985). Theory of codes. Academic Press.
Chatzidimitriou-Dreismann, C.A., Streffer, R.M.F. & Larhammar, D. (1996). Lack of biological significance in the linguistic features of non-coding DNA. A quantitative analysis. Nucl. Acids Res. 24, 1676-1681.
Cover, T.M. & Thomas, J.A. (1991). Elements of Information Theory. Wiley.
Crick, F.H.C., Brenner, S., Klug, A. & Pieczenik, G. (1976). A speculation on the origin of protein synthesis. Origins of Life 7, 389-397.
Crick, F.H.C., Griffith, J.S. & Orgel, L.E. (1957). Codes without commas. Proc. Natl. Acad. Sci. 43, 416-421.
Eigen, M. & Schuster, P. (1978). The hypercycle. A principle of natural self-organization. Part C: The realistic hypercycle. Naturwissenschaften 65, 341-369.
Koch, A.J. & Lehmann, J. (1997). About a symmetry of the genetic code. J. Theor. Biol. 189, 171-174.
Liò, P., Politi, A., Buiatti, M. & Ruffo, S. (1996). High statistics block entropy measures of DNA sequences. J. Theor. Biol. 180, 151-160.
Loewenstern, D. & Yianilos, P.N. (1999). Significantly lower entropy estimates for natural DNA sequences. J. Comp. Biol. 6, 125-142.
Nirenberg, M.W. & Matthaei, J.H. (1961). The dependence of cell-free protein synthesis in E. coli upon naturally occurring or synthetic polyribonucleotides. Proc. Natl. Acad. Sci. 47, 1588-1602.
Shannon, C.E. (1949). The mathematical theory of communication. University of Illinois Press.

T_0: AAA AAC AAT ACC ATC ATT CAG CTC CTG GAA GAC GAG GAT GCC GGC GGT GTA GTC GTT TAC TTC TTT
T_1: AAG ACA ACG ACT AGC AGG ATA ATG CCA CCC CCG GCG GTG TAG TCA TCC TCG TCT TGC TTA TTG
T_2: AGA AGT CAA CAC CAT CCT CGA CGC CGG CGT CTA CTT GCA GCT GGA GGG TAA TAT TGA TGG TGT

Table 1a List per frame and in lexicographical order of the trinucleotides of the complementary circular code identified in protein coding genes of eukaryotes and prokaryotes (Arquès & Michel, 1996). Three subsets of 20 trinucleotides can be identified: X_0 = T_0 - {AAA, TTT} in frame 0, X_1 = T_1 - {CCC} in frame 1 and X_2 = T_2 - {GGG} in frame 2. The sets X_0, X_1, X_2 of trinucleotides are maximal circular codes.

X_0: AAC AAT ACC ATC ATT CAG CTC CTG GAA GAC GAG GAT GCC GGC GGT GTA GTC GTT TAC TTC
X_1: ACA ATA CCA TCA TTA AGC TCC TGC AAG ACG AGG ATG CCG GCG GTG TAG TCG TTG ACT TCT
X_2: CAA TAA CAC CAT TAT GCA CCT GCT AGA CGA GGA TGA CGC CGG TGG AGT CGT TGT CTA CTT

Table 1b Circularity property with the circular codes X_0, X_1, X_2 of trinucleotides identified in protein coding genes of eukaryotes and prokaryotes (Table 1a).

T_0: AAA AAC AAT ACC ATC CAG CTC GAA GAC GCC GTA
T_0: TTT GTT ATT GGT GAT CTG GAG TTC GTC GGC TAC

T_1: AAG ACA ACG ACT AGC AGG ATA ATG CCA CCC CCG GCG GTG TAG TCA TCC TCG TCT TGC TTA TTG
T_2: CTT TGT CGT AGT GCT CCT TAT CAT TGG GGG CGG CGC CAC CTA TGA GGA CGA AGA GCA TAA CAA

Table 1c Complementarity property with the circular codes X_0, X_1, X_2 of trinucleotides identified in protein coding genes of eukaryotes and prokaryotes (Table 1a). This property is also verified with T_0 (AAA and TTT) and with T_1 and T_2 (CCC and GGG).
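The circularity and complementarity properties stated in Tables 1b and 1c can be checked mechanically. The following is a minimal sketch: the three sets are copied from Table 1b, while the helper names `shift` and `revcomp` are hypothetical. It only verifies the two properties of the tables; it does not test whether a set is a circular code.

```python
# Trinucleotide sets copied from Table 1b.
X0 = set("AAC AAT ACC ATC ATT CAG CTC CTG GAA GAC GAG GAT GCC GGC "
         "GGT GTA GTC GTT TAC TTC".split())
X1 = set("ACA ATA CCA TCA TTA AGC TCC TGC AAG ACG AGG ATG CCG GCG "
         "GTG TAG TCG TTG ACT TCT".split())
X2 = set("CAA TAA CAC CAT TAT GCA CCT GCT AGA CGA GGA TGA CGC CGG "
         "TGG AGT CGT TGT CTA CTT".split())

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def shift(t):
    """Circular permutation of a trinucleotide: abc -> bca."""
    return t[1:] + t[0]


def revcomp(t):
    """Reverse complement of a trinucleotide (antiparallel strand)."""
    return t.translate(COMPLEMENT)[::-1]


# Circularity (Table 1b): X1 and X2 are obtained by permuting X0.
circularity = ({shift(t) for t in X0} == X1
               and {shift(t) for t in X1} == X2)

# Complementarity (Table 1c): X0 is its own complement, X1 maps onto X2.
complementarity = ({revcomp(t) for t in X0} == X0
                   and {revcomp(t) for t in X1} == X2)

print(circularity, complementarity)
```

Together with AAA, CCC, GGG and TTT, the three sets partition the 64 trinucleotides, which is why each of them contains exactly 20 words.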

Base θ   p_1(θ) p_2(θ) p_3(θ)
A        .76.5.
T        .66.85.68
C        .4.8.68
G        .54.7.4

Table 2a Nucleotide frequencies p_i(θ) at position i in {1, 2, 3} of the reading frame for the prokaryotes (Koch & Lehmann, 1997, Table ).

Circular codes   p_1(A) p_1(C) p_1(G) p_1(T) p_2(A) p_2(C)
AAC AAT ACC AGC ATC ATT CTC GAA GAC GAG GAT GCC GCT GGC GGT GTA GTC GTT TAC TTC  .76.4.54.66.85.5
AAC AAG AAT ATC ATT CAC CAG CTC CTG CTT GAC GAG GAT GCC GGC GTA GTC GTG GTT TAC  .68.4.68..85.5
AAC AAG AAT AGC ATC ATT CAC CTC CTT GAC GAG GAT GCC GCT GGC GTA GTC GTG GTT TAC  .7...94..

Table 2b Three self-complementary circular codes generated by the Nucleotide Frequency method (NF method) with the frequencies of Table 2a modified according to the relation (): p_1(A) = p_3(T), p_1(C) = p_3(G), p_1(G) = p_3(C), p_1(T) = p_3(A), p_2(A) = p_2(T) and p_2(C) = p_2(G).
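How a set of 20 trinucleotides can be read off from position-specific nucleotide frequencies may be sketched as follows. The selection rule used here (keep, in each class of circular permutations, the trinucleotide with the largest product of positional probabilities) is an assumption made for illustration and is not quoted from Koch & Lehmann (1997). The frequency values in `P` are hypothetical, chosen only so that they satisfy the symmetry relation of Table 2b; they are not the values of Table 2a.

```python
from itertools import product

# Hypothetical frequencies (illustrative only) satisfying
# p_1(A)=p_3(T), p_1(C)=p_3(G), p_1(G)=p_3(C), p_1(T)=p_3(A),
# p_2(A)=p_2(T) and p_2(C)=p_2(G).
P = [
    {"A": 0.30, "C": 0.18, "G": 0.36, "T": 0.16},  # position 1
    {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},  # position 2
    {"A": 0.16, "C": 0.36, "G": 0.18, "T": 0.30},  # position 3
]


def nf_probability(t, p=P):
    """Probability of a trinucleotide in frame, assuming independent positions."""
    return p[0][t[0]] * p[1][t[1]] * p[2][t[2]]


def nf_candidate_code(p=P):
    """Keep one trinucleotide per class of circular permutations (assumed rule)."""
    chosen = set()
    for t in map("".join, product("ACGT", repeat=3)):
        permutation_class = {t, t[1:] + t[0], t[2:] + t[:2]}
        if len(permutation_class) < 3:
            continue  # AAA, CCC, GGG, TTT cannot belong to a circular code
        # sorted() makes the choice deterministic if two products are equal
        best = max(sorted(permutation_class), key=lambda u: nf_probability(u, p))
        chosen.add(best)
    return chosen


code = nf_candidate_code()
print(len(code), sorted(code))
```

Containing one word per permutation class is a necessary condition for a maximal circular code, not a sufficient one, so a set produced this way would still need a separate circularity check. When the frequencies satisfy the symmetry relation, a trinucleotide and its reverse complement receive the same product, which is why the selected set is self-complementary in the absence of ties.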

Circular codes   p_1(A) p_1(C) p_1(G) p_1(T) p_2(A) p_2(C)
ACA AGA CCA CGA GCA GCC GGA GGC GTA TAA TAC TCA TCC TCG TCT TGA TGC TGG TGT TTA  .6.6..76.6.44
ACA CCA CGA GAA GCA GCC GGA GGC GTA TAA TAC TCA TCC TCG TGA TGC TGG TGT TTA TTC  .6.6..76.8.
CAA CCA CGA GAA GCA GCC GGA GGC GTA TAA TAC TCA TCC TCG TGA TGC TGG TTA TTC TTG  .6.6..76..
CAA CCA GAA GAC GCA GCC GGA GGC GTA GTC TAA TAC TCA TCC TGA TGC TGG TTA TTC TTG  .6.6..76.4.8
CAA CAC CTC GAA GAC GAG GCA GCC GGC GTA GTC GTG TAA TAC TCA TGA TGC TTA TTC TTG  .6.6..76.48.
CAA CAC GAA GAC GCA GCC GGA GGC GTA GTC GTG TAA TAC TCA TCC TGA TGC TTA TTC TTG  .6.6.8.7.4.8
ACA CCA GAA GAC GCA GCC GGA GGC GTA GTC TAA TAC TCA TCC TGA TGC TGG TGT TTA TTC  .6.6.4.64.4.6
ATC CAA CAC CTC GAA GAC GAG GAT GCA GCC GGC GTA GTC GTG TAA TAC TGC TTA TTC TTG  .6.6..58.48.
ACA ACC GAA GAC GCA GCC GGA GGC GGT GTA GTC TAA TAC TCA TCC TGA TGC TGT TTA TTC  .6.6.48.4.6.44
AAC ACC GAA GAC GCA GCC GGA GGC GGT GTA GTC GTT TAA TAC TCA TCC TGA TGC TTA TTC  .6.6.48.4.4.6
AAC CAC GAA GAC GCA GCC GGA GGC GTA GTC GTG GTT TAA TAC TCA TCC TGA TGC TTA TTC  .6.6.48.4..
AAC ATC CAC CTC GAA GAC GAG GAT GCA GCC GGC GTA GTC GTG GTT TAA TAC TGC TTA TTC  .6.6.48.4.48.
AAC ATC CAC GAA GAC GAT GCA GCC GGA GGC GTA GTC GTG GTT TAA TAC TCC TGC TTA TTC  .6.6.54.4.4.8
AAC ACC ATC GAA GAC GAT GCA GCC GGA GGC GGT GTA GTC GTT TAA TAC TCC TGC TTA TTC  .6.6.7.6.4.6
AAC ATC CAC CAG CTC CTG GAA GAC GAG GAT GCC GGC GTA GTC GTG GTT TAA TAC TTA TTC  .6.6.78..48.
AAC AAT ACC AGC ATC ATT GAA GAC GAT GCC GCT GGA GGC GGT GTA GTC GTT TAC TCC TTC  .6.6.84.4.6.44
AAC AAT ACC AGC ATC ATT CTC GAA GAC GAG GAT GCC GCT GGC GGT GTA GTC GTT TAC TTC  .6.6.84.4.4.6
AAC AAT AGC ATC ATT CAC CTC GAA GAC GAG GAT GCC GCT GGC GTA GTC GTG GTT TAC TTC  .6.6.84.4..
AAC AAT ATC ATT CAC CAG CTC CTG GAA GAC GAG GAT GCC GGC GTA GTC GTG GTT TAC TTC  .6.6.84.4.48.
ACA AGA CCA CCG CGA CGG CTA GCA GGA TAA TAG TCA TCC TCG TCT TGA TGC TGG TGT TTA  .6..6.76.6.44
AGA CAA CCA CCG CGA CGG CTA GCA GGA TAA TAG TCA TCC TCG TCT TGA TGC TGG TTA TTG  .6..6.76.8.
CAA CCA CCG CGA CGG CTA GAA GCA GGA TAA TAG TCA TCC TCG TGA TGC TGG TTA TTC TTG  .6..6.76..
CAA CAG CCA CCG CGA CGG CTA CTG GAA GGA TAA TAG TCA TCC TCG TGA TGG TTA TTC TTG  .6..6.76.4.8
CAA CAC CAG CCG CGA CGG CTA CTC CTG GAA GAG GTG TAA TAG TCA TCG TGA TTA TTC TTG  .6..6.76.48.
CAA CAC CAG CTC CTG GAA GAC GAG GCC GGC GTA GTC GTG TAA TAC TCA TGA TTA TTC TTG  .6..8.64.48.
ATC CAA CAC CAG CTC CTG GAA GAC GAG GAT GCC GGC GTA GTC GTG TAA TAC TTA TTC TTG  .6..4.58.48.
CAA CAG CCA CCG CGA CGG CTA CTC CTG GAA GAG TAA TAG TCA TCG TGA TGG TTA TTC TTG  .6.8.6.7.4.8
CAA CAC CAG CCG CGG CTA CTC CTG GAA GAC GAG GTC GTG TAA TAG TCA TGA TTA TTC TTG  .6.8..64.48.
AGA CAA CAG CCA CCG CGA CGG CTA CTG GGA TAA TAG TCA TCC TCG TCT TGA TGG TTA TTG  .6.4.6.64.4.6
ATG CAA CAC CAG CAT CCG CGG CTA CTC CTG GAA GAC GAG GTC GTG TAA TAG TTA TTC TTG  .6.4..58.48.
AAC AAT ACC ACT AGC AGT ATC ATT GAA GAC GAT GCC GCT GGA GGC GGT GTC GTT TCC TTC  .6.4.66.4.6.44
ATG CAA CAC CAG CAT CCG CGA CGG CTA CTC CTG GAA GAG GTG TAA TAG TCG TTA TTC TTG  .6..6.58.48.
AAC AAT ACC ACT AGC AGT ATC ATT CTC GAA GAC GAG GAT GCC GCT GGC GGT GTC GTT TTC  .6..6.4.6.44
AAC AAG AAT ACC ACT AGC AGT ATC ATT CTC CTT GAC GAG GAT GCC GCT GGC GGT GTC GTT  .6.6.54.4.6.44
AAC AAG AAT ATC ATT CAC CAG CTC CTG CTT GAC GAG GAT GCC GGC GTA GTC GTG GTT TAC  .6.6.54.4..8
AGA AGG CAA CAG CCA CCG CCT CGA CGG CTA CTG TAA TAG TCA TCG TCT TGA TGG TTA TTG  .6.48.6.4.6.44
AAG AGG CAA CAG CCA CCG CCT CGA CGG CTA CTG CTT TAA TAG TCA TCG TGA TGG TTA TTG  .6.48.6.4.4.6
AAG CAA CAG CCA CCG CGA CGG CTA CTC CTG CTT GAG TAA TAG TCA TCG TGA TGG TTA TTG  .6.48.6.4..
AAG ATG CAA CAC CAG CAT CCG CGA CGG CTA CTC CTG CTT GAG GTG TAA TAG TCG TTA TTG  .6.48.6.4.48.
AAC AAG AAT ACG ACT AGG AGT ATG ATT CAC CAG CAT CCG CCT CGG CGT CTG CTT GTG GTT  .6.48.4.4.6.44
AAC AAG AAT ATG ATT CAC CAG CAT CCG CGG CTA CTC CTG CTT GAC GAG GTC GTG GTT TAG  .6.48.4.4..8
AAG ATG CAA CAG CAT CCA CCG CGA CGG CTA CTC CTG CTT GAG TAA TAG TCG TGG TTA TTG  .6.54.6.4.4.8
AAG ATG CAA CAC CAG CAT CCG CGG CTA CTC CTG CTT GAC GAG GTC GTG TAA TAG TTA TTG  .6.54...4.6
AAG AAT ACG ACT AGG AGT ATG ATT CAA CAC CAG CAT CCG CCT CGG CGT CTG CTT GTG TTG  .6.6..4.6.44
AAG AAT ACG ATG ATT CAA CAC CAG CAT CCG CGG CGT CTA CTC CTG CTT GAG GTG TAG TTG  .6.6..4..8
AAG AAT ATG ATT CAA CAC CAG CAT CCG CGG CTA CTC CTG CTT GAC GAG GTC GTG TAG TTG  .6.6..4.8.
AAG AGG ATG CAA CAG CAT CCA CCG CCT CGA CGG CTA CTG CTT TAA TAG TCG TGG TTA TTG  .6.66.8...8
AAG AAT ACG ACT AGG AGT ATG ATT CAA CAG CAT CCA CCG CCT CGG CGT CTG CTT TGG TTG  .6.66.4.4.6.44
AAG AAT ACG AGG ATG ATT CAA CAG CAT CCA CCG CCT CGG CGT CTA CTG CTT TAG TGG TTG  .6.7.8.4.6.44
AAG AAT ACG AGG ATG ATT CAA CAC CAG CAT CCG CCT CGG CGT CTA CTG CTT GTG TAG TTG  .6.7.8.4..8
ACA ACT AGA AGT CCA CGA GCA GCC GGA GGC TAA TCA TCC TCG TCT TGA TGC TGG TGT TTA  ..6..7.6.44
ACA ACC AGA CGA GCA GCC GGA GGC GGT GTA TAA TAC TCA TCC TCG TCT TGA TGC TGT TTA  ..6..5.6.44
ACA ACC AGA GAC GCA GCC GGA GGC GGT GTA GTC TAA TAC TCA TCC TCT TGA TGC TGT TTA  ..6.6.46..8
ACA ACC ACT AGA AGT GAC GCA GCC GGA GGC GGT GTC TAA TCA TCC TCT TGA TGC TGT TTA  ..6.66.6.6.44
AAT ACA ACC ACT AGA AGC AGT ATC ATT GAC GAT GCC GCT GGA GGC GGT GTC TCC TCT TGT  ..6.7..6.44
AAC AAT ACC ACT AGA AGC AGT ATC ATT GAC GAT GCC GCT GGA GGC GGT GTC GTT TCC TCT  ..6.78.4.6.44
ACA ACT AGA AGT CCA CCG CGA CGG GCA GGA TAA TCA TCC TCG TCT TGA TGC TGG TGT TTA  ...6.7.6.44
ACA ACC ACT AGA AGT CGA GCA GCC GGA GGC GGT TAA TCA TCC TCG TCT TGA TGC TGT TTA  ..8.48..6.44
AAT ACA ACC ACT AGA AGC AGT ATT GAC GCC GCT GGA GGC GGT GTC TCA TCC TCT TGA TGT  ..8.6..6.44
AAC AAT ACC ACT AGA AGC AGG AGT ATC ATT CCT GAC GAT GCC GCT GGC GGT GTC GTT TCT  ..4.6.4.6.44
AAC AAG AAT ACT AGC AGT ATC ATT CAC CTC CTT GAC GAG GAT GCC GCT GGC GTC GTG GTT  ..4.6.4.8.
AAC AAG AAT AGC ATC ATT CAC CTC CTT GAC GAG GAT GCC GCT GGC GTA GTC GTG GTT TAC  ..4.6.4.4.6
ACA AGA AGG CCA CCG CCT CGA CGG CTA GCA TAA TAG TCA TCG TCT TGA TGC TGG TGT TTA  ...6.5.6.44
AAT ACA ACC ACG ACT AGA AGC AGT ATT CGT GCC GCT GGA GGC GGT TCA TCC TCT TGA TGT  ...48..6.44
AAC AAG AAT ACC ACT AGC AGG AGT ATC ATT CCT CTT GAC GAT GCC GCT GGC GGT GTC GTT  ...54.4.6.44
ACA AGA AGG CAG CCA CCG CCT CGA CGG CTA CTG TAA TAG TCA TCG TCT TGA TGG TGT TTA  ..6.6.46..8
AAT ACA ACC ACG ACT AGA AGC AGG AGT ATT CCT CGT GCC GCT GGC GGT TCA TCT TGA TGT  ..6.4..6.44
AAC AAG AAT ACC ACG ACT AGC AGG AGT ATC ATT CCT CGT CTT GAT GCC GCT GGC GGT GTT  ..6.48.4.6.44
AAC AAG AAT ACT AGT ATC ATT CAC CAG CTC CTG CTT GAC GAG GAT GCC GGC GTC GTG GTT  ..6.48.4.8.
ACA ACT AGA AGG AGT CCA CCG CCT CGA CGG GCA TAA TCA TCG TCT TGA TGC TGG TGT TTA  ..4..6.6.44
AAT ACA ACC ACG ACT AGA AGC AGG AGT ATT CCG CCT CGG CGT GCT GGT TCA TCT TGA TGT  ..4.6..6.44
AAT ACA ACG ACT AGA AGC AGG AGT ATT CCA CCG CCT CGG CGT GCT TCA TCT TGA TGG TGT  ..48...6.44
AAC AAG AAT ACC ACG ACT AGC AGG AGT ATG ATT CAT CCG CCT CGG CGT CTT GCT GGT GTT  ..48.6.4.6.44
AAC AAG AAT ACT AGT ATG ATT CAC CAG CAT CCG CGG CTC CTG CTT GAC GAG GTC GTG GTT  ..48.6.4.8.
ACA ACT AGA AGG AGT CAG CCA CCG CCT CGA CGG CTG TAA TCA TCG TCT TGA TGG TGT TTA  ..54.8.6.6.44
AAT ACA ACG ACT AGA AGG AGT ATT CAG CCA CCG CCT CGG CGT CTG TCA TCT TGA TGG TGT  ..54.4..6.44
AAC AAG AAT ACC ACG ACT AGG AGT ATG ATT CAG CAT CCG CCT CGG CGT CTG CTT GGT GTT  ..54..4.6.44
AAC AAG AAT ACG ACT AGT ATG ATT CAC CAG CAT CCG CGG CGT CTC CTG CTT GAG GTG GTT  ..54..4.8.
AAG AAT ACA ACC ACG ACT AGG AGT ATG ATT CAG CAT CCG CCT CGG CGT CTG CTT GGT TGT  ..6.4.4.6.44
AAC AAG AAT ACG ATG ATT CAC CAG CAT CCG CGG CGT CTA CTC CTG CTT GAG GTG GTT TAG  ..6.4.4.4.6
AAT ACA ACG ACT AGA AGG AGT ATG ATT CAG CAT CCA CCG CCT CGG CGT CTG TCT TGG TGT  ..66...6.44
AAG AAT ACA ACG ACT AGG AGT ATG ATT CAG CAT CCA CCG CCT CGG CGT CTG CTT TGG TGT  ..66.8.4.6.44
AAT ACA ACC ACG ACT AGA AGC AGT ATC ATT CGT GAT GCC GCT GGA GGC GGT TCC TCT TGT  .8.4.48..6.44
AAC AAT ACC ACG ACT AGA AGC AGG AGT ATC ATT CCT CGT GAT GCC GCT GGC GGT GTT TCT  .8.4.54.4.6.44
AAT ACA ACC ACG ACT AGA AGC AGG AGT ATC ATT CCT CGT GAT GCC GCT GGC GGT TCT TGT  .8..4..6.44
AAT ACA ACC ACG ACT AGA AGC AGG AGT ATG ATT CAT CCG CCT CGG CGT GCT GGT TCT TGT  .8.4...6.44
AAT ACA ACG ACT AGA AGC AGG AGT ATG ATT CAT CCA CCG CCT CGG CGT GCT TCT TGG TGT  .8.48.4..6.44
AAG AAT ACA ACC ACG ACT AGC AGG AGT ATG ATT CAT CCG CCT CGG CGT CTT GCT GGT TGT  .8.54.4.4.6.44

Table 3 List of the 88 self-complementary circular codes generated by the Nucleotide Frequency method (NF method) according to the 6 probabilities p_1(A) = p_3(T), p_1(C) = p_3(G), p_1(G) = p_3(C), p_1(T) = p_3(A), p_2(A) = p_2(T) and p_2(C) = p_2(G).

                       Entropy in the frame 0 modulo 3   Classical entropy H_n
Nucleotide (n=1)       H_NF = .965                       H_1 = .998
Trinucleotide (n=3)    H_TF = .98                        H_3 = .984

Table 4 Computation of different types of entropies (bit per nucleotide) from the occurrence frequencies of the 64 trinucleotides in the frame 0 modulo 3 and in the 3 frames (average frame) of prokaryotic protein coding genes (686 sequences, 478758 trinucleotides; data from Arquès & Michel, 1996, p. 49).