
J. theor. Biol. (1984) 107, 697-704

LETTER TO THE EDITOR

Is the Information Content of DNA Evolutionarily Significant?


It has been suggested (Subba Rao, Hamid & Subba Rao, 1979; Subba Rao, Geevan & Subba Rao, 1982) that the information content of the coding regions in DNA tends to increase with evolution and, therefore, is a suitable indicator of evolutionary progress. In order to re-examine this hypothesis, I have modified the method used by Subba Rao et al. (1982) in such a way that the numerical results are much less sensitive to the amino acid composition of the polypeptide corresponding to the DNA sequence under consideration. By using this modified procedure, I present evidence that the hypothesis of Subba Rao et al. (1982) is not valid for a wide range of evolving genes.

As a measure of the information content of a coding region in a DNA sequence, Subba Rao et al. (1982) have used the following function:

$$H = -\sum_{i=1}^{n} p_i \log_2 p_i \tag{1}$$

where $p_i$ is the frequency of the $i$th codon in the DNA sequence under consideration. This function, first defined by Shannon (1948), is called the entropy of a probability distribution. Shannon (1948) has demonstrated that $H$ reaches its maximum value $H_{\max} = \log_2 n$ when all the codons are equiprobable (i.e. $p_i = 1/n$ for every $i$). $H_{\max}$ depends only on the number of possible codons, $n = 64$, and is the same ($\log_2 64 = 6$ bits/codon) for all coding regions in DNA regardless of their length and origin. The mathematical consequence of this fact is that the entropy $H$ depends on the relative frequencies of codon usage and is independent of the length of a DNA coding region. Because of the degeneracy of the genetic code, the entropy of a DNA sequence depends on the amino acid composition of the corresponding polypeptide, regardless of its length. Therefore, we should not use the $H$ function to compare genes coding for different polypeptides. We can, however, use this function to compare genes coding for the same polypeptide either in the same genome (a multigene family) or in different species.

Note that the smallest value of $H$ is equal to zero. This value occurs in 64 hypothetical cases when only one of 64 codons is used with frequency
0022-5193/84/080697+08 $03.00/0 © 1984 Academic Press Inc. (London) Ltd.


of 1 and all others are not present in a coding region of a DNA sequence. All such cases can be considered to be the most "non-equiprobable" distributions of the codon usage frequencies. Accordingly, the entropy $H$ is a measure of the deviation from "non-equiprobability" in codon usage. The appropriate complementary measure of the deviation from equiprobability is the following linear function of $H$, called redundancy (Shannon, 1948):

$$R = 1 - H/H_{\max}. \tag{2}$$
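Equations (1) and (2) can be illustrated with a minimal Python sketch. The function names `entropy` and `redundancy` are chosen here for illustration and do not come from the original letter; the two test distributions are the limiting cases discussed in the text (all 64 codons equiprobable, and a single codon used exclusively).

```python
import math

N_CODONS = 64  # n in equation (1)

def entropy(freqs):
    """Shannon entropy in bits, equation (1). Unused codons contribute nothing."""
    total = sum(freqs)
    return sum(-(f / total) * math.log2(f / total) for f in freqs if f > 0)

def redundancy(freqs, n=N_CODONS):
    """Redundancy R = 1 - H/Hmax with Hmax = log2(n), equation (2)."""
    return 1.0 - entropy(freqs) / math.log2(n)

uniform = [1] * N_CODONS               # every codon used equally often
single = [1] + [0] * (N_CODONS - 1)    # only one codon ever used

print("uniform:", entropy(uniform), redundancy(uniform))
print("single: ", entropy(single), redundancy(single))
```

For the uniform case the entropy should come out at 6 bits/codon with zero redundancy, and for the single-codon case the entropy should vanish with redundancy 1.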

$R$ is equal to 0 when $H = H_{\max}$ and to 1 when $H = 0$. Note that the 64 cases in which the entropy of DNA is equal to 0 are not biologically significant because they correspond to polypeptides, such as poly-Leu and poly-Arg, which presumably do not play any biological role. The "biologically significant" minimum value of $H$ is equal to the entropy $H_p$ of the protein corresponding to the coding region of DNA and is given by the expression:

$$H_p = -\sum_{j} \pi_j \log_2 \pi_j \tag{3}$$

where $\pi_j$ is the probability of amino acid $j$ in the polypeptide corresponding to the DNA sequence under consideration. The maximum "biologically significant" value of $R$ is therefore equal to:

$$R_{\max} = 1 - H_p/H_{\max} \tag{4}$$

and not to 1 as in the hypothetical 64 cases of $H = 0$. For this reason I will use the relative deviation from equiprobability defined as follows:

$$D = R/R_{\max}. \tag{5}$$
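As a small check of equations (2), (4) and (5), the sketch below computes $D$ from a pair of entropies. The helper name `relative_deviation` is hypothetical; the three $(H, H_p)$ pairs are copied from Table 1 of this letter. Note that $D$ reduces algebraically to $(H_{\max} - H)/(H_{\max} - H_p)$.

```python
import math

H_MAX = math.log2(64)  # 6 bits/codon for any coding region

def relative_deviation(h, h_p, h_max=H_MAX):
    """D = R / Rmax, following equations (2), (4) and (5)."""
    r = 1.0 - h / h_max        # redundancy of the codon distribution
    r_max = 1.0 - h_p / h_max  # largest biologically significant redundancy
    return r / r_max

# (H, Hp) pairs as listed in Table 1
table1 = {
    "RNA viruses": (5.810, 4.133),
    "Bacteria": (5.631, 4.133),
    "Man": (5.522, 4.155),
}

for group, (h, h_p) in table1.items():
    print(group, round(relative_deviation(h, h_p), 3))
```

Rounded to three decimals, these entries should agree with the $D$ column of Table 1.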

The function $D$ varies from 0 to 1 and is independent of the length of the DNA sequence under consideration. In contrast to $H$, the values of $D$ are insensitive to variations in the amino acid compositions of the polypeptides corresponding to the DNA sequences studied. This insensitivity is due to the normalizing effect of $R_{\max}$ on $R$.

Evidence has been presented that the rate of synonymous (silent) substitutions is much higher than the rate of amino acid substitution (Grunstein, Schede & Kedes, 1976; Kimura, 1977; Perler et al., 1980; Jukes, 1980; Miyata, Yasunaga & Nishida, 1980; Miyata et al., 1982). Also, the rate of synonymous substitution is roughly uniform ($5.37 \times 10^{-9}$/site/yr) for different genes regardless of their type and origin (Miyata et al., 1980, 1982; Miyata & Yasunaga, 1980).


Therefore, I also consider the relative deviations from equiprobability in synonymous codon usage defined by the expression:

$$D_{\mathrm{syn}} = (H_e - H)/(H_e - H_p) \tag{6}$$

where $H_e$ is the entropy of a DNA coding region in the hypothetical case of equiprobable synonymous codon usage (i.e. the $p_i$ are equal to $\pi_j/r_j$ for all $r_j$ synonymous codons for a given amino acid $j$). Note that substitution of the probabilities $p_i = \pi_j/r_j$ into equation (1) yields the following expression for $H_e$:

$$H_e = -\sum_{j} \pi_j \log_2 (\pi_j/r_j). \tag{7}$$
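The substitution leading to equation (7) can be written out step by step by grouping the codon sum of equation (1) by amino acid. The symbol $C_j$, denoting the set of $r_j$ synonymous codons of amino acid $j$, is introduced here only for illustration and is not part of the original notation.

```latex
\begin{align*}
H_e &= -\sum_{j} \sum_{i \in C_j} \frac{\pi_j}{r_j} \log_2 \frac{\pi_j}{r_j}
     = -\sum_{j} r_j \, \frac{\pi_j}{r_j} \log_2 \frac{\pi_j}{r_j}
     = -\sum_{j} \pi_j \log_2 \frac{\pi_j}{r_j} \\
    &= -\sum_{j} \pi_j \log_2 \pi_j + \sum_{j} \pi_j \log_2 r_j
     = H_p + \sum_{j} \pi_j \log_2 r_j .
\end{align*}
```

The last line shows that $H_e \ge H_p$, with the difference set entirely by the degeneracy of the code weighted by amino acid composition. Since equal use of synonymous codons maximizes $H$ for a fixed amino acid composition, $H_p \le H \le H_e$, so $D_{\mathrm{syn}}$ in equation (6) lies between 0 and 1.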

A numerical example of the computation of $D$ and $D_{\mathrm{syn}}$ is presented in the Appendix.

Subba Rao et al. (1982) have pointed out that mutations in the human hemoglobin genes tend to occur such that the frequency of a codon which mutates is greater than the frequency of the codon to which it mutates. By using this observation, Subba Rao et al. (1982) have concluded that the codon frequency distribution should be "more equiprobable" after a mutation than before it. The entropy of a coding region in a DNA should therefore be a non-decreasing function of the number of DNA generations (the value of $H$ does not change if no mutations occur, whereas $H$ tends to increase if mutations occur). From this statement, Subba Rao et al. (1982) have formed their main conclusion that the information content of DNA tends to increase with evolution.

In order to determine whether the above hypothesis is valid for a large variety of evolving genes, I have first compared the relative deviations from equiprobability for the average mRNA sequences from different groups of organisms. Secondly, I have considered the $D$ values for quickly and slowly evolving genes.

The values of $H$, $H_p$, $D$ and $D_{\mathrm{syn}}$ for average genes from different groups of organisms are listed in Table 1. Codon usage data were taken from Grantham et al. (1981) for non-mitochondrial genes and from Anderson et al. (1981) for human mitochondrial genes. We can see from the fourth column of this table that the $D$ values are greater for eukaryotic organisms and bacteria than for viruses. It also appears that all the organisms listed in Table 1, except the viruses, tend to use the codons with comparable deviation from equiprobability. Despite the fact that mitochondrial genes are known to evolve about seven-fold faster than nuclear protein coding sequences (Brown et al., 1982; Miyata et al., 1982), the $D$ values for the average human mitochondrial and nuclear genes are almost the same. Moreover, the value of $D$ for the average

TABLE 1

The divergencies from equiprobability for average mRNA sequences from different groups of organisms

Group of organisms     Entropy of mRNA   Entropy of protein   D       D_syn
RNA viruses            5.810             4.133                0.102   0.019
DNA viruses            5.757             4.194                0.134   0.049
Bacteria               5.631             4.133                0.198   0.113
Non-mammals            5.600             4.090                0.209   0.127
Mammals                5.635             4.173                0.200   0.126
Man                    5.522             4.155                0.259   0.182
Human mitochondria     5.405             3.760                0.266   0.160

mitochondrial gene is greater than the value for the corresponding human nuclear gene. This suggests that the relative deviation from equiprobability $D$ (and hence the entropy $H$) is not a sensitive indicator of the evolutionary differences between the average genes from different species. The same conclusion may be drawn from comparison of the $D_{\mathrm{syn}}$ values listed in the fifth column of Table 1.

Arguments against the hypothesis of Subba Rao et al. (1982) also emerge from the comparison of $D$ values for quickly and slowly evolving genes. The eukaryotic globin genes are known to be evolving 30-300 fold faster than the histone genes (Dayhoff, 1978). In the light of the Subba Rao et al. hypothesis, one would therefore expect the $D$ values to be greater for histones than for globin genes. In Table 2 we have listed the values of $H$, $H_p$, $D$ and $D_{\mathrm{syn}}$ for several histone and globin genes using data on codon usage taken from Grantham et al. (1981). It appears from this table that the $D$ and $D_{\mathrm{syn}}$ values for histones are generally lower than the values for globin genes: this is in contrast to the implications of the hypothesis under consideration. Furthermore, the entropy $H$ does not appear to be a good indicator of evolutionary differences between histones (slowly evolving proteins). Whereas the relative evolutionary rates for H2B, H2A and H3 histones are equal to 0.9, 0.5 and 0.14 PAMs per 100 million years respectively (Dayhoff, 1978), deviations from equiprobability in codon usage decrease in the order H2B > H3 > H2A (see Table 2). Again, this is not in agreement with the implications of Subba Rao et al.'s hypothesis.

Comparison of $D_{\mathrm{syn}}$ values for quickly evolving genes (e.g. α-globin and pseudo α-globin genes) also does not support the hypothesis that $H$ tends to increase with evolution. The rate of synonymous substitutions in

TABLE 2

The divergencies from equiprobability for quickly and slowly evolving genes

Gene                 Entropy of mRNA   Entropy of protein   D       D_syn
Histone H1†          5.113             3.838                0.410   0.234
Histone H2A          5.289             3.840                0.329   0.211
Histone H2B†         4.993             3.963                0.495   0.400
Histone H3           5.125             3.888                0.415   0.308
Mouse Ig‖            5.407             4.057                0.306   0.205
Mouse α-globin       4.987             3.863                0.474   0.362
Rabbit α-globin      4.661             3.969                0.659   0.595
Human α-globin       4.607             3.848                0.647   0.564
Chicken α-globin     4.889             3.965                0.546   0.451
Mouse β-globin I     5.106             4.016                0.451   0.352
Mouse β-globin A     5.122             4.010                0.442   0.336
Rabbit β-globin      4.907             3.932                0.529   0.419
Human β-globin       4.846             3.944                0.561   0.455
Chicken β-globin     5.001             4.109                0.528   0.462

† From Ps. miliaris.
Mean values of $H$, $H_p$, $D$ and $D_{\mathrm{syn}}$ for Ps. miliaris and Strongylocentrotus purpuratus.
‖ Mean values of $H$, $H_p$, $D$ and $D_{\mathrm{syn}}$ for the eight mouse immunoglobulin genes.

pseudogenes is about two-fold greater than the rate for functional nuclear genes (Miyata et al., 1982). For the human pseudo α-globin gene I have found $D_{\mathrm{syn}} = 0.571$, a value comparable to that for the functional human α-globin gene (see Table 2). A similar result has been obtained for the comparison of the mouse α-globin gene with a corresponding pseudogene ($D_{\mathrm{syn}}$ is equal to 0.362 and 0.366 for the α-globin gene and its pseudogene respectively; data on pseudo α-globin genes were taken from the EMBL Nucleic Acids Sequences Library). Both these analyses suggest that $H$ does not tend to increase with evolution of quickly evolving genes. This completes the arguments against the Subba Rao et al. (1982) hypothesis.

In the light of the above conclusion, the results described by Subba Rao et al. (1982) for human hemoglobin genes seem to be artifacts. This is supported by the evidence of a predominant selection against premature STOP codons in hemoglobin evolution (Modiano, Battistuzzi & Motulsky, 1981) and the statistical evidence of selection against Pro and Cys site substitution (Golding & Strobeck, 1982).

An observation arising from the calculations reported in this study is that the genes which occur in clusters (such as α-globins and histones) have higher values of $D$ and $D_{\mathrm{syn}}$ than the genes which occur as single copies (such as viral genes). I have also compared the $D$ and $D_{\mathrm{syn}}$ values for average highly and weakly expressed bacterial genes using codon usage


data taken from Gouy & Gautier (1982). The $D$ values are equal to 0.439 and 0.137 for highly and weakly expressed bacterial genes respectively. The corresponding values of $D_{\mathrm{syn}}$ are equal to 0.345 and 0.061. This suggests that highly expressed genes have greater $D$ and $D_{\mathrm{syn}}$ values than weakly expressed genes. In the light of these observations, the idea that the $D$ and $D_{\mathrm{syn}}$ values (and hence the entropy $H$) could be indicators of error-protectivity in a single act of protein biosynthesis seems to be worth further development.

The conclusion which can be drawn from this letter is that the Subba Rao et al. hypothesis is not valid in the four general cases studied above and thus, by extrapolation, may be invalid for a large variety of evolving genes. Although the question of the validity of this hypothesis in the case of tightly constrained genomes (viruses and phages) still remains open, preliminary calculations indicate that the information content in these systems also does not tend to increase with evolution. It also appears that the deviation from equiprobability in codon usage could be an important indicator of error-protectivity in a single act of protein biosynthesis.

I would like to thank Professor Manfred Eigen for discussions regarding the notion of deviation from equiprobability. The anonymous referee and Dr Tom Jovin are acknowledged for pointing out that the organisms which have tightly constrained genome sizes also have low deviations from equiprobability in codon usage. L. McIntosh and C. Letande are acknowledged for their careful reading of the manuscript. This work was supported by the Max-Planck-Gesellschaft.
ANDRZEJ KONOPKA

Max-Planck-Institut für Biophysikalische Chemie,
D-3400 Göttingen, West Germany

(Received 29 September, and in final form 26 November 1983)

REFERENCES

ANDERSON, S. et al. (1981). Nature 290, 457.
BROWN, W. M., PRAGER, E. M., WANG, A. & WILSON, A. C. (1982). J. mol. Evol. 18, 225.
DAYHOFF, M. (1978). Atlas of Protein Sequence and Structure. Vol. 5. Supplement 3. Washington: National Biomedical Research Foundation.
GOLDING, G. B. & STROBECK, C. (1982). J. mol. Evol. 18, 379.
GOUY, M. & GAUTIER, C. (1982). Nucleic Acids Res. 10, 7055.
GRANTHAM, R., GAUTIER, C. & GOUY, M. (1980). Nucleic Acids Res. 8, 1893.
GRANTHAM, R., GAUTIER, C., GOUY, M., JACOBZONE, M. & MERCIER, R. (1981). Nucleic Acids Res. 9, r43.
GRUNSTEIN, M., SCHEDE, P. & KEDES, L. (1976). J. mol. Biol. 104, 351.
JUKES, T. H. (1980). Science 210, 973.
KIMURA, M. (1977). Nature 267, 275.
MIYATA, T. & YASUNAGA, T. (1980). J. mol. Evol. 16, 23.
MIYATA, T., YASUNAGA, T. & NISHIDA, T. (1980). Proc. natn. Acad. Sci. U.S.A. 77, 7328.
MIYATA, T. et al. (1982). J. mol. Evol. 19, 28.


MODIANO, G., BATTISTUZZI, G. & MOTULSKY, A. G. (1981). Proc. natn. Acad. Sci. U.S.A. 78, 1110.
PERLER, F. et al. (1980). Cell 20, 555.
SHANNON, C. E. (1948). Bell Syst. Tech. J. 27, 379.
SUBBA RAO, G., HAMID, Z. & SUBBA RAO, J. (1979). J. theor. Biol. 81, 803.
SUBBA RAO, J., GEEVAN, C. P. & GITA SUBBA RAO (1982). J. theor. Biol. 96, 571.

APPENDIX

TABLE 3

Frequencies of codon usage in M13 gene 1 (taken from Grantham et al., 1980)

Amino acid   Codon and frequency (x 10^3)
Arg          CGA 0,  CGC 9,  CGG 0,  CGU 28, AGA 14, AGG 9
Leu          CUA 5,  CUC 14, CUG 9,  CUU 37, UUA 42, UUG 14
Ser          UCA 0,  UCC 9,  UCG 5,  UCU 42, AGC 9,  AGU 5
Thr          ACA 0,  ACC 5,  ACG 9,  ACU 42
Pro          CCA 0,  CCC 0,  CCG 14, CCU 23
Ala          GCA 14, GCC 0,  GCG 5,  GCU 37
Gly          GGA 9,  GGC 28, GGG 14, GGU 23
Val          GUA 9,  GUC 9,  GUG 0,  GUU 60
Lys          AAA 51, AAG 23
Asn          AAC 14, AAU 33
Gln          CAA 14, CAG 23
His          CAC 0,  CAU 14
Glu          GAA 14, GAG 9
Asp          GAC 9,  GAU 65
Tyr          UAC 5,  UAU 37
Cys          UGC 9,  UGU 5
Phe          UUC 5,  UUU 28
Ile          AUA 14, AUC 0,  AUU 47
Met          AUG 5
Trp          UGG 23


As a numerical example of the computation of $D$, consider the frequencies of codon usage for M13 gene 1 taken from Grantham et al. (1981) and listed in Table 3. According to equation (1) the entropy $H$ for this frequency distribution is equal to 5.2820 bits/codon. $H_{\max}$ is equal to six bits/codon for every DNA coding region. Thus for the gene under consideration, $R = 0.1197$. The frequencies of amino acids are equal to the sums of the frequencies of their synonymous codons (for example, $\pi_{\mathrm{Arg}} = 0.0 + 0.009 + 0.0 + 0.028 + 0.014 + 0.009 = 0.060$). The entropy $H_p$ is then equal to 4.0911 bits/codon. The biologically significant maximum deviation from equiprobability according to equation (4) is equal to $R_{\max} = 1 - 4.0911/6 = 0.3182$ and thus $D = 0.1197/0.3182 = 0.3760$.

As a brief example of the computation of $D_{\mathrm{syn}}$, let us once again consider the data on the M13 gene 1 listed in Table 3. For computation of $H_e$ we require the values of $\pi_j$ and $\pi_j/r_j$ for every amino acid $j$. As an example let us compute these values for arginine. The frequency of arginine is equal to $\pi_{\mathrm{Arg}} = 0.060$. The degree of degeneracy for arginine is equal to $r_{\mathrm{Arg}} = 6$ and therefore $\pi_{\mathrm{Arg}}/r_{\mathrm{Arg}} = 0.060/6 = 0.010$. Substitution of the values obtained in this way for all amino acids into equation (7) yields $H_e = 5.7983$ bits/codon. Accordingly, formula (6) yields $D_{\mathrm{syn}} = (5.7983 - 5.2821)/(5.7983 - 4.0911) = 0.3024$.
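The worked example above can be retraced with a short Python sketch. The per-codon counts are transcribed from Table 3 in the order given there; the names `M13_GENE1`, `entropy` and `deviations` are chosen here for illustration. One assumption is made that the letter does not state: because the tabulated frequencies sum to 999 rather than 1000 (rounding), they are renormalized to sum to one before the entropies are computed.

```python
import math

# Codon usage of M13 gene 1 (frequency x 10^3), transcribed from Table 3,
# grouped by amino acid in the table's codon order. Stop codons are not listed.
M13_GENE1 = {
    "Arg": [0, 9, 0, 28, 14, 9], "Leu": [5, 14, 9, 37, 42, 14],
    "Ser": [0, 9, 5, 42, 9, 5],  "Thr": [0, 5, 9, 42],
    "Pro": [0, 0, 14, 23],       "Ala": [14, 0, 5, 37],
    "Gly": [9, 28, 14, 23],      "Val": [9, 9, 0, 60],
    "Lys": [51, 23], "Asn": [14, 33], "Gln": [14, 23], "His": [0, 14],
    "Glu": [14, 9],  "Asp": [9, 65],  "Tyr": [5, 37],  "Cys": [9, 5],
    "Phe": [5, 28],  "Ile": [14, 0, 47], "Met": [5],   "Trp": [23],
}

def entropy(probs):
    """Shannon entropy in bits; zero probabilities contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

def deviations(usage, n_codons=64):
    """Return (H, Hp, He, D, Dsyn) following equations (1)-(7)."""
    # Assumption: renormalize, since the rounded table entries sum to 999.
    total = sum(sum(counts) for counts in usage.values())
    codon_p = [f / total for counts in usage.values() for f in counts]
    amino_p = {aa: sum(counts) / total for aa, counts in usage.items()}

    h = entropy(codon_p)                      # equation (1)
    h_p = entropy(amino_p.values())           # equation (3)
    h_e = sum(-p * math.log2(p / len(usage[aa]))   # equation (7), r_j = len(...)
              for aa, p in amino_p.items() if p > 0)
    h_max = math.log2(n_codons)
    d = (1 - h / h_max) / (1 - h_p / h_max)   # equations (2), (4), (5)
    d_syn = (h_e - h) / (h_e - h_p)           # equation (6)
    return h, h_p, h_e, d, d_syn

h, h_p, h_e, d, d_syn = deviations(M13_GENE1)
print(f"H={h:.4f}  Hp={h_p:.4f}  He={h_e:.4f}  D={d:.4f}  Dsyn={d_syn:.4f}")
```

Under that renormalization assumption, the printed values should agree with those quoted in the Appendix to about three decimal places; small differences in the last digit are to be expected from rounding in the published table.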
