
J. theor. Biol. (1997) 187, 297–306

A Symmetrical Theory of DNA Sequences and Its Applications


Chun-Ting Zhang

Department of Physics, Tianjin University, Tianjin 300072, China

(Received on 1 August 1996, Accepted in revised form on 3 January 1997)

A unified symmetrical theory of DNA sequences has been established based on the basic symmetry of
the DNA bases. It is shown that the symmetry of DNA sequences is inherently related to that of a cube
and its inscribed regular tetrahedron. A DNA group is defined as a particular alternating group of degree
4, in which the permuted objects are the four bases. The symmetry of DNA sequences is described by the
DNA group which is isomorphic to the tetrahedral group. The matrix representation for the DNA
group has been obtained, and used to establish the relationships between the transforms of bases and
the rotations of the tetrahedron. It is found that any DNA sequence can be uniquely described by three
independent distributions, i.e., the distributions of the bases of purine/pyrimidine, of amino group/keto
group and of strong/weak hydrogen bonds along the sequence. The three distributions are invariant
in some sense under the transforms of the DNA group, indicating that the three distributions are
inherent for the sequence. The mathematical format of the theory lays a foundation for further
development. The applications of the theory to analyse some DNA sequences are presented.
© 1997 Academic Press Limited

1. Introduction

We are living in an age of information explosion, especially an explosion of biological information. DNA sequences constitute the primary biological information. As pointed out by Gilbert (1991), for 15 years the DNA database has grown by 60% a year, a factor of ten every 5 years. The human genome project has accelerated this rate of increase. In response to the large amount of DNA information in the database, a new paradigm, now emerging, is that the starting point of the study will be theoretical cases where most gene sequences are known (Gilbert, 1991). This paper is devoted to a theoretical study of the symmetry of DNA sequences. The symmetry of DNA sequences is an important and basic theoretical topic, although to our knowledge only a few studies on this topic have been published (Magarshak & Benham, 1992).

A DNA sequence consists of four kinds of bases, A, C, G and T. The sequence exhibits some symmetry which is based on that of the DNA bases. For example, three hydrogen bonds are formed in the G-C pair, while two hydrogen bonds are formed in the A-T pair. This symmetry of the Watson–Crick pair leads to a basic symmetry of the DNA double helix: the double helix remains invariant under the transform A ↔ T and G ↔ C. The symmetry of the Watson–Crick pair also leads to the fact that the G + C (or A + T) content of the sequence, rather than the content of individual G or C, has a clearer biological meaning. More examples of the symmetry of DNA sequences are related to the multiple recognition sequences of some enzymes. For instance, the restriction enzyme AccI recognizes the sequences 5'-GT (A or C) (G or T) AC-3' (Cornish-Bowden, 1985). There is a common amino group in both bases A and C, and a common keto group in both bases G and T. Therefore, the above multiple recognition sequences exhibit a symmetry of amino and keto groups. It is the aim of this paper to establish and present a theory that deals with all of the symmetry of DNA sequences in a unified form.


2. Symmetrical Theory of DNA Sequences

The starting point of our study is the observation of symmetry in the four DNA bases. The chemical structures of adenine (A), cytosine (C), guanine (G) and thymine (T) are shown in Fig. 1. The nomenclature of the bases and of the sets of bases follows Recommendation 1984 proposed by the NC-IUB (Cornish-Bowden, 1985). The bases can be divided into two classes, i.e.,

$$\text{Bases}\ \begin{cases} \text{purine}, & R = A, G, \\ \text{pyrimidine}, & Y = C, T. \end{cases}$$

The bases can also be divided into another two classes,

$$\text{Bases}\ \begin{cases} \text{amino group}, & M = A, C, \\ \text{keto group}, & K = G, T. \end{cases}$$

The above two classifications are based purely on the chemical structures, as shown in Fig. 1. However, there are stronger hydrogen (H) bonds in the G-C pair and weaker hydrogen bonds in the A-T pair in the double helix. Therefore, the bases can be further divided into another two classes according to the strength of the hydrogen bonds, i.e.,

$$\text{Bases}\ \begin{cases} \text{strong H-bonds}, & S = G, C, \\ \text{weak H-bonds}, & W = A, T. \end{cases}$$

Fig. 1. The chemical structures of the four DNA bases.

The symmetry of the DNA bases observed above may be represented by symmetrical geometrical entities, in a style somewhat analogous to that used by elementary particle theorists in assigning individual particles to symmetrical geometrical entities. The first geometrical entity that we use to represent the above symmetry is a cube. A cube has six identical faces, which can be divided into three pairs according to the three directions: the right/left faces (x direction), the front/back faces (y direction) and the up/down faces (z direction). Without loss of generality, we assign the purine/pyrimidine bases to the right/left faces, respectively. Similarly, we assign the bases of amino/keto group to the front/back faces, respectively. Finally, we assign the bases of weak/strong H-bonds to the up/down faces, respectively. Drawing the diagonals on each face in the manner shown in Fig. 2, we obtain a regular tetrahedron. The four vertices are referred to as A, C, G and T, respectively, as shown in Fig. 2.

Fig. 2. A cube and its inscribed regular tetrahedron, which are used to represent the symmetry of the DNA bases. As usual, A, C, G and T refer to the four DNA bases. Note that the coordinate system (origin O, axes X, Y and Z) is set up by using the three middle lines of the regular tetrahedron.

In addition to the vertices, there are four triangular faces in the tetrahedron, i.e., ΔCGT, ΔAGT, ΔACT and ΔACG. They correspond to the bases of non-A, non-C, non-G and non-T, and are denoted by B, D, H and V, respectively (Cornish-Bowden, 1985). There are six edges in the tetrahedron, i.e., AG, CT, AC, GT, AT and GC. They correspond to the bases of purine, pyrimidine, amino group, keto group, weak H-bonds and strong H-bonds, respectively. The edges AG and CT lie on the purine and pyrimidine faces, respectively, and AG ⊥ CT, as shown in Fig. 2. The edges AC and GT lie on the amino and keto faces, respectively, and AC ⊥ GT. Finally, the edges AT and GC lie on the faces of weak and strong H-bonds, respectively, and AT ⊥ GC. The above symmetrical geometrical entities, i.e., the cube
and the regular tetrahedron, are the foundation of the symmetrical theory presented in this paper. It should be pointed out that Trainor and co-workers made use of a similar tetrahedral representation for the codon sequences (Trainor et al., 1984).

Consider a DNA sequence with N bases. Inspect the sequence by stepping one base at a time. Let the steps be denoted by n, n = 1, 2, ..., N. Calculate $A_n$, $C_n$, $G_n$ and $T_n$ for the nth step, where $A_n$ ($C_n$, $G_n$, $T_n$) is the cumulative number of occurrences of base A (C, G, T) in the sub-sequence from the first to the nth base of the DNA sequence inspected. Obviously, $A_n$, $C_n$, $G_n$ and $T_n$ are all non-negative integers. The four integers ($A_n$, $C_n$, $G_n$, $T_n$) can be mapped onto a point $P_n$ in three-dimensional space, based on the symmetry of the regular tetrahedron (Zhang & Zhang, 1994). In the coordinate system set up in Fig. 2, the coordinates $x_n$, $y_n$ and $z_n$ of point $P_n$ can be expressed in terms of $A_n$, $C_n$, $G_n$ and $T_n$ (Zhang & Zhang, 1991a, b), i.e.,

$$\begin{cases} x_n = 2(A_n + G_n) - n, \\ y_n = 2(A_n + C_n) - n, \\ z_n = 2(A_n + T_n) - n, \end{cases} \qquad x_n, y_n, z_n \in [-n, n], \quad n = 1, 2, \ldots, N. \tag{1}$$

When n runs from 1 to N, we have points $P_1, P_2, \ldots, P_N$. The curve connecting the points $P_1, P_2, \ldots, P_N$ one by one is called the Z curve of the DNA sequence (Zhang & Zhang, 1994). The points $P_1, P_2, \ldots, P_N$ are referred to as the nodes of the Z curve. A DNA sequence is uniquely determined by the coordinates of the nodes of the Z curve. Therefore, a DNA sequence may be represented by a 3 × N matrix of the following form

$$(\text{DNA sequence}) = \begin{pmatrix} x_1 & x_2 & \cdots & x_N \\ y_1 & y_2 & \cdots & y_N \\ z_1 & z_2 & \cdots & z_N \end{pmatrix}. \tag{2}$$

There are apparent biological meanings in eqns (1), where $x_n$, $y_n$ and $z_n$ display, respectively, the distributions of bases of purine/pyrimidine, amino/keto group, and strong/weak H-bonds along the sequence. When the number of occurrences of bases of purine (amino group, weak H-bonds) in the sub-sequence from the first base through the nth base is greater than that of pyrimidine (keto group, strong H-bonds), then $x_n > 0$ ($y_n > 0$, $z_n > 0$); otherwise, $x_n < 0$ ($y_n < 0$, $z_n < 0$). The three distributions are uniquely determined by the given DNA sequence, and the DNA sequence is uniquely described by the given three distributions.

A regular tetrahedron (RT) is a body of high symmetry. The set of all rotations that bring the RT into coincidence with itself forms the tetrahedral group, or T-group. On the other hand, the set of all possible permutations of four symbols forms the symmetric group $S_4$. The alternating group $A_4$ is an invariant subgroup of $S_4$. The T-group and the $A_4$ group are isomorphic to each other. A DNA group is defined as a particular $A_4$ group in which the permuted objects are the four bases of a DNA molecule. In fact, the DNA group and the T-group are the same group from the point of view of abstract group theory. They have the same group structure and the same matrix representation.

To find the matrix representation of the DNA group, we study the rotational motions of the RT. Let R be a 3 × 3 rotation matrix representing a rotation of the coordinate system around some axis. Under the rotation R, the representation matrix [eqn (2)] for a DNA sequence is transformed into

$$\begin{pmatrix} x'_1 & x'_2 & \cdots & x'_N \\ y'_1 & y'_2 & \cdots & y'_N \\ z'_1 & z'_2 & \cdots & z'_N \end{pmatrix} = R \begin{pmatrix} x_1 & x_2 & \cdots & x_N \\ y_1 & y_2 & \cdots & y_N \\ z_1 & z_2 & \cdots & z_N \end{pmatrix}, \tag{3}$$

where $x'_1$, $y'_1$, $z'_1$ etc. are the new coordinates after the rotation. The order of the DNA group is 12. Referring to Fig. 2, we find that the elements of the DNA group consist of $I$, i.e., the identity operation; $R_x$, $R_y$ and $R_z$, i.e., the 180° rotations around the x, y and z axes, respectively; $R_A$, $R_C$, $R_G$ and $R_T$, i.e., the 120° rotations around the OA, OC, OG and OT axes, respectively; and $R_A^2$, $R_C^2$, $R_G^2$ and $R_T^2$, i.e., the 240° rotations around the OA, OC, OG and OT axes, respectively. The rotation matrix for a rotation of angle θ around the x axis is

$$R_x(\theta) = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos\theta & \sin\theta \\ 0 & -\sin\theta & \cos\theta \end{pmatrix}. \tag{4}$$

When θ = 180°, we find

$$R_x(180^\circ) = R_x = \begin{pmatrix} 1 & 0 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & -1 \end{pmatrix}. \tag{5}$$

The matrix representations for the other elements of the DNA group can be found similarly. The total representation of the DNA group is summarized as
Table 1
Multiplication table of the DNA group (RA2 denotes the square of RA, and so on)

       I     Rx    Ry    Rz    RA    RC    RG    RT    RA2   RC2   RG2   RT2
I      I     Rx    Ry    Rz    RA    RC    RG    RT    RA2   RC2   RG2   RT2
Rx     Rx    I     Rz    Ry    RC    RA    RT    RG    RT2   RG2   RC2   RA2
Ry     Ry    Rz    I     Rx    RT    RG    RC    RA    RG2   RT2   RA2   RC2
Rz     Rz    Ry    Rx    I     RG    RT    RA    RC    RC2   RA2   RT2   RG2
RA     RA    RT    RG    RC    RA2   RG2   RT2   RC2   I     Rx    Rz    Ry
RC     RC    RG    RT    RA    RT2   RC2   RA2   RG2   Rx    I     Ry    Rz
RG     RG    RC    RA    RT    RC2   RT2   RG2   RA2   Rz    Ry    I     Rx
RT     RT    RA    RC    RG    RG2   RA2   RC2   RT2   Ry    Rz    Rx    I
RA2    RA2   RC2   RT2   RG2   I     Rz    Ry    Rx    RA    RT    RC    RG
RC2    RC2   RA2   RG2   RT2   Rz    I     Rx    Ry    RG    RC    RT    RA
RG2    RG2   RT2   RC2   RA2   Ry    Rx    I     Rz    RT    RA    RG    RC
RT2    RT2   RG2   RA2   RC2   Rx    Ry    Rz    I     RC    RG    RA    RT
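The multiplication table can be checked numerically. The following is a minimal sketch, not the author's procedure: it assumes the base coordinates A = (1, 1, 1), C = (-1, 1, -1), G = (1, -1, -1), T = (-1, -1, 1) that eqns (1) give for N = 1, and it builds each 120° rotation from Rodrigues' formula under a right-handed convention, which is an assumption chosen because it reproduces the base permutations of the paper. The helper names (`rot120`, `multiply`, `permutation`) are hypothetical.

```python
# Coordinates of the four bases, from eqns (1) with N = 1.
BASES = {"A": (1, 1, 1), "C": (-1, 1, -1), "G": (1, -1, -1), "T": (-1, -1, 1)}


def matmul(p, q):
    """Product of two 3x3 matrices stored as tuples of row tuples."""
    return tuple(
        tuple(sum(p[i][k] * q[k][j] for k in range(3)) for j in range(3))
        for i in range(3)
    )


def apply(m, v):
    """Matrix m applied to the column vector v."""
    return tuple(sum(m[i][k] * v[k] for k in range(3)) for i in range(3))


def rot120(v):
    """120-degree right-handed rotation about the axis through vertex v.

    For |v|^2 = 3, Rodrigues' formula reduces to (v v^T + [v]_x - I) / 2,
    whose numerators are always even for vertices with entries +1 or -1.
    """
    a, b, c = v
    cross = ((0, -c, b), (c, 0, -a), (-b, a, 0))
    return tuple(
        tuple((v[i] * v[j] + cross[i][j] - (i == j)) // 2 for j in range(3))
        for i in range(3)
    )


def diag(a, b, c):
    return ((a, 0, 0), (0, b, 0), (0, 0, c))


# The 12 elements: identity, three 180-degree rotations, and for every
# base the 120-degree rotation and its square (the 240-degree rotation).
GROUP = {"I": diag(1, 1, 1), "Rx": diag(1, -1, -1),
         "Ry": diag(-1, 1, -1), "Rz": diag(-1, -1, 1)}
for base, vertex in BASES.items():
    r = rot120(vertex)
    GROUP["R" + base] = r
    GROUP["R" + base + "2"] = matmul(r, r)

NAME_OF = {m: name for name, m in GROUP.items()}


def multiply(row, col):
    """Name of the matrix product GROUP[row] * GROUP[col]."""
    return NAME_OF[matmul(GROUP[row], GROUP[col])]


def permutation(name):
    """Images of A, C, G, T under the named element, as a 4-letter string."""
    base_of = {v: b for b, v in BASES.items()}
    return "".join(base_of[apply(GROUP[name], BASES[b])] for b in "ACGT")


if __name__ == "__main__":
    names = list(GROUP)
    # Closure: every product is again one of the 12 named matrices.
    assert all(matmul(GROUP[p], GROUP[q]) in NAME_OF for p in names for q in names)
    print(multiply("Rx", "RA"))  # RC
    print(permutation("Rz"))     # TGCA
```

Under these assumptions the entry in row P and column Q of Table 1 coincides with the matrix product PQ; for example, `multiply("RA", "RC")` gives `RG2`, as in the table.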

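follows

$$I = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad R_x = \begin{pmatrix} 1 & 0 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & -1 \end{pmatrix};$$

$$R_y = \begin{pmatrix} -1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & -1 \end{pmatrix}, \quad R_z = \begin{pmatrix} -1 & 0 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & 1 \end{pmatrix};$$

$$R_A = \begin{pmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}, \quad R_C = \begin{pmatrix} 0 & 0 & 1 \\ -1 & 0 & 0 \\ 0 & -1 & 0 \end{pmatrix};$$

$$R_G = \begin{pmatrix} 0 & 0 & -1 \\ -1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}, \quad R_T = \begin{pmatrix} 0 & 0 & -1 \\ 1 & 0 & 0 \\ 0 & -1 & 0 \end{pmatrix};$$

$$R_A^2 = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix}, \quad R_C^2 = \begin{pmatrix} 0 & -1 & 0 \\ 0 & 0 & -1 \\ 1 & 0 & 0 \end{pmatrix};$$

$$R_G^2 = \begin{pmatrix} 0 & -1 & 0 \\ 0 & 0 & 1 \\ -1 & 0 & 0 \end{pmatrix}, \quad R_T^2 = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & -1 \\ -1 & 0 & 0 \end{pmatrix}. \tag{6}$$

The multiplication table for the DNA group can be obtained simply from the products of the matrices in eqns (6). The result is listed in Table 1.

We turn to study the transform of the DNA bases under each of the 12 operational elements of the DNA group. Consider a DNA sequence with only one base, i.e., N = 1. In this case we have four possible DNA sequences, i.e., A, C, G and T. By using eqns (1), we obtain their matrix representation in the form of eqn (2), i.e.,

$$(A) = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}, \quad (C) = \begin{pmatrix} -1 \\ 1 \\ -1 \end{pmatrix}, \quad (G) = \begin{pmatrix} 1 \\ -1 \\ -1 \end{pmatrix}, \quad (T) = \begin{pmatrix} -1 \\ -1 \\ 1 \end{pmatrix}. \tag{7}$$

Considering the rotation $R_x$ first, we find

$$R_x (A) = \begin{pmatrix} 1 & 0 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & -1 \end{pmatrix} \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} = \begin{pmatrix} 1 \\ -1 \\ -1 \end{pmatrix} = (G), \tag{8}$$

and similarly,

$$R_x (C) = (T), \quad R_x (G) = (A), \quad R_x (T) = (C). \tag{9}$$

So, the transforms of the DNA bases under $R_x$, $R_y$ and $R_z$ may be expressed as follows

$$R_x: A \leftrightarrow G, \quad C \leftrightarrow T, \tag{10}$$

$$R_y: A \leftrightarrow C, \quad G \leftrightarrow T, \tag{11}$$

$$R_z: A \leftrightarrow T, \quad G \leftrightarrow C. \tag{12}$$

Usually, the transform $R_x$ is called a transition, while $R_y$ and $R_z$ are called transversions. In this paper, $R_z$ is also called the complementary transform. The four elements $I$, $R_x$, $R_y$ and $R_z$ form an invariant subgroup of the DNA group, as can be seen in Table 1. It is isomorphic to the Klein 4-group ($K_4$ group). For the other operations we find

$$R_A: A \to A, \quad C \to T, \quad T \to G, \quad G \to C, \quad \text{etc.} \tag{13}$$

The transforms of the bases under the 12 operational elements of the DNA group are listed in Table 2; they can also be obtained by rotating the RT in Fig. 2. For example, a 180° rotation of the RT around the z-axis yields the transform A ↔ T and G ↔ C, as can be seen clearly in Fig. 2. The $K_4$ group and its two cosets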
exhaust the DNA group. So, the elements of the DNA group are divided into four classes, i.e., ($I$); ($R_x$, $R_y$, $R_z$); ($R_A$, $R_C$, $R_G$, $R_T$); and ($R_A^2$, $R_C^2$, $R_G^2$, $R_T^2$).

Table 2
Transforms of the bases under the operations of the DNA group

       A    C    G    T
I      A    C    G    T
Rx     G    T    A    C
Ry     C    A    T    G
Rz     T    G    C    A
RA     A    T    C    G
RC     G    C    T    A
RG     T    A    G    C
RT     C    G    A    T
RA2    A    G    T    C
RC2    T    C    A    G
RG2    C    T    G    A
RT2    G    A    C    T

The three components of the Z curve, i.e., $x_n$, $y_n$ and $z_n$, represent the three independent distributions which completely describe the DNA sequence. The transforms of $x_n$, $y_n$ and $z_n$ under the 12 elements of the DNA group are shown in Table 3. It is seen that, under the 12 operations of the DNA group, the three distributions are invariant except that their signs may change or their positions may be exchanged. Note that all of $x_n$, $y_n$ and $z_n$ are composed of sums of the numbers of bases. This implies that the sums of the numbers of bases, rather than the numbers of the individual bases, have more biological meaning. It has long been known that the G + C content, rather than the content of individual G or C, is a meaningful biological quantity widely used by biologists to describe a DNA sequence. In our theory, the sums A + G, C + T, A + C, G + T, A + T and G + C appear naturally.

Table 3
Transforms of x, y and z under the operations of the DNA group (signs follow from the matrices in eqns (6))

       x     y     z
I      x     y     z
Rx     x    -y    -z
Ry    -x     y    -z
Rz    -x    -y     z
RA     z     x     y
RC     z    -x    -y
RG    -z    -x     y
RT    -z     x    -y
RA2    y     z     x
RC2   -y    -z     x
RG2   -y     z    -x
RT2    y    -z    -x

Any DNA sequence has an inverted sequence corresponding to it. For example, if a sequence is ACGT, the inverted sequence is TGCA. The operation that inverts a sequence is denoted by $R_r$. The operation $R_r$ can be described as follows. Suppose that the Z curve representing a DNA sequence with N bases is described by $x_n$, $y_n$ and $z_n$. Then the Z curve representing the inverted sequence can be described by $\bar{x}_n$, $\bar{y}_n$ and $\bar{z}_n$, where

$$\begin{aligned} \bar{x}_n &= x_N - x_{N-n}, \\ \bar{y}_n &= y_N - y_{N-n}, \\ \bar{z}_n &= z_N - z_{N-n}, \end{aligned} \qquad n = 0, 1, 2, \ldots, N, \tag{14}$$

with the convention $x_0 = y_0 = z_0 = 0$. Therefore, if a Z curve represents a DNA sequence read in the direction 5' to 3', then the Z curve representing the complementary strand in the same direction (5' to 3') can be obtained by applying the joint operation $R_r R_z$ to the original Z curve. In this case, in addition to the $R_z$ operation, an operation $R_r$ is still needed. This is due to the fact that the DNA double helix consists of two anti-parallel strands.

What do we mean by the symmetry of DNA sequences? Generally speaking, symmetry means that some properties of things remain invariant under some transform. For example, an object and its mirror image are said to be symmetrical because the object remains invariant under the reflection transform. Similarly, in the case of a DNA sequence, symmetry means that some properties of the sequence remain invariant when all or part of its bases undergo the transforms described by the DNA group. There are two kinds of symmetry of DNA sequences, i.e., the global and the local symmetry. If a property of a sequence remains invariant when all of the bases in the sequence undergo some transform, the sequence is said to be of global symmetry with respect to that transform. Otherwise, if a property of a sequence remains invariant only when one base or a part of the bases in the sequence undergoes some transform, the sequence is said to be of local symmetry with respect to that transform.

One example of the global symmetry of DNA sequences that we have found is the symmetry of the Z curve. The Z curve remains invariant under the 12 operations of the DNA group. To illustrate this, suppose that there are two DNA sequences, both having N bases. They are represented by the $Z_1$ curve ($P_1, P_2, \ldots, P_N$) and the $Z_2$ curve ($Q_1, Q_2, \ldots, Q_N$), respectively, where $P_i$ and $Q_i$ are the nodes of the $Z_1$ and $Z_2$
where Pi , and Qi are the nodes of the Z1 and Z2
curves. The two Z curves are said to be of the same structure if, for every n = 1, 2, ..., N, we have $|OP_n| = |OQ_n|$, where $|OP_n|$ ($|OQ_n|$) is the distance between the origin and the point $P_n$ ($Q_n$). Any Z curve representing a DNA sequence is invariant under the operations of the DNA group, in the sense that the Z curve and its transformed Z curves have the same structure. This can be proved as follows. Using eqns (1), we find

$$r_n \equiv |OP_n| = (x_n^2 + y_n^2 + z_n^2)^{1/2} = 2\,(A_n^2 + C_n^2 + G_n^2 + T_n^2 - n^2/4)^{1/2}, \qquad n = 1, 2, \ldots, N. \tag{15}$$

The right-hand side is unchanged by any permutation of the four bases. Therefore, $r_n$ is invariant under the operations of the DNA group. Generally speaking, there exist 12 DNA sequences of the same length which are represented by the same Z curve, in the sense that the structures of their Z curves are the same. For example, the DNA sequence ACGT and its transformed sequences under the operations of the DNA group are GTAC, CATG, TGCA, ATCG, GCTA, TAGC, CGAT, AGTC, TCAG, CTGA and GACT, respectively. All 12 sequences have the same Z curve in the sense that the corresponding 12 Z curves have the same structure. In other words, starting from the Z curve of the sequence ACGT, the Z curves of the remaining 11 sequences can be obtained by the rotational operations described by the DNA group.

Another invariant quantity of the DNA group is the information entropy H of the DNA sequence, which is defined by

$$H = -\sum_{i=1}^{4} p_i \log_2 p_i, \tag{16}$$

where $p_i$ (i = 1, 2, 3, 4) are the occurrence frequencies of bases A, C, G and T in the sequence, respectively. The concept of the isentropic surface has been introduced based on this symmetry and used to study molecular evolution (Zhang & Zhan, 1994; Wang & Zhang, 1996).

The Fourier transform of $x_n$, $y_n$ and $z_n$ will be an important tool for studying the three distributions. Suppose that X(k), Y(k) and Z(k) are the discrete Fourier transforms of $x_n$, $y_n$ and $z_n$, respectively, where k = 1, 2, ..., N. Then the power spectrum is defined as

$$P(k) = |X(k)|^2 + |Y(k)|^2 + |Z(k)|^2, \tag{17}$$

where $|X(k)|$ means the modulus of X(k), etc. It is easy to see from Table 3 that, for any given value of k, P(k) is invariant under the operations of the DNA group.

In summary, the Z curve, the information entropy H and the Fourier power spectrum P(k) defined by eqn (17) have global symmetry with respect to the 12 operations of the DNA group. Among the 12 operations of the DNA group, the complementary transform $R_z$ has a direct biological meaning. It means an exchange of bases at each site between the strand and its complementary strand. Therefore, the Z curve, the information entropy H and the Fourier power spectrum P(k) are invariant for both strands.

The use of a single symbol to designate a variety of possible bases at a single position indicates that there is a local symmetry of the DNA sequence, according to our nomenclature. For example, isoleucine is coded by the codons 5'-ATH-3', where H means not-G. This means that the coding property (coding for isoleucine) is invariant when the base at the third position undergoes the $R_G$ or $R_G^2$ operation. In fact, all coding sequences exhibit local symmetry. Another example of local symmetry concerns the DNA sequences recognized by some enzymes. For example, consider the recognition sequences of the restriction enzymes. Most type II restriction enzymes recognize a simple unique DNA sequence. However, some restriction enzymes recognize a series of derivative sequences, where two or more bases may be present at a particular position in the recognition sequence. For example, SduI recognizes the sequence 5'-G (A or G or T) GC (A or C or T) C-3' (Cornish-Bowden, 1985). The local symmetry exists at the second and fifth positions in the sequence. It would be difficult to find common biochemical features in the set of bases (A, G, T) or (A, C, T). In the case of the set (A, G, T), the bases are neither all purines nor all pyrimidines; they are neither all bases of amino group nor all of keto group; they are neither all bases of strong hydrogen bonds nor all of weak hydrogen bonds. The phenomenon is nevertheless accommodated naturally by our theory. The recognition feature is invariant under the operations $R_C$ and $R_C^2$ at the second position and $R_G$ and $R_G^2$ at the fifth position. The local symmetry described by the DNA group exists widely in DNA sequences. Besides the recognition sequences of the restriction enzymes, local symmetry also exists in the sequences recognized by some other enzymes or proteins. Generally speaking, there must exist a local symmetry in any DNA sequence which includes at least one of the following 11 symbols: R, Y, M, K, S, W, H, B, V, D and N, where N means any base. We will discuss this problem in more detail later.
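The three global invariants of this section can be illustrated numerically. The sketch below is an illustration under stated assumptions rather than the author's implementation: it computes the Z curve nodes from eqns (1), the squared node distances of eqn (15), the entropy of eqn (16) and the power spectrum of eqn (17), and compares them across the 12 base substitutions listed in Table 2. The paper does not fix a normalization or index origin for the discrete Fourier transform, so the plain unnormalized DFT used here is an assumption, and all function names are hypothetical.

```python
import cmath
import math

# Rows of Table 2: images of A, C, G, T under the 12 group elements.
TABLE2_ROWS = ["ACGT", "GTAC", "CATG", "TGCA", "ATCG", "GCTA",
               "TAGC", "CGAT", "AGTC", "TCAG", "CTGA", "GACT"]


def z_curve(seq):
    """Nodes (x_n, y_n, z_n), n = 1..N, of the Z curve defined by eqns (1)."""
    counts = dict.fromkeys("ACGT", 0)
    nodes = []
    for n, base in enumerate(seq, start=1):
        counts[base] += 1
        a, c, g, t = (counts[b] for b in "ACGT")
        nodes.append((2 * (a + g) - n, 2 * (a + c) - n, 2 * (a + t) - n))
    return nodes


def radii_squared(seq):
    """Squared distances |OP_n|^2, kept as integers for exact comparison."""
    return [x * x + y * y + z * z for x, y, z in z_curve(seq)]


def entropy(seq):
    """Information entropy H of eqn (16), in bits."""
    freqs = [seq.count(b) / len(seq) for b in "ACGT"]
    return -sum(p * math.log2(p) for p in freqs if p > 0)


def power_spectrum(seq):
    """P(k) of eqn (17) for k = 1..N, from a direct O(N^2) DFT.

    Nodes are indexed from 0 here; shifting the index origin only changes
    the phase of each coefficient, not its modulus.
    """
    nodes = z_curve(seq)
    size = len(nodes)
    spectrum = []
    for k in range(1, size + 1):
        total = 0.0
        for axis in range(3):
            coeff = sum(
                node[axis] * cmath.exp(-2j * cmath.pi * k * n / size)
                for n, node in enumerate(nodes)
            )
            total += abs(coeff) ** 2
        spectrum.append(total)
    return spectrum


def transform(seq, images):
    """Substitute bases using the images of A, C, G, T (one row of Table 2)."""
    return seq.translate(str.maketrans("ACGT", images))


if __name__ == "__main__":
    print(z_curve("ACGT"))  # [(1, 1, 1), (0, 2, 0), (1, 1, -1), (0, 0, 0)]
    seq = "TGTGTGAATTGTACGTACAATTCACACA"
    same = all(radii_squared(transform(seq, row)) == radii_squared(seq)
               for row in TABLE2_ROWS)
    print(same)  # True
```

The distances are compared as exact integers, while H and P(k) are floating-point quantities and are best compared with a tolerance such as `math.isclose`.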
Table 4
Transforms of 10 codes under the operations of the DNA group

       R    Y    K    M    S    W    B    D    H    V
I      R    Y    K    M    S    W    B    D    H    V
Rx     R    Y    M    K    W    S    H    V    B    D
Ry     Y    R    K    M    W    S    D    B    V    H
Rz     Y    R    M    K    S    W    V    H    D    B
RA     M    K    S    W    Y    R    B    V    D    H
RC     K    M    W    S    Y    R    H    D    V    B
RG     K    M    S    W    R    Y    V    B    H    D
RT     M    K    W    S    R    Y    D    H    B    V
RA2    W    S    Y    R    K    M    B    H    V    D
RC2    W    S    R    Y    M    K    V    D    B    H
RG2    S    W    R    Y    K    M    D    V    H    B
RT2    S    W    Y    R    M    K    H    B    D    V
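The rows of Table 4 follow from the base substitutions of Table 2 once each code is treated as a set of bases. The sketch below is a minimal illustration under that assumption: the set assigned to each code follows Recommendation 1984, while the function names are hypothetical. It also tests whether applying $R_z$ (complement) to a sequence of codes gives the same result as $R_r$ (reversal), which is the condition $R_z = R_r$ that the paper uses to define a self-complementary sequence.

```python
# Sets of bases denoted by the 14 single-letter codes (Cornish-Bowden, 1985).
CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}
CODE_OF = {frozenset(bases): code for code, bases in CODES.items()}


def transform_code(code, images):
    """Image of one code under a substitution given as the images of A, C, G, T."""
    table = dict(zip("ACGT", images))
    return CODE_OF[frozenset(table[b] for b in CODES[code])]


def transform_sequence(seq, images):
    """Apply the substitution to every code in the sequence."""
    return "".join(transform_code(code, images) for code in seq)


def is_self_complementary(seq):
    """True when R_z (complement, Table 2 row TGCA) equals R_r (reversal)."""
    return transform_sequence(seq, "TGCA") == seq[::-1]


if __name__ == "__main__":
    # The R_z row of Table 4.
    print(transform_sequence("RYKMSWBDHV", "TGCA"))  # YRMKSWVHDB
    # The SduI recognition sequence quoted in the paper.
    print(is_self_complementary("GDGCHC"))  # True
```

Passing any other row of Table 2 as `images`, for example `"ATCG"` for $R_A$, reproduces the corresponding row of Table 4.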

3. Applications

In Table 1 of Recommendation 1984 by the NC-IUB (Cornish-Bowden, 1985), 14 single-letter codes representing incompletely specified bases in nucleic acid sequences were recommended. They are A (Adenine), C (Cytosine), G (Guanine), T (Thymine), R (puRine), Y (pYrimidine), M (aMino), K (Keto), S (Strong interaction), W (Weak interaction), H (not-G, H follows G), B (not-A, B follows A), V (not-T or not-U, V follows U), and D (not-C, D follows C). As discussed in Section 2, the 14 codes can be represented by the four vertices, six edges and four triangular faces of a regular tetrahedron. There is an obvious equality: 4 + 6 + 4 = 14. The tetrahedral representation of the 14 codes is useful for memorizing these symbols and for comprehending the relations among them.

The transforms of these 14 codes under the 12 operations of the DNA group are listed in Table 4. Of the 12 operations, the complementary element $R_z$ has an apparent biological meaning. Under $R_z$, the symbols are transformed into those of the complementary strand, respectively. Since the z-axis in Fig. 2 is perpendicular to the edges of weak (W) and strong (S) hydrogen bonds, the symbols W and S are invariant under the rotation around the z-axis. The transforms of the other symbols under $R_z$ can be obtained easily by rotating the tetrahedron around the z-axis. The details of the transforms of the 14 codes under $R_z$ are shown in Fig. 3. Our work gives both Table 1 and Table 2 of Recommendation 1984 by the NC-IUB (Cornish-Bowden, 1985) an intuitive explanation and a simple mnemonic.

Fig. 3. Intuitive display of the transforms of 10 codes (R, Y, M, K, S, W, D, H, B, V) under the $R_z$ (complementary) operation.

Since the discovery of the first type II restriction endonuclease, the number of type II restriction endonucleases available in a purified form has been increasing at a remarkable speed. The type II enzymes are those that both recognize and cleave the DNA molecule at a specific sequence. It is well known that most restriction endonucleases recognize simple unique DNA sequences. However, a growing class of endonucleases includes those that recognize a series of derivative sequences, where two or more bases may be present at a particular position in the recognition sequence. As mentioned previously, such a sequence displays a local symmetry. Since the symmetry of DNA sequences, according to our theory, should be described by the DNA group, there should be some rules that govern the multiple recognition sequences. For example, the enzyme SduI recognizes the sequence 5'-GDGCHC-3' (Perbal, 1990). Performing the operation $R_z$ on this sequence and considering Table 4 or Fig. 3, we obtain the transformed sequence 5'-CHCGDG-3'. Performing the reflection operation $R_r$ on the recognition sequence, we find that the reversed sequence is 5'-CHCGDG-3'. The two sequences are identical, or simply, $R_z = R_r$, indicating that the two operations are equivalent in this case. A multiple DNA sequence is defined as a DNA sequence which contains at least one of the following 10 symbols: R, Y, M, K, S, W, H, B, V and D. A multiple DNA sequence with
the property that $R_z = R_r$ is called a self-complementary sequence. Looking at the list of restriction endonucleases (Perbal, 1990), we find that all the multiple recognition sequences are self-complementary. So, the first empirical rule that describes the multiple recognition sequences may be summarized as follows:

Rule 1: all the multiple recognition sequences of the type II restriction endonucleases should be self-complementary.

Note from Fig. 3 that the two symbols in any one of the pairs (R, Y), (M, K), (D, H) and (B, V) are complementary to each other, while each of S and W is complementary to itself. Therefore, as a complement to Rule 1, we have

Rule 2: the codes (R, Y), (M, K), (D, H) and (B, V) occurring in the multiple recognition sequences must appear in pairs, while the codes W and S in the recognition sequences may appear either in pairs or singly.

So far as we know, there is no exception to these two rules among the multiple recognition sequences now available (Perbal, 1990). Furthermore, we can predict possible recognition sequences not available at present. For example, the sequences SGGCCS, GBGCVC, CCSSGG, RGGSCCY etc. are possible substrates of restriction endonucleases, based on the two rules proposed above. However, whether enzymes exist that recognize the above multiple sequences is still an unsolved problem. The two rules are only necessary, not sufficient, conditions for multiple sequences to be recognized by restriction endonucleases.

Fig. 4. The Z curve (the gray one) for the DNA sequence 5'-TGTGTGAATTGTACGTACAATTCACACA-3', which has 28 bases. Performing the $R_r$ operation on the gray curve, as described by eqn (14), we obtain the Z curve corresponding to the inverted sequence, which is represented by the dark black one. Both curves are smoothed by the method of spline function (Zhang & Zhang, 1994). Note that the two curves are symmetrical with respect to the z-axis, indicating that the sequence studied is an inverted repeat. For more details, see the text.

Computer graphics of the Z curves make it possible to detect some symmetrical patterns of DNA sequences visually. Let us take the inverted repeat sequence as an example, which exhibits an important symmetry of DNA sequences. Inverted repeat sequences usually play a key role in the recognition process between DNA molecules and proteins or enzymes. The format of the Z curve provides a method for finding such a symmetry visually and quickly. One kind of inverted repeat sequence is the so-called self-complementary sequence mentioned above, having the form 5'-XYZZ'Y'X'-3', where X, Y and Z are bases or sets of bases and X', Y' and Z' are complementary to X, Y and Z, respectively. Such sequences may form palindrome structures. Performing the $R_z$ operation on the above sequence, we obtain the sequence 3'-X'Y'Z'ZYX-5'. Performing the $R_r$ operation on the obtained sequence, we have the sequence 5'-XYZZ'Y'X'-3', which is identical to the inverted repeat sequence. Therefore, the condition for a sequence to be an inverted repeat sequence may be expressed by a simple formula: $R_r R_z = R_z R_r = I$, where $I$ represents the identity operation. Since $R_z R_z = I$ (Table 1), the above formula may be rewritten as $R_r = R_z$. For example, 5'-CCGCGCGCGG-3' is a typical inverted repeat sequence. A more complicated example is 5'-TGTGTGAATTGTACGTACAATTCACACA-3', which has 28 bases. The Z curve for this sequence is shown by the gray curve in Fig. 4. Performing the $R_r$ operation on the gray curve, as described by eqn (14), we obtain the Z curve corresponding to the inverted sequence, which is represented by the dark black curve in Fig. 4. Both curves are smoothed by the method of spline function (Zhang & Zhang, 1994). Rotating the gray curve by 180° around the z-axis, we find that the rotated Z curve coincides with the dark black curve. In other words, the two curves are symmetrical with respect to the z-axis, indicating that the sequence studied is an inverted repeat sequence. Thus, we find the inverted repeat symmetry of the DNA sequence by a purely geometric approach. This method provides a simple and quick way to find inverted repeat sequences visually. Some other symmetrical patterns of DNA sequences can be detected visually by any trained researcher. It would be of considerable advantage to analyse a great amount of DNA sequences quickly
305

and efficiently by this approach because the human brain and eyes are generally more
sensitive to an intuitive graph than to an abstract symbolic sequence.

4. Discussion and Summary

As a language (Trifonov & Brendel, 1986), the DNA sequence may have a number of distinct
forms of expression, although all of them are equivalent. The traditional form of the DNA
sequence is a letter string, consisting of four kinds of letters A, C, G and T. This form is
widely used in biochemistry and molecular biology. It is necessary for the storage of
information and some statistical analysis. Therefore, as a main form, it will continue to be
used. However, in the letter-sequence form it is extremely difficult to recognize special
patterns intuitively. For a long sequence, the form is too abstract to impress the
observers. To overcome this drawback of the letter sequence, geometrical representations of
the DNA sequence have been proposed by many authors (Hamori & Ruskin, 1983; Lathe & Findlay,
1984; Gates, 1985; Pickover, 1992; Zhang & Zhang, 1994). Among the geometrical
representations of the DNA sequences, the Z curve proposed by Zhang & Zhang (1994) is the
most general one. All of the remaining geometrical forms mentioned above are special cases
of the Z curve (Zhang & Zhang, 1994).

As we have seen in this paper, the format of the Z curve is based on the symmetry of the DNA
sequence. The symmetry is described by a cube and its inscribed regular tetrahedron. We
believe that there must be some inherent connection between the symmetry of the DNA sequence
and that of the regular tetrahedron. In the present work, the connection between the
symmetry of the DNA sequence and the rotational symmetry of the regular tetrahedron has been
established. The latter is described by the tetrahedral group. Therefore, the symmetry of
the DNA sequence may be described by the DNA group, which is isomorphic to the tetrahedral
group. The matrix representation of the DNA group has been worked out. Each rotational
operation of the tetrahedron is represented by a unique 3 × 3 matrix. At the same time, any
DNA sequence can be represented by a 3 × N matrix, based on the format of the Z curve, where
N is the length of the sequence. The transform of the bases under an operation of the DNA
group corresponds simply to a matrix multiplication. Using this mathematical technique, the
transforms listed in Tables 2–4 are obtained. Such a mathematical format is not only
elegant, but also very useful for the further analysis of DNA sequences.

The concept of the three distributions in a DNA sequence presented in this paper is an
important and novel one. So far as we know, no similar concept has been reported in the
literature. We can see from Table 3 that the three distributions are invariant in some sense
under the 12 operations of the DNA group. The three distributions are inherent for the DNA
sequence. Any DNA sequence is uniquely described by the three distributions. Therefore, the
study of the DNA sequence may be carried out by the study of the three distributions. This
new approach for studying the DNA sequence seems to be very promising.

In conclusion, the study presented here and those published previously (Zhang & Zhang, 1994)
have opened a new research area in which geometrical methods and concepts are widely
introduced into the study of DNA sequences. The symmetrical theory of DNA sequences is based
on this geometrical approach. Our theory reflects the symmetry of some realistic DNA
sequences in a unified form. This theory provides a necessary complement to the great amount
of literature on the study of DNA sequences available now.

The present study was supported in part by the grant 39570187 from the China Natural Science
Foundation.

REFERENCES

Cornish-Bowden, A. (1985). Nomenclature for incompletely specified bases in nucleic acid sequences: Recommendation 1984. Nucl. Acids. Res. 13, 3021–3030.
Gates, M. A. (1985). Simpler DNA sequence representations. Nature 316, 219.
Gilbert, W. (1991). Towards a Paradigm shift in biology. Nature 349, 99.
Hamori, E. & Ruskin, J. (1983). H curves, A novel method of representation of nucleotide series especially suitable for long DNA sequences. J. Biol. Chem. 258, 1318–1327.
Lathe, R. & Findlay, R. (1985). Novel DNA sequence representation. Nature 314, 585–586.
Magarshak, Y. & Benham, C. J. (1992). An Algebraic Representation of RNA Secondary Structures. J. Biomol. Str. & Dyn. 10, 465–488.
Perbal, B. (1990). A Practical Guide to Molecular Cloning (2nd Edn). Chapter 5. New York: John Wiley & Sons.
Pickover, C. A. (1992). DNA and Protein tetragrams: Biological sequences as tetrahedral movements. J. Mol. Graphics 10, 2–6.
Trainor, L. E. H., Rowe, G. W. & Szabo, V. L. (1984). A Tetrahedral Representation of Poly-Codon Sequences and a Possible Origin of Codon Degeneracy. J. theor. Biol. 108, 459–468.
Trifonov, E. N. & Brendel, V. (1986). GNOMIC, A Dictionary of Genetic Codes. Rehovot-Philadelphia: Balaban Publishers.
W, J.-H. & Zhang, C.-T. (1996). Study on the Isentropic Equations of Nucleotide Sequences and Their Application. J. theor. Biol. 181, 197–202.
Zhang, C.-T. & Z, Y. (1994). Analysis on the Distribution of Bases in 1487 Human Protein Coding Sequences. J. theor. Biol. 167, 161–165.
Zhang, C.-T. & Zhang, R. (1991a). Analysis of distribution of bases in the coding sequences by a diagrammatic technique. Nucl. Acids Res. 19, 6313–6317.
Zhang, C.-T. & Zhang, R. (1991b). Diagrammatic representation of the distribution of DNA bases and its applications. Int. J. Biol. Macromol. 13, 45–49.
Zhang, R. & Zhang, C.-T. (1994). Z curves, An Intuitive Tool for Visualizing and Analyzing the DNA sequences. J. Biomol. Stru. Dyn. 11, 767–782.
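
The statement in Section 4 that a base transform under the DNA group corresponds to a matrix multiplication can be illustrated with a short Python sketch. It is not the author's implementation. The vertex coordinates follow the usual Z-curve sign convention (purine/pyrimidine on x, amino/keto on y, weak/strong hydrogen bond on z), which is assumed here rather than taken from the tables of this paper, and all names (`sequence_to_matrix`, `ROT_Z_180`, `ROT_A_120`, and so on) are hypothetical. The 3 × N matrix is taken to hold one tetrahedron vertex per base, with the Z curve as its running sum along the sequence.

```python
import numpy as np

# Assumed tetrahedron vertices (usual Z-curve sign convention):
# x: purine (+) / pyrimidine (-), y: amino (+) / keto (-), z: weak (+) / strong (-) H bond
BASE_TO_VERTEX = {
    "A": (1, 1, 1),
    "C": (-1, 1, -1),
    "G": (1, -1, -1),
    "T": (-1, -1, 1),
}
VERTEX_TO_BASE = {vertex: base for base, vertex in BASE_TO_VERTEX.items()}


def sequence_to_matrix(seq):
    """Return the 3 x N integer matrix whose n-th column is the vertex of base n."""
    return np.array([BASE_TO_VERTEX[b] for b in seq.upper()], dtype=int).T


def matrix_to_sequence(m):
    """Inverse of sequence_to_matrix: read each column back as a base letter."""
    return "".join(VERTEX_TO_BASE[tuple(int(v) for v in col)] for col in m.T)


def z_curve(seq):
    """Running sums of the three rows, i.e. the points (x_n, y_n, z_n)."""
    return np.cumsum(sequence_to_matrix(seq), axis=1)


# Rotation by 180 degrees about the z axis: swaps A <-> T and G <-> C.
ROT_Z_180 = np.diag([-1, -1, 1])
# Rotation by 120 degrees about the cube diagonal through the A vertex:
# (x, y, z) -> (z, x, y), which fixes A and cycles C -> T -> G -> C.
ROT_A_120 = np.array([[0, 0, 1], [1, 0, 0], [0, 1, 0]])

seq = "ATGCGC"
m = sequence_to_matrix(seq)

complemented = matrix_to_sequence(ROT_Z_180 @ m)
print(complemented)  # TACGCG

# The rotation acts on the cumulative curve in the same way as on the bases.
print(np.array_equal(z_curve(complemented), ROT_Z_180 @ z_curve(seq)))  # True

print(matrix_to_sequence(ROT_A_120 @ sequence_to_matrix("ACGT")))  # ATCG
```

Under the 180° rotation in this sketch, the x and y components change sign while the z component is unchanged, which is one concrete reading of the remark that the three distributions are invariant in some sense under the group operations.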