IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 13, NO. 8, AUGUST 2005
I. INTRODUCTION
In October 2000, the National Institute of Standards and Technology selected the Rijndael cipher for the Advanced Encryption Standard (AES) [1]. Early hardware implementations of the cipher [2], [3] were straightforward and required relatively large circuit area. Recent research has produced compact architectures for the AES S-boxes based on transforming the original field $\mathrm{GF}(2^8)$ into the composite field $\mathrm{GF}((2^4)^2)$ [4]. This approach is particularly useful for ASIC implementations [5], [6].
Cryptographic applications are typically based on application-specific integrated circuit (ASIC) technology, as it is believed to provide a sufficient level of security. However, recent advances in attacks on implementations show that this preconceived idea is not necessarily valid. Static implementations based on ASICs are inherently impossible to update or upgrade in response to new security threats. On the other hand, field-programmable gate array (FPGA) technology has much greater potential for providing a higher security level because of its capability for dynamic reconfiguration [7]. We believe that FPGA technology is an important future platform for cryptographic applications.
FPGA implementations typically use embedded memory blocks to implement the S-boxes [8]. This approach achieves the best balance between the use of embedded memory blocks and the use of the more versatile, and therefore more critical, reconfigurable logic. Consequently, in FPGAs a significant portion of the logic resources is consumed by the MixColumn and InvMixColumn implementations, and their area optimization is crucial in constrained environments.
Manuscript received January 21, 2004; revised December 11, 2004 and April 4, 2005. This work was supported in part by the French national program ACI Cryptologie (Project CR/02 2 0041) as a part of the project CryptArchi.
V. Fischer is with the Laboratoire Traitement du Signal et Instrumentation, UMR CNRS 5516, Université Jean Monnet, 42000 Saint-Etienne, France (e-mail: fischer@univ-st-etienne.fr).
M. Drutarovský is with the Department of Electronics and Multimedia Communications, Technical University of Kosice, 041 20 Kosice, Slovak Republic (e-mail: Milos.Drutarovsky@tuke.sk).
P. Chodowiec is with the Department of Electrical and Computer Engineering, George Mason University, Fairfax, VA 22030 USA (e-mail: pchodowi@ieee.org).
F. Gramain is with the Laboratoire d'Arithmétique et Algèbre, Faculté des Sciences, Université Jean Monnet, 42023 Saint-Etienne, France (e-mail: gramain@univ-st-etienne.fr).
Digital Object Identifier 10.1109/TVLSI.2005.853606
Most existing AES implementations address MixColumn and InvMixColumn separately. We found only a few publications demonstrating the potential for resource sharing between MixColumn and InvMixColumn [6], [9]–[12]. In this paper, we analyze the basic operations employed in MixColumn and InvMixColumn, uncovering several new possibilities for resource sharing at different levels.
II. MIXCOLUMN AND INVMIXCOLUMN OPERATIONS
MixColumn is one of the four round transformations used in AES encryption. InvMixColumn is the inverse of MixColumn and is used in decryption. Both functions apply transformations at the byte and word levels, which are explained in detail below.
A. Byte-Level Operations
Elementary operations are defined at the byte level. Each byte is considered as a polynomial (of degree at most 7) with coefficients in the Galois field GF(2). A byte $a(x)$ (or $a$ in simplified notation) is a sum $a(x) = \sum_{0 \le i \le 7} \alpha_i x^i$, where $\alpha_i \in \{0, 1\}$. In other words, bytes $a$ are elements of the Galois field $K = \mathrm{GF}(2^8)$ constructed as the quotient

$$K = \mathrm{GF}(2)[x] \,/\, (x^8 + x^4 + x^3 + x + 1). \tag{1}$$
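A minimal Python sketch of arithmetic in $K$ as defined by (1) may help: addition is XOR, and multiplication can be carried out by repeated multiplication by $\{02\}$ followed by reduction. The helper names `xtime` and `gf_mul` are our own, not taken from the paper; the two printed products are the worked examples given in FIPS 197 [1].

```python
AES_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1, the modulus in (1)


def xtime(a: int) -> int:
    """Multiply byte a by {02} (i.e., by x) in K."""
    a <<= 1
    if a & 0x100:      # degree 8 reached: reduce by XOR-ing the modulus
        a ^= AES_POLY
    return a


def gf_mul(a: int, b: int) -> int:
    """Multiply bytes a and b in K by shift-and-add."""
    result = 0
    while b:
        if b & 1:      # this bit of b selects the current multiple of a
            result ^= a
        a = xtime(a)   # next multiple: a * {02}
        b >>= 1
    return result


print(hex(gf_mul(0x57, 0x83)))  # prints 0xc1
print(hex(gf_mul(0x57, 0x13)))  # prints 0xfe
```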
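At the word level, a 32-bit column is represented as a polynomial $a(X) = a_3X^3 + a_2X^2 + a_1X + a_0$ whose coefficients $a_i$ are bytes of $K$. MixColumn multiplies $a(X)$ modulo $X^4 + 1$ by the fixed polynomial $c(X)$, and InvMixColumn multiplies it by $d(X)$, the inverse of $c(X)$ modulo $X^4 + 1$ [1]:

$$c(X) = \{03\}X^3 + \{01\}X^2 + \{01\}X + \{02\} \tag{2}$$

$$d(X) = \{0B\}X^3 + \{0D\}X^2 + \{09\}X + \{0E\} \tag{3}$$

where $\{\cdot\}$ denotes a byte written in hexadecimal.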
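The word-level products in (2) and (3) can be sketched as follows, assuming a column is stored as a Python list whose index i holds the coefficient of $X^i$ (this layout and all function names are our own choices). The reduction modulo $X^4 + 1$ simply wraps the exponent $i + j$ around modulo 4.

```python
AES_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1


def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in K = GF(2^8) (shift-and-add, reduced by AES_POLY)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= AES_POLY
        b >>= 1
    return result


# A word is a list [w0, w1, w2, w3]; index i holds the coefficient of X^i.
C = [0x02, 0x01, 0x01, 0x03]  # c(X) from (2)
D = [0x0E, 0x09, 0x0D, 0x0B]  # d(X) from (3)


def poly_mul(p: list[int], q: list[int]) -> list[int]:
    """Multiply two words modulo X^4 + 1, where X^(i+j) wraps to X^((i+j) mod 4)."""
    r = [0, 0, 0, 0]
    for i in range(4):
        for j in range(4):
            r[(i + j) % 4] ^= gf_mul(p[i], q[j])
    return r


def mix_column(a: list[int]) -> list[int]:
    return poly_mul(C, a)


def inv_mix_column(a: list[int]) -> list[int]:
    return poly_mul(D, a)


print([hex(v) for v in mix_column([0xDB, 0x13, 0x53, 0x45])])
# prints ['0x8e', '0x4d', '0xa1', '0xbc']
```

The printed bytes can be checked by hand from (2): for example, the first byte is $\{02\}\{DB\} + \{03\}\{13\} + \{53\} + \{45\} = \{8E\}$.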
From (2) and (3), we can see that the coefficients of $d(X)$ are more complex than those of $c(X)$: multiplication by $\{09\}$, $\{0B\}$, $\{0D\}$, or $\{0E\}$ needs up to three successive multiplications by $\{02\}$, whereas $\{01\}$, $\{02\}$, and $\{03\}$ need at most one. As a result, hardware implementing AES decryption is larger and slower than hardware implementing encryption. To reduce hardware cost, InvMixColumn can be decomposed so that it shares logic resources with MixColumn. Since both functions transform 32-bit words, we call this decomposition word-level resource sharing. There are two possible decompositions of InvMixColumn: parallel and serial. In addition to word-level sharing, resources can be
shared on a byte level and on a bit level as well. These approaches will
be discussed in Sections III-A and III-B.
A. Word-Level Resource Sharing
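1) Parallel InvMixColumn Decomposition: Parallel InvMixColumn decomposition was first proposed by J. Wolkerstorfer in [9]. It is based on the observation that $d(X)$ can be expressed using $c(X)$ in the following way:

$$d(X) = c(X) + e(X) \tag{4}$$

where the extension polynomial is

$$e(X) = \{08\}X^3 + \{0C\}X^2 + \{08\}X + \{0C\}. \tag{5}$$

Hence $d(X)a(X) = c(X)a(X) + e(X)a(X)$: the MixColumn result is reused, and only the product by $e(X)$ has to be added.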
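The parallel decomposition (4)-(5) can be checked numerically with the sketch below. It reuses the MixColumn product and adds the product by $e(X)$; the coefficient lists and function names are our own illustration, not the circuit of [9].

```python
AES_POLY = 0x11B


def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in K = GF(2^8)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= AES_POLY
        b >>= 1
    return result


def poly_mul(p: list[int], q: list[int]) -> list[int]:
    """Multiply two words (index i = coefficient of X^i) modulo X^4 + 1."""
    r = [0, 0, 0, 0]
    for i in range(4):
        for j in range(4):
            r[(i + j) % 4] ^= gf_mul(p[i], q[j])
    return r


C = [0x02, 0x01, 0x01, 0x03]  # c(X), eq. (2)
D = [0x0E, 0x09, 0x0D, 0x0B]  # d(X), eq. (3)
E = [0x0C, 0x08, 0x0C, 0x08]  # e(X), eq. (5)

# (4): d(X) = c(X) + e(X); coefficient-wise addition in K is XOR
assert [c ^ e for c, e in zip(C, E)] == D


def inv_mix_column_parallel(a: list[int]) -> list[int]:
    mix = poly_mul(C, a)  # MixColumn result, shareable with encryption
    ext = poly_mul(E, a)  # additional product by the extension polynomial
    return [m ^ x for m, x in zip(mix, ext)]


print([hex(v) for v in inv_mix_column_parallel([0x8E, 0x4D, 0xA1, 0xBC])])
# prints ['0xdb', '0x13', '0x53', '0x45']
```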
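The serial decomposition relies on the fact that $c^4(X) = 1$ modulo $X^4 + 1$ (see the Appendix), so that

$$d(X) = c^{-1}(X) = c^3(X) \tag{6}$$

$$c^3(X) = c(X) \cdot f(X) \tag{7}$$

where

$$f(X) = c^2(X) = \{04\}X^2 + \{05\}. \tag{8}$$

InvMixColumn can therefore be computed as MixColumn followed by multiplication by the sparse polynomial $f(X)$.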
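The identities (6)-(8) can be verified with the same word arithmetic; the sketch below, with our own naming, checks them and builds the serial InvMixColumn as MixColumn followed by multiplication by $f(X)$.

```python
AES_POLY = 0x11B


def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in K = GF(2^8)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= AES_POLY
        b >>= 1
    return result


def poly_mul(p: list[int], q: list[int]) -> list[int]:
    """Multiply two words (index i = coefficient of X^i) modulo X^4 + 1."""
    r = [0, 0, 0, 0]
    for i in range(4):
        for j in range(4):
            r[(i + j) % 4] ^= gf_mul(p[i], q[j])
    return r


C = [0x02, 0x01, 0x01, 0x03]  # c(X), eq. (2)
D = [0x0E, 0x09, 0x0D, 0x0B]  # d(X), eq. (3)
F = [0x05, 0x00, 0x04, 0x00]  # f(X) = {04}X^2 + {05}, eq. (8)

assert poly_mul(C, C) == F             # (8): f(X) = c^2(X)
assert poly_mul(C, F) == D             # (6), (7): d(X) = c(X) f(X)
assert poly_mul(F, F) == [1, 0, 0, 0]  # c^4(X) = 1


def inv_mix_column_serial(a: list[int]) -> list[int]:
    b = poly_mul(C, a)     # MixColumn stage (hardware shared with encryption)
    return poly_mul(F, b)  # post-multiplication by the sparse f(X)
```

Because $f(X)$ has only two nonzero coefficients, the second stage needs far fewer products than a direct multiplication by $d(X)$.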
For MixColumn, the first output byte can be written as

$$b_0 = \{02\}a_0 + \{03\}a_1 + a_2 + a_3 = (a_0 + a_1 + a_2 + a_3) + \{02\}(a_0 + a_1) + a_0. \tag{9}$$
The bold line in the dashed rectangles in Fig. 1 shows that the term $a_0 + a_1 + a_2 + a_3$ can be shared by all four bytes of the MixColumn function. Note that both addition and subtraction in $\mathrm{GF}(2^8)$ are realized by bit-wise addition modulo 2 (XOR). We use the symbol $\oplus$ for this operation in Fig. 1.
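Generalizing (9) by rotating the indices (which follows from the cyclic structure of (2)) gives $b_i = t + \{02\}(a_i + a_{i+1}) + a_i$ with the shared term $t = a_0 + a_1 + a_2 + a_3$ and indices taken modulo 4. The Python sketch below, with our own function names, computes $t$ once and compares the result with a direct evaluation of the MixColumn coefficients.

```python
AES_POLY = 0x11B


def xtime(a: int) -> int:
    """Multiply byte a by {02} in GF(2^8)."""
    a <<= 1
    return a ^ AES_POLY if a & 0x100 else a


def mix_column_shared(a: list[int]) -> list[int]:
    """MixColumn per (9): b_i = t + {02}(a_i + a_{i+1}) + a_i, indices mod 4."""
    t = a[0] ^ a[1] ^ a[2] ^ a[3]  # computed once, shared by all four bytes
    return [t ^ xtime(a[i] ^ a[(i + 1) % 4]) ^ a[i] for i in range(4)]


def mix_column_direct(a: list[int]) -> list[int]:
    """Reference: b_i = {02}a_i + {03}a_{i+1} + a_{i+2} + a_{i+3}."""
    out = []
    for i in range(4):
        x0, x1, x2, x3 = (a[(i + k) % 4] for k in range(4))
        out.append(xtime(x0) ^ xtime(x1) ^ x1 ^ x2 ^ x3)  # {03}x1 = {02}x1 + x1
    return out


print([hex(v) for v in mix_column_shared([0xDB, 0x13, 0x53, 0x45])])
# prints ['0x8e', '0x4d', '0xa1', '0xbc']
```

In hardware terms, the shared version needs one doubling per output byte instead of two.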
2) Byte-Level Resource Sharing in InvMixColumn Implementation:
Byte-level resource sharing is possible in both the serial and the parallel InvMixColumn decompositions. In the serial decomposition, the functions $c(X)$ and $f(X)$ have to be optimized separately. Since $c(X)$ corresponds to the MixColumn function, it can be implemented in the way described in Section III-B.1. For the implementation of the $f(X)$ function based on byte-level sharing, we propose expressing the first byte as follows:
$$b''_0 = \{05\}b'_0 + \{04\}b'_2 = \{04\}(b'_0 + b'_2) + b'_0 \tag{10}$$

where $b'_0$ and $b'_2$ are the first and the third output bytes of the MixColumn function. It can be seen [bold lines in the lower part of Fig. 1(a)] that the first and the third byte of the $f(X)$ function can share the term $\{04\}(b'_0 + b'_2)$ from the previous equation. The same approach can be used in the expression for the second and the fourth byte.

In the parallel InvMixColumn decomposition, the polynomial $c(X)$ and the extension polynomial $e(X)$ can be optimized either jointly or separately. In [10], the first byte of the InvMixColumn function is given as follows:
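$$d_0 = b_0 + \{04\}(a_0 + a_2) + \{08\}(a_0 + a_1 + a_2 + a_3) \tag{11}$$

where $b_0$ is the first MixColumn output byte from (9) and $d_0$ is the first InvMixColumn output byte.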
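The byte-level sharing in (10) can be sketched as follows, assuming the MixColumn stage is the shared form of (9). Multiplication by $\{04\}$ is two successive doublings, and each doubled sum $\{04\}(b'_0 + b'_2)$ and $\{04\}(b'_1 + b'_3)$ is computed once and used by two output bytes. The function names are our own; this is an illustration of the arithmetic, not the authors' circuit.

```python
AES_POLY = 0x11B


def xtime(a: int) -> int:
    """Multiply byte a by {02} in GF(2^8)."""
    a <<= 1
    return a ^ AES_POLY if a & 0x100 else a


def mix_column_shared(a: list[int]) -> list[int]:
    """MixColumn with the shared term t = a0 + a1 + a2 + a3, as in (9)."""
    t = a[0] ^ a[1] ^ a[2] ^ a[3]
    return [t ^ xtime(a[i] ^ a[(i + 1) % 4]) ^ a[i] for i in range(4)]


def f_shared(b: list[int]) -> list[int]:
    """Multiply the MixColumn output b' by f(X) = {04}X^2 + {05}, using (10)."""
    s02 = xtime(xtime(b[0] ^ b[2]))  # {04}(b'_0 + b'_2), shared by bytes 0 and 2
    s13 = xtime(xtime(b[1] ^ b[3]))  # {04}(b'_1 + b'_3), shared by bytes 1 and 3
    return [s02 ^ b[0], s13 ^ b[1], s02 ^ b[2], s13 ^ b[3]]


def inv_mix_column_serial(a: list[int]) -> list[int]:
    """Serial InvMixColumn: d(X)a(X) = f(X) * (c(X)a(X))."""
    return f_shared(mix_column_shared(a))


print([hex(v) for v in inv_mix_column_serial([0x8E, 0x4D, 0xA1, 0xBC])])
# prints ['0xdb', '0x13', '0x53', '0x45']
```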
TABLE I
IMPLEMENTATION RESULTS FOR ALL ARCHITECTURES IMPLEMENTED IN XILINX SPARTAN2S 30-PQ208 DEVICE
V. CONCLUSION
In this paper, we have shown a new relationship between MixColumn and InvMixColumn that enables efficient resource sharing between both operations. Following this new approach, we introduced two new representations of InvMixColumn based on parallel and serial decompositions. Both of them, and especially the new serial decomposition, enable very efficient resource sharing at the word level.

Furthermore, we demonstrated that efficient resource sharing based on these decompositions can be further enhanced by byte- and bit-level resource sharing. We proposed a new method for byte-level resource sharing for both the parallel and the serial InvMixColumn decomposition.

We have shown that the proposed architecture based on the serial InvMixColumn decomposition with byte-level resource sharing is the most area-efficient solution among all tested architectures. The second proposed solution, based on the parallel InvMixColumn decomposition, is the most area-efficient among the parallel architectures. Both of these architectures are very useful when the area of the ciphering/deciphering unit is a limiting factor (e.g., [12]). We have also demonstrated the practical benefits of using our solutions in a full AES implementation, showing dramatic circuit area savings with marginal performance variations.
APPENDIX
PROOF OF (6)
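Proof: If a polynomial $P(X) = \sum_{0 \le j \le 3} a_j X^j$ is in $K[X]$ (with $K$ defined in (1)), then, because squaring is additive in characteristic 2, its fourth power is equal to

$$P^4(X) = \sum_{0 \le j \le 3} a_j^4 X^{4j}. \tag{13}$$

Since $X^4 \equiv 1$ modulo $X^4 + 1$, every $X^{4j}$ reduces to 1, and

$$P^4(X) \equiv \sum_{0 \le j \le 3} a_j^4 = P^4(1) \pmod{X^4 + 1}. \tag{14}$$

For $c(X)$ from (2), with $\{03\} = x + 1$ and $\{02\} = x$, this gives

$$c^4(X) \equiv (x + 1)^4 + 1 + 1 + x^4 = 1,$$

so $c(X) \cdot c^3(X) \equiv 1$, and therefore $c^{-1}(X) = c^3(X)$.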
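Property (14) can also be spot-checked numerically. The sketch below (our own helper names, with a fixed random seed chosen arbitrarily) raises random words to the fourth power modulo $X^4 + 1$ and compares the result with the constant $P^4(1)$, where $P(1)$ is the XOR of the four coefficients; it then confirms $c^4(X) = 1$.

```python
import random

AES_POLY = 0x11B


def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in K = GF(2^8)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= AES_POLY
        b >>= 1
    return result


def poly_mul(p: list[int], q: list[int]) -> list[int]:
    """Multiply two words (index i = coefficient of X^i) modulo X^4 + 1."""
    r = [0, 0, 0, 0]
    for i in range(4):
        for j in range(4):
            r[(i + j) % 4] ^= gf_mul(p[i], q[j])
    return r


def poly_pow4(p: list[int]) -> list[int]:
    sq = poly_mul(p, p)
    return poly_mul(sq, sq)


def byte_pow4(x: int) -> int:
    sq = gf_mul(x, x)
    return gf_mul(sq, sq)


rng = random.Random(2005)  # fixed seed so the check is reproducible
for _ in range(1000):
    p = [rng.randrange(256) for _ in range(4)]
    p_at_1 = p[0] ^ p[1] ^ p[2] ^ p[3]  # P(1): sum of the coefficients
    assert poly_pow4(p) == [byte_pow4(p_at_1), 0, 0, 0]  # (14)

C = [0x02, 0x01, 0x01, 0x03]  # c(X), eq. (2)
print(poly_pow4(C))  # prints [1, 0, 0, 0]
```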
REFERENCES
[1] FIPS 197: Advanced Encryption Standard, 2001.
[2] K. Gaj and P. Chodowiec, "Comparison of the hardware performance of the AES candidates using reconfigurable hardware," in Proc. 3rd Advanced Encryption Standard Candidate Conf. (AES3), New York, Apr. 2000, pp. 40–54.
[3] A. J. Elbirt, W. Yip, B. Chetwynd, and C. Paar, "An FPGA implementation and performance evaluation of the AES block cipher candidate algorithm finalists," in Proc. 3rd Advanced Encryption Standard Candidate Conf. (AES3), New York, Apr. 2000, pp. 13–27.
[4] V. Rijmen. Efficient implementation of the Rijndael S-box. [Online]. Available: http://www.esat.kuleuven.ac.be/~rijmen/rijndael/sbox.pdf
[5] A. Rudra, P. K. Dubey, C. S. Jutla, V. Kumar, J. Rao, and P. Rohatgi, "Efficient Rijndael encryption implementation with composite field arithmetic," in Proc. Int. Workshop Cryptographic Hardware and Embedded Systems (CHES'01), vol. 2161, 2001, pp. 171–184.
[6] A. Satoh, S. Morioka, K. Takano, and S. Munetoh, "A compact Rijndael hardware architecture with S-box optimization," in Proc. Theory and Application of Cryptology and Information Security (ASIACRYPT'01), vol. 2248, Gold Coast, Australia, Dec. 9–13, 2001, pp. 239–254.
[7] P. Davies. Thales e-Security white paper: Flexible security. [Online]. Available: http://www.thales-esecurity.com/Whitepapers/documents/WP_Flexible_Security.pdf
[8] V. Fischer and M. Drutarovský, "Two methods of Rijndael implementation in reconfigurable hardware," in Proc. Int. Workshop on Cryptographic Hardware and Embedded Systems (CHES'01), vol. 2162, Paris, France, May 2001, pp. 81–96.
[9] J. Wolkerstorfer, "An ASIC implementation of the AES MixColumn operation," in Proc. Austrochip 2001, Vienna, Austria, Oct. 12, 2001, pp. 129–132.
[10] C.-C. Lu and S.-Y. Tseng, "Integrated design of AES (advanced encryption standard) encrypter and decrypter," in Proc. IEEE Int. Conf. Application-Specific Systems, Architectures and Processors (ASAP'02), 2002, pp. 277–285.
[11] X. Zhang and K. K. Parhi, "Implementation approaches for the advanced encryption standard algorithm," IEEE Circuits Syst. Mag., vol. 2, no. 4, pp. 24–46, Mar. 2002.
[12] P. Chodowiec and K. Gaj, "Very compact FPGA implementation of the AES algorithm," in Proc. Int. Workshop on Cryptographic Hardware and Embedded Systems (CHES'03), vol. 2779, Cologne, Germany, Sep. 2003, pp. 319–333.