
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 13, NO. 8, AUGUST 2005

InvMixColumn Decomposition and Multilevel Resource Sharing in AES Implementations

Viktor Fischer, Milos Drutarovský, Paweł Chodowiec, and François Gramain

Abstract: Hardware implementations of cryptography face increasingly stringent demands for lower cost and greater computational power. To meet those demands, more efficient approaches to implementation are needed. This paper presents detailed studies of the MixColumn and InvMixColumn operations used in the Advanced Encryption Standard, aimed at their hardware implementation in constrained environments. Our studies are supported by mathematical analysis of both transformations and lead to efficient serial and parallel decompositions. Furthermore, deeper resource sharing is demonstrated at the word, byte, and bit levels. All derived architectures are evaluated using popular low-cost field-programmable gate arrays. Application of the proposed methods reduced the reconfigurable logic area of the complete cipher by up to 20%.
Index Terms: Advanced encryption standard, cryptography, field-programmable gate array (FPGA), hardware architectures, Rijndael, VLSI.

I. INTRODUCTION
In October 2000, the National Institute of Standards and Technology
selected the Rijndael cipher for the Advanced Encryption Standard
(AES) [1]. Early hardware implementations of the cipher [2], [3] were
straightforward and indicated relatively large requirements for circuit
area. Recent advances in research resulted in a compact architecture for
AES S-boxes based on transformation of the original field $\mathrm{GF}(2^8)$ into
a composite field $\mathrm{GF}((2^4)^2)$ [4]. This approach is particularly useful
in the case of ASIC implementations [5], [6].
Cryptographic applications are typically based on application-specific
integrated circuit (ASIC) technology, as it is believed to
provide a sufficient security level. However, recent advances in attacks
on implementations show that this preconceived idea is not necessarily valid. Static implementations based on ASICs are inherently
impossible to update or upgrade in response to new security threats.
On the other hand, field-programmable gate array (FPGA) technology
has much greater potential for providing a higher security level because
of its capability for dynamic reconfiguration [7]. We believe that
FPGA technology is an important future platform for cryptographic
applications.
FPGA implementations typically utilize embedded memory blocks
for implementation of S-boxes [8]. This approach achieves the best balance between utilization of embedded memory blocks and the more versatile, and thus more critical, reconfigurable logic. Therefore, in the case
of FPGAs, a significant portion of the logic resources is consumed by
MixColumn and InvMixColumn implementations, and their area optimization is crucial in constrained environments.
Manuscript received January 21, 2004; revised December 11, 2004 and April 4, 2005. This work was supported in part by the French national program ACI Cryptologie (Project CR/02 2 0041) as a part of the project CryptArchi.
V. Fischer is with the Laboratoire Traitement du Signal et Instrumentation, UMR CNRS 5516, Université Jean Monnet, 42000 Saint-Etienne, France (e-mail: fischer@univ-st-etienne.fr).
M. Drutarovský is with the Department of Electronics and Multimedia Communications, Technical University of Kosice, 041 20 Kosice, Slovak Republic (e-mail: Milos.Drutarovsky@tuke.sk).
P. Chodowiec is with the Department of Electrical and Computer Engineering, George Mason University, Fairfax, VA 22030 USA (e-mail: pchodowi@ieee.org).
F. Gramain is with the Laboratoire d'Arithmétique et Algèbre, Faculté des Sciences, Université Jean Monnet, 42023 Saint-Etienne, France (e-mail: gramain@univ-st-etienne.fr).
Digital Object Identifier 10.1109/TVLSI.2005.853606


Most of the existing implementations of the AES address MixColumn and InvMixColumn separately. We found only a few
publications demonstrating potential for resource sharing between
MixColumn and InvMixColumn [6], [9]–[12]. In this paper, we analyze the basic operations employed in MixColumn and InvMixColumn,
uncovering several new possibilities for resource sharing on different
levels.
II. MIXCOLUMN AND INVMIXCOLUMN OPERATIONS
MixColumn is one of the four operations used in AES encryption. InvMixColumn is the inverse of MixColumn and is used
in decryption. Both functions apply transformations at the byte and word levels, which are explained in detail below.
A. Byte-Level Operations
Elementary operations are defined at the byte level. Each byte is considered as a polynomial (of degree at most 7) with coefficients in the Galois field $\mathrm{GF}(2)$. A byte $a(x)$ (or $a$ in simplified notation) is a sum $a(x) = \sum_{0 \le i \le 7} a_i x^i$, where $a_i \in \{0, 1\}$. In other words, bytes $a$ are elements of the Galois field $K = \mathrm{GF}(2^8)$ constructed as the quotient

$$K = \mathrm{GF}(2)[x] / (x^8 + x^4 + x^3 + x + 1). \tag{1}$$

Addition of polynomials in $K$ corresponds to a simple bit-wise XOR of the polynomial coefficients. Multiplication of polynomials in the field $K$ corresponds to their multiplication modulo the irreducible polynomial $m(x) = x^8 + x^4 + x^3 + x + 1$. Multiplication of a polynomial $a(x)$ by $x$ (in hexadecimal notation $\{02\}$) in $K$ can be realized using the so-called xtime() function: $x \cdot a(x) = \mathrm{xtime}(a(x))$ [1].
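The byte-level arithmetic described above can be sketched in a few lines of Python. This is only an illustrative software model, not the hardware described in the paper; the helper name `gmul` is an assumption introduced here, while `xtime` follows the name used in [1].

```python
def xtime(a: int) -> int:
    """Multiply a byte by x ({02}) in K, reducing modulo m(x) = x^8 + x^4 + x^3 + x + 1."""
    a <<= 1
    if a & 0x100:          # a degree-8 term appeared, so subtract (XOR) m(x) = 0x11B
        a ^= 0x11B
    return a & 0xFF


def gmul(a: int, b: int) -> int:
    """Multiply two bytes in K by shift-and-add (hypothetical helper name)."""
    result = 0
    while b:
        if b & 1:
            result ^= a    # addition in K is bit-wise XOR
        a = xtime(a)       # move on to the next power of x
        b >>= 1
    return result
```

Multiplication by any constant such as {03} or {0E} reduces to a few `xtime` calls and XORs, which is why the number and complexity of constant coefficients determines the hardware cost discussed later.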
B. Word-Level Operations
MixColumn and InvMixColumn operations are defined over 4-byte words that represent a column of the State (the State represents the internal state of the transformed data expressed as a rectangular array of bytes) [1]. Such 4-byte words are considered as polynomials (of degree at most 3) with coefficients in $K = \mathrm{GF}(2^8)$, defined in the ring of polynomials $K[X]$ modulo $X^4 + 1$ and denoted as $R = K[X]/(X^4 + 1)$. Addition of such polynomials corresponds to bit-wise XOR of their coefficients. Their multiplication is reduced modulo $M(X) = X^4 + 1$. Since $X^j \bmod (X^4 + 1) = X^{j \bmod 4}$, the polynomial $M(X)$ simply shifts rows of the State.
The MixColumn transformation multiplies each column of the State by the polynomial $c(X)$ in the ring $R$, where $c(X)$ is defined as follows:

$$c(X) = \{03\}X^3 + X^2 + X + \{02\}. \tag{2}$$
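Multiplication in the ring $R$ can be modeled as a cyclic convolution of the four coefficient bytes. The following Python sketch illustrates MixColumn as multiplication by $c(X)$; the function names `gmul`, `ring_mul`, and `mix_column`, and the convention of storing the lowest-degree coefficient first, are assumptions made for this example.

```python
def gmul(a, b):
    """Byte multiplication in K = GF(2^8) with reduction polynomial 0x11B."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a = (a << 1) ^ (0x11B if a & 0x80 else 0)
        b >>= 1
    return r


def ring_mul(p, q):
    """Multiply two words in R = K[X]/(X^4 + 1).

    p and q are lists [p0, p1, p2, p3] holding the coefficients of
    X^0 .. X^3; the product term X^(i+j) folds back to X^((i+j) mod 4).
    """
    out = [0, 0, 0, 0]
    for i in range(4):
        for j in range(4):
            out[(i + j) % 4] ^= gmul(p[i], q[j])
    return out


C = [0x02, 0x01, 0x01, 0x03]   # c(X) from (2), lowest-degree coefficient first


def mix_column(col):
    """MixColumn of one State column [a0, a1, a2, a3]."""
    return ring_mul(C, col)
```

With this convention, the first output byte is {02}a0 + {03}a1 + a2 + a3, which agrees with the MixColumns definition in [1].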

The InvMixColumn transformation is the inverse of the MixColumn operation. InvMixColumn multiplies each column of the State by

$$d(X) = \{0B\}X^3 + \{0D\}X^2 + \{09\}X + \{0E\} \tag{3}$$

where $d(X) = c^{-1}(X)$ in $R$.
III. INVMIXCOLUMN DECOMPOSITION AND MULTILEVEL RESOURCE SHARING

From (2) and (3), we can see that the coefficients of $d(X)$ are more complex than the coefficients of $c(X)$. As a result, hardware implementing AES decryption is larger and slower than that for encryption. In order to reduce hardware cost, InvMixColumn can be decomposed to share logic resources with MixColumn. Since both functions transform 32-bit words, we will call this decomposition word-level resource sharing. There exist two possible decompositions of InvMixColumn: parallel and serial. In addition to word-level sharing, resources can be shared on a byte level and on a bit level as well. These approaches will be discussed in Sections III-A and III-B.

1063-8210/$20.00 © 2005 IEEE

A. Word-Level Resource Sharing
1) Parallel InvMixColumn Decomposition: Parallel InvMixColumn decomposition was first proposed by J. Wolkerstorfer in [9]. It is based on the observation that $d(X)$ can be expressed using $c(X)$ in the following way:

$$d(X) = c(X) + e(X) \tag{4}$$

where $e(X)$ is an extension polynomial defined as

$$e(X) = \{08\}X^3 + \{0C\}X^2 + \{08\}X + \{0C\}. \tag{5}$$
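Relations (4) and (5) can be checked numerically. The sketch below, a software model under the same assumptions as before (hypothetical helper names, lowest-degree coefficient first), computes InvMixColumn as the XOR of the MixColumn output and the output of the extension polynomial.

```python
def gmul(a, b):
    """Byte multiplication in K = GF(2^8) with reduction polynomial 0x11B."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a = (a << 1) ^ (0x11B if a & 0x80 else 0)
        b >>= 1
    return r


def ring_mul(p, q):
    """Multiply two words in R = K[X]/(X^4 + 1), coefficients of X^0..X^3."""
    out = [0, 0, 0, 0]
    for i in range(4):
        for j in range(4):
            out[(i + j) % 4] ^= gmul(p[i], q[j])
    return out


C = [0x02, 0x01, 0x01, 0x03]   # c(X) from (2)
D = [0x0E, 0x09, 0x0D, 0x0B]   # d(X) from (3)
E = [0x0C, 0x08, 0x0C, 0x08]   # e(X) from (5)


def inv_mix_column_parallel(col):
    """InvMixColumn as MixColumn output XOR extension output, following (4)."""
    mixed = ring_mul(C, col)      # this result is also the MixColumn output
    extra = ring_mul(E, col)      # extension function e(X)
    return [m ^ x for m, x in zip(mixed, extra)]
```

Because the MixColumn result is an intermediate value, a circuit built this way can deliver both transformations from the same logic, which is the word-level sharing the text describes.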

2) Serial InvMixColumn Decomposition: The inverse $d(X)$ of the polynomial $c(X)$ in the ring $R$ is given by the formula (see the Appendix)

$$d(X) = c^{-1}(X) = c^3(X) \tag{6}$$

which suggests that the InvMixColumn operation can be realized by repeating MixColumn three times. For hardware implementations, (6) can be expressed as

$$d(X) = c(X) \cdot f(X) \tag{7}$$

where

$$f(X) = c^2(X) = \{04\}X^2 + \{05\}. \tag{8}$$

Therefore, the InvMixColumn function can be implemented using the MixColumn function and the $f(X)$ polynomial. Comparing $c(X)$ and $f(X)$, we see that $f(X)$ has a much simpler implementation than $c(X)$ because two of its coefficients are zero.
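The identities (6) to (8) can be confirmed by direct computation in $R$. The following sketch is an illustrative check rather than the authors' implementation; `ring_mul` and `inv_mix_column_serial` are names assumed for this example.

```python
def gmul(a, b):
    """Byte multiplication in K = GF(2^8) with reduction polynomial 0x11B."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a = (a << 1) ^ (0x11B if a & 0x80 else 0)
        b >>= 1
    return r


def ring_mul(p, q):
    """Multiply two words in R = K[X]/(X^4 + 1), coefficients of X^0..X^3."""
    out = [0, 0, 0, 0]
    for i in range(4):
        for j in range(4):
            out[(i + j) % 4] ^= gmul(p[i], q[j])
    return out


C = [0x02, 0x01, 0x01, 0x03]   # c(X) from (2)
D = [0x0E, 0x09, 0x0D, 0x0B]   # d(X) from (3)

F = ring_mul(C, C)             # c^2(X), to be compared with f(X) in (8)
C3 = ring_mul(C, F)            # c^3(X), to be compared with d(X) in (6)
C4 = ring_mul(F, F)            # c^4(X), which the Appendix argues equals 1


def inv_mix_column_serial(col):
    """InvMixColumn as MixColumn followed by multiplication by f(X), following (7)."""
    return ring_mul(F, ring_mul(C, col))
```

Only the constants {04} and {05} appear in `F`, so the second stage needs far fewer `xtime`-style doublings than a direct multiplication by $d(X)$.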
B. Byte-Level and Bit-Level Resource Sharing
Multiplication in algebraic fields is distributive over addition. This property enables byte-level resource sharing. Bit-level resource sharing can further improve logic reduction, and hardware compilers usually do it efficiently.
1) Byte-Level Resource Sharing in MixColumn Implementation: The first byte of the MixColumn implementation based on byte-level resource sharing [1] (see dashed rectangles in Fig. 1) is expressed as follows:

$$\begin{aligned} b_0 &= \{02\}a_0 + \{03\}a_1 + a_2 + a_3 \\ &= (a_0 + a_1 + a_2 + a_3) + \{02\}(a_0 + a_1) + a_0. \end{aligned} \tag{9}$$

The bold line in the dashed rectangles in Fig. 1 shows that the term $a_0 + a_1 + a_2 + a_3$ can be shared by all four bytes of the MixColumn function. Note that both addition and subtraction in $\mathrm{GF}(2^8)$ are realized by bit-wise addition modulo 2 (XOR). We use the symbol $\oplus$ for this operation in Fig. 1.
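Expression (9) translates directly into a software model in which the sum of all four input bytes is computed once. The source gives only the first output byte; the remaining three bytes in this sketch are obtained by rotating the indices, which is an assumption based on the cyclic structure of $c(X)$. The function name `mix_column_shared` is hypothetical.

```python
def xtime(a):
    """Multiply a byte by {02} in GF(2^8) with reduction polynomial 0x11B."""
    a <<= 1
    if a & 0x100:
        a ^= 0x11B
    return a & 0xFF


def mix_column_shared(col):
    """MixColumn of [a0, a1, a2, a3] using the shared term of (9)."""
    a0, a1, a2, a3 = col
    t = a0 ^ a1 ^ a2 ^ a3              # term shared by all four output bytes
    return [
        t ^ xtime(a0 ^ a1) ^ a0,       # b0, as written in (9)
        t ^ xtime(a1 ^ a2) ^ a1,       # b1 to b3: same pattern with rotated indices
        t ^ xtime(a2 ^ a3) ^ a2,
        t ^ xtime(a3 ^ a0) ^ a3,
    ]
```

In this form each output byte costs one `xtime` and a handful of XORs, instead of separate multiplications by {02} and {03}.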
2) Byte-Level Resource Sharing in InvMixColumn Implementation: Byte-level resource sharing is possible in both serial and parallel InvMixColumn decompositions. In the serial decomposition, functions $c(X)$ and $f(X)$ have to be optimized separately. Since $c(X)$ corresponds to the MixColumn function, it can be implemented in the way described in Section III-B.1. For the $f(X)$ function implementation based on byte-level sharing, we propose expressing the first byte in the following way:

$$b'_0 = \{05\}b_0 + \{04\}b_2 = \{04\}(b_0 + b_2) + b_0 \tag{10}$$

where $b_0$ and $b_2$ are the first and the third output bytes of the MixColumn function. It can be seen [bold lines in the lower part of Fig. 1(a)] that the first and the third byte of the $f(X)$ function can share the term $\{04\}(b_0 + b_2)$ from the previous equation. The same approach can be used in the expression for the second and the fourth byte.

Fig. 1. MixColumn and InvMixColumn implementation based on (a) InvMixColumn serial and (b) parallel decomposition and byte-level resource sharing.

In the parallel InvMixColumn decomposition, the polynomial $c(X)$ and the extension polynomial $e(X)$ can be optimized either jointly or separately. In [10], the first byte of the InvMixColumn function is given as follows:

$$\begin{aligned} b_0 &= \{0E\}a_0 + \{0B\}a_1 + \{0D\}a_2 + \{09\}a_3 \\ &= \{02\}(a_0 + a_1) + a_1 + a_2 + a_3 + \{04\}(\{02\}(a_0 + a_1) + \{02\}(a_2 + a_3) + (a_0 + a_2)) \end{aligned} \tag{11}$$

where the term $\{02\}(a_0 + a_1) + a_1 + a_2 + a_3$ represents the first byte of the MixColumn function, and the term $\{04\}(\{02\}(a_0 + a_1) + \{02\}(a_2 + a_3) + (a_0 + a_2))$ represents the first byte of the extension function. Note that the term $\{02\}(a_0 + a_1)$ can be shared between the first output byte of the MixColumn function and the first byte and the third byte of the extension function.

TABLE I
IMPLEMENTATION RESULTS FOR ALL ARCHITECTURES IMPLEMENTED IN XILINX SPARTAN2S 30-PQ208 DEVICE

In order to further reduce the area of the InvMixColumn function, we propose another solution based on the parallel InvMixColumn decomposition. In this case, the first byte of the InvMixColumn is expressed as follows:

$$\begin{aligned} b_0 &= \{0E\}a_0 + \{0B\}a_1 + \{0D\}a_2 + \{09\}a_3 \\ &= (a_0 + a_1 + a_2 + a_3) + \{02\}(a_0 + a_1) + a_0 \\ &\quad + \{02\}(\{04\}(a_0 + a_2) + \{04\}(a_1 + a_3)) + \{04\}(a_0 + a_2) \end{aligned} \tag{12}$$

where the term $(a_0 + a_1 + a_2 + a_3) + \{02\}(a_0 + a_1) + a_0$ represents the first byte of the MixColumn function, and the term $\{02\}(\{04\}(a_0 + a_2) + \{04\}(a_1 + a_3)) + \{04\}(a_0 + a_2)$ represents the first byte of the extension function. Thus, this configuration enables concurrent MixColumn and InvMixColumn implementation. It can be seen that all four output bytes of the extension function share the term $\{02\}(\{04\}(a_0 + a_2) + \{04\}(a_1 + a_3))$. Furthermore, the terms $\{04\}(a_0 + a_2)$ and $\{04\}(a_1 + a_3)$ are shared by the first and the third output bytes and by the second and the fourth output bytes, respectively, as shown in Fig. 1(b).
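The byte-level sharing used in the serial form (10) and the proposed parallel form (12) can be compared in a small software model. This is a sketch under stated assumptions: the paper gives only the first output byte of each form, so the other bytes follow from the sharing pattern described in the text, and all function names are hypothetical.

```python
def xtime(a):
    """Multiply a byte by {02} in GF(2^8) with reduction polynomial 0x11B."""
    a <<= 1
    if a & 0x100:
        a ^= 0x11B
    return a & 0xFF


def x4(a):
    """Multiply a byte by {04}, i.e. two successive doublings."""
    return xtime(xtime(a))


def mix_column(col):
    """MixColumn with the shared term of (9)."""
    a0, a1, a2, a3 = col
    t = a0 ^ a1 ^ a2 ^ a3
    return [t ^ xtime(a0 ^ a1) ^ a0, t ^ xtime(a1 ^ a2) ^ a1,
            t ^ xtime(a2 ^ a3) ^ a2, t ^ xtime(a3 ^ a0) ^ a3]


def inv_mix_column_serial(col):
    """MixColumn followed by f(X), with the shared terms of (10)."""
    b0, b1, b2, b3 = mix_column(col)
    u = x4(b0 ^ b2)                    # shared by output bytes 0 and 2
    v = x4(b1 ^ b3)                    # shared by output bytes 1 and 3
    return [u ^ b0, v ^ b1, u ^ b2, v ^ b3]


def inv_mix_column_parallel(col):
    """MixColumn plus extension function, with the shared terms of (12)."""
    a0, a1, a2, a3 = col
    u = x4(a0 ^ a2)                    # shared by extension bytes 0 and 2
    v = x4(a1 ^ a3)                    # shared by extension bytes 1 and 3
    w = xtime(u ^ v)                   # shared by all four extension bytes
    m = mix_column(col)                # MixColumn output, available concurrently
    return [m[0] ^ w ^ u, m[1] ^ w ^ v, m[2] ^ w ^ u, m[3] ^ w ^ v]
```

Both functions should return the same word for any input. The serial form applies its {04} multipliers to the MixColumn outputs, whereas the parallel form applies them to the inputs, which is consistent with the parallel variant having the shorter critical path reported in Section IV.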
IV. HARDWARE IMPLEMENTATIONS OF THE SHARED MIXCOLUMN AND INVMIXCOLUMN FUNCTIONS
In order to evaluate all discussed architectures, we created their implementations in a Xilinx Spartan2s30 FPGA using Synplify PRO 7.7.1 and Xilinx ISE 6.2i tools. Implementation results are collected in Table I under the following abbreviations:
bothmixcol_n    follows (2) and (3) directly, with neither resource sharing nor decomposition attempted;
bothmixcol_p    follows the parallel decomposition by J. Wolkerstorfer [9] with no byte-level resource sharing;
bothmixcol_s    follows our new serial decomposition from (7) with no byte-level resource sharing;
bothmixcol_pb1  derived from bothmixcol_p but with the byte-level sharing described by C.-C. Lu et al. [10]. This architecture is specified by (11);
bothmixcol_pb2  follows our new parallel decomposition with byte-level sharing shown in Fig. 1(b) and is specified by (12);
bothmixcol_sb   derived from bothmixcol_s but with byte-level resource sharing from Fig. 1(a).
A. MixColumn/InvMixColumn Implementation Results
Implementations of MixColumn/InvMixColumn circuits are characterized by the circuit area they occupy [number of lookup tables (LUTs)] and the propagation delay $t_{PD}$ obtained by the timing analysis tool after placement and routing. Results collected in Table I clearly indicate that the bothmixcol_n architecture occupies significantly more space than architectures based on parallel and serial InvMixColumn decomposition, even without byte-level sharing: see bothmixcol_p and bothmixcol_s. It appears that bothmixcol_sb is the most area-efficient and bothmixcol_pb2 is the fastest solution in FPGA.
Parallel decomposition by Lu et al. [10] represents the most compact decomposition published to date. Quick inspection of their circuit shows that it requires 304 2-input XOR gates. Our best parallel decomposition bothmixcol_pb2 requires 219 XOR gates, which is 28% fewer. Our serial decomposition bothmixcol_sb requires only 192 XOR gates, which constitutes a reduction of 37% compared to [10].
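The quoted percentages follow from the XOR gate counts given above. A short arithmetic check, using only the three counts stated in the text (the function and dictionary names are illustrative):

```python
def reduction_percent(baseline: int, improved: int) -> float:
    """Relative saving of `improved` over `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline


# 2-input XOR gate counts quoted in the text
XOR_GATES = {"Lu et al. [10]": 304, "bothmixcol_pb2": 219, "bothmixcol_sb": 192}

baseline = XOR_GATES["Lu et al. [10]"]
saving_pb2 = reduction_percent(baseline, XOR_GATES["bothmixcol_pb2"])
saving_sb = reduction_percent(baseline, XOR_GATES["bothmixcol_sb"])
```

Rounded to whole percentages, the two savings match the 28% and 37% figures reported for the parallel and serial decompositions.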
B. Complete AES Cipher Implementation Results
We also provide implementation results for the full AES cipher with all MixColumn/InvMixColumn architectures. The AES design was specifically crafted for embedded systems and is currently one of the smallest published AES implementations [12]. Note that this implementation maps S-boxes into memory elements called BRAMs. The effects of the discussed resource sharing methods on the overall cipher implementation are indicated in the last column of Table I as the percentage of area saved compared to the straightforward implementation. It appears that usage of bothmixcol_sb results in the smallest circuit, reducing the number of required LUTs by over 20%. Interestingly, performance variations among implementations are negligibly small.

V. CONCLUSION
In this paper, we have shown a new relationship between MixColumn and InvMixColumn which enables efficient resource sharing between both operations. Following this new approach, we introduced two new representations for the InvMixColumn based on parallel and serial decompositions. Both of them, and especially the new serial decomposition, enable very efficient resource sharing on the word level.
Furthermore, we demonstrated that efficient resource sharing based on these decompositions can be further enhanced by byte- and bit-level resource sharing. We proposed a new method for byte-level resource sharing for both parallel and serial InvMixColumn decomposition.
We have shown that the proposed architecture based on the serial InvMixColumn decomposition with byte-level resource sharing is the most area-efficient solution among all tested architectures. The second proposed solution, based on the parallel InvMixColumn decomposition, is the most area-efficient among parallel architectures. Both of these architectures are very useful when the area of the ciphering/deciphering unit is a limiting factor (e.g., [12]). We have also demonstrated practical benefits of using our solutions in a full AES implementation, showing substantial circuit area savings (over 20% of the LUTs) with marginal performance variations.


APPENDIX
PROOF OF (6)

Proof: If a polynomial $P(X) = \sum_{0 \le j \le 3} a_j X^j$ is in $K[X]$ [with $K$ defined in (1)], then its fourth power is equal to

$$P^4(X) = \sum_{0 \le j \le 3} a_j^4 X^{4j} \tag{13}$$

because $K$ is a field of characteristic 2. Since $X^{4j} \bmod (X^4 + 1) = 1$, in $R$ we get $X^{4j} = 1$ and from (13) we obtain

$$P^4(X) = \sum_{0 \le j \le 3} a_j^4 = P^4(1). \tag{14}$$

From (2), it follows that

$$c(1) = (x + 1) + 1 + 1 + x = 1$$

and, therefore, from (14), we get

$$c^4(X) = 1 = c(X) \cdot c^3(X) = c(X) \cdot d(X)$$

which completes the proof of (6).

REFERENCES
[1] FIPS 197: Advanced Encryption Standard, 2001.
[2] K. Gaj and P. Chodowiec, "Comparison of the hardware performance of the AES candidates using reconfigurable hardware," in Proc. 3rd Advanced Encryption Standard Candidate Conf. (AES3), New York, Apr. 2000, pp. 40–54.
[3] A. J. Elbirt, W. Yip, B. Chetwynd, and C. Paar, "An FPGA implementation and performance evaluation of the AES block cipher candidate algorithm finalists," in Proc. 3rd Advanced Encryption Standard Candidate Conf. (AES3), New York, Apr. 2000, pp. 13–27.
[4] V. Rijmen. Efficient implementation of the Rijndael S-box. [Online]. Available: http://www.esat.kuleuven.ac.be/~rijmen/rijndael/sbox.pdf
[5] A. Rudra, P. K. Dubey, C. S. Jutla, V. Kumar, J. Rao, and P. Rohatgi, "Efficient Rijndael encryption implementation with composite field arithmetic," in Proc. Int. Workshop Cryptographic Hardware and Embedded Systems (CHES'01), vol. 2161, 2001, pp. 171–184.
[6] A. Satoh, S. Morioka, K. Takano, and S. Munetoh, "A compact Rijndael hardware architecture with S-box optimization," in Proc. Theory and Application of Cryptology and Information Security (ASIACRYPT'01), vol. 2248, Gold Coast, Australia, Dec. 9–13, 2001, pp. 239–254.
[7] P. Davies. Thales e-Security white paper: Flexible security. [Online]. Available: http://www.thales-esecurity.com/Whitepapers/documents/WP_Flexible_Security.pdf
[8] V. Fischer and M. Drutarovský, "Two methods of Rijndael implementation in reconfigurable hardware," in Proc. Int. Workshop on Cryptographic Hardware and Embedded Systems (CHES'01), vol. 2162, Paris, France, May 2001, pp. 81–96.
[9] J. Wolkerstorfer, "An ASIC implementation of the AES MixColumn operation," in Proc. Austrochip 2001, Vienna, Austria, Oct. 12, 2001, pp. 129–132.
[10] C.-C. Lu and S.-Y. Tseng, "Integrated design of AES (advanced encryption standard) encrypter and decrypter," in Proc. IEEE Int. Conf. Application-Specific Systems, Architectures and Processors (ASAP'02), 2002, pp. 277–285.
[11] X. Zhang and K. K. Parhi, "Implementation approaches for the advanced encryption standard algorithm," IEEE Circuits Syst. Mag., vol. 2, no. 4, pp. 24–46, Mar. 2002.
[12] P. Chodowiec and K. Gaj, "Very compact FPGA implementation of the AES algorithm," in Proc. Int. Workshop on Cryptographic Hardware and Embedded Systems (CHES'03), vol. 2779, Cologne, Germany, Sep. 2003, pp. 319–333.

A 32-Bit Carry Lookahead Adder Using Dual-Path All-N Logic

Ge Yang, Seong-Ook Jung, Kwang-Hyun Baek, Soo Hwan Kim, Suki Kim, and Sung-Mo Kang

Abstract: We have developed dual-path all-N logic (DPANL) and applied it to 32-bit adder design for higher performance. The speed is significantly enhanced due to reduced capacitance at each evaluation node of dynamic circuits. The power saving is achieved due to reduced adder cell size and minimal race problem. Post-layout simulation results show that this adder can operate at frequencies up to 1.85 GHz for 0.35-μm 1P4M CMOS technology and is 32.4% faster than the adder using all-N transistor (ANT). It also consumes 29.2% less power than the ANT adder. A 0.35-μm CMOS chip has been fabricated and tested to verify the functionality and performance of the DPANL adder on silicon.
Index Terms: CMOS, dynamic-logic circuit, high performance, low-power design.

I. INTRODUCTION

Much work has been done recently on high-performance low-power adder design critical for microprocessors [1]–[3]. Dynamic circuits have been widely used owing to their faster switching speed and smaller area than conventional static CMOS circuits. Pipelined structures have also been used to further enhance the operating frequency to achieve higher throughput.
In pipelined systems of NORA [4], ZIPPER [5], and TSPC [6], low-speed pMOS logic blocks are used. For speed improvement, all-N logic (ANL) [7] was introduced to use only high-speed nMOS logic in all stages. All-N transistor (ANT) [1] was developed by using a feedback transistor pair to improve the performance of ANL. For further speed improvement with reduced power consumption, we propose dual-path all-N logic (DPANL).
This paper is organized as follows. Section II reviews previous work. Section III introduces DPANL. Simulation and chip testing results are shown in Section IV, followed by the conclusion in Section V.
II. PREVIOUS WORK
NORA uses two-phase clock signals instead of four-phase clock signals and avoids the race problem caused by clock skews with constrained logic composition [6]. True single-phase clock (TSPC) uses
only a single-phase clock without inversion. It does not suffer from the
clock skew problems and thus can operate at high clock frequency [7].
Both NORA and TSPC pipelined systems have the drawback of using
low speed pMOS logic blocks that limit the performance of pipelined
systems.
Fig. 1 shows a circuit diagram of the CMOS dynamic circuit ANL.
It removes the drawback of TSPC logic by using an nMOS logic tree
in N2-block. To overcome the voltage drop problem in the nMOS logic
tree, a positive feedback pMOS P3 in N2-block is used to pull up the
Manuscript received January 21, 2004; revised December 14, 2004. This
work was supported in part by Semiconductor Research Corporation under
Contract 2001-HJ-891, in part by Intel Corporation, and in part by BK21
program.
G. Yang is with the Nvidia Corporation, Santa Clara, CA 95050 USA (e-mail:
gyang@nvidia.com).
S.-O. Jung is with Qualcomm Inc., San Diego, CA 92121 USA.
K.-H. Baek is with Rockwell Scientific, Thousand Oaks, CA 91360 USA.
S. H. Kim and S. Kim are with Korea University, Seoul 136-701, Korea.
S.-M. Kang is with the University of California, Santa Cruz, CA 95064 USA.
Digital Object Identifier 10.1109/TVLSI.2005.853605
