
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-I: REGULAR PAPERS, VOL. 58, NO. 5, MAY 2011

Low Latency $GF(2^m)$ Polynomial Basis Multiplier


José Luis Imaña


Abstract: Finite field $GF(2^m)$ arithmetic is becoming increasingly important for a variety of applications, including cryptography, coding theory and computer algebra. Among finite field arithmetic operations, $GF(2^m)$ multiplication is of special interest because it is considered the most important building block. This contribution describes a new low-latency parallel-in/parallel-out sequential polynomial basis multiplier over $GF(2^m)$. For irreducible $GF(2^m)$ generating polynomials $f(x) = x^m + x^{k_t} + \cdots + x^{k_1} + 1$ with $m \geq 2k_t - 1$, the proposed multiplier has a theoretical latency of $2k_t + 1$ cycles. This latency is the lowest one found in the literature for $GF(2^m)$ multipliers. Furthermore, the condition $m \geq 2k_t - 1$ is especially important because the five binary irreducible polynomials recommended by NIST for elliptic curve cryptography (ECC) implementations satisfy it.

Index Terms: Finite fields, implementation, multiplication, polynomial basis, VLSI.

I. INTRODUCTION

Finite fields play an important role in many applications, particularly in coding theory [17], [18], computer algebra and elliptic curve cryptography. Furthermore, efficient hardware implementations of $GF(2^m)$ arithmetic operations are highly desirable, especially for multiplication, because exponentiation, inversion and division can be computed by repeated multiplications [1], [10], [11]. Many approaches and architectures have been proposed to perform $GF(2^m)$ multiplication with different bases of representation, such as the polynomial (standard), normal or dual basis. However, polynomial basis multipliers are more efficient and more widely used than multipliers based on normal or dual bases. When using a polynomial basis representation, any element of the field can be expressed as a binary polynomial of degree at most $m - 1$, and the multiplication of field elements is performed modulo an irreducible generating polynomial of degree $m$. Polynomial basis multiplication requires a polynomial multiplication followed by a modular reduction. In practice, these two steps can be combined. Mastrovito [19] proposed a multiplication method in which a product matrix is introduced to combine the two steps. Bit-parallel combinational Mastrovito multipliers have been studied for several irreducible polynomials.
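The two-step view of polynomial basis multiplication (polynomial product followed by modular reduction) can be illustrated with a minimal software sketch. It is only a reference model under simple assumptions: field elements are stored as Python integers whose bit $i$ is the coefficient of $x^i$, and the helper names `clmul`, `poly_mod` and `gf_mul` are hypothetical names chosen for this illustration, not part of the paper.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less product of two binary polynomials stored as integers."""
    result = 0
    while b:
        if b & 1:
            result ^= a  # add (XOR) the current shifted copy of a
        a <<= 1
        b >>= 1
    return result


def poly_mod(value: int, f: int) -> int:
    """Remainder of `value` divided by `f`, both binary polynomials."""
    deg_f = f.bit_length() - 1
    while value.bit_length() - 1 >= deg_f:
        # Cancel the leading term of value with an aligned copy of f.
        value ^= f << (value.bit_length() - 1 - deg_f)
    return value


def gf_mul(a: int, b: int, f: int) -> int:
    """Polynomial basis product in GF(2^m): multiply, then reduce modulo f."""
    return poly_mod(clmul(a, b), f)


AES_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1
print(hex(gf_mul(0x57, 0x83, AES_POLY)))  # prints 0xc1
```

The printed value matches the worked multiplication example of the AES specification, which uses the same pentanomial.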
Manuscript received June 02, 2010; revised August 18, 2010; accepted September 20, 2010. Date of publication November 18, 2010; date of current version April 27, 2011. This work was supported by the Spanish Government under Research Grant CICYT TIN2008-00508. This paper was recommended by Associate Editor C. H. Chang. The author is with the Department of Computer Architecture and Systems Engineering, Faculty of Physics, Complutense University, 28040 Madrid, Spain (e-mail: jluimana@dacya.ucm.es). Digital Object Identifier 10.1109/TCSI.2010.2089553

A number of different architectures for polynomial basis multiplication over $GF(2^m)$ have been reported, such as bit-parallel, bit-serial and digit-serial multipliers. Bit-parallel multipliers [2], [7], [21] have very high area complexity, but perform the multiplication in one clock cycle. Bit-serial multipliers [3], [5], [6], [12] are restricted in area but need many clock cycles. Digit-serial architectures [4], [22] are a tradeoff between speed and area, achieved by processing several operand coefficients (the digit size) at the same time. Systolic [16], [20], semisystolic [3], [9], [23], [24] and pipelined [3], [20] designs have also been proposed by many researchers.

The choice of the irreducible polynomial can influence the complexity (area, delay) of the multiplier. Special irreducible polynomials such as trinomials or pentanomials can provide significant optimizations in terms of both area and speed. However, this solution leads to multipliers that perform calculations in a specific field (with fixed field degree $m$) determined by the particular polynomial used, and that cannot work in any other field with a different degree or a different irreducible polynomial. To solve this problem, versatile multipliers are designed for a specific field of a given degree, but can also perform multiplications in all underlying fields of equal or lower degree defined over any other irreducible polynomial. Versatile multipliers can be reused in systems employing more than one $GF(2^m)$ field, therefore saving extra hardware and offering increased flexibility.

In this paper, a new low-latency parallel-in/parallel-out sequential polynomial basis multiplier over $GF(2^m)$ is presented. For irreducible generating polynomials $f(x) = x^m + x^{k_t} + \cdots + x^{k_1} + 1$ with $m \geq 2k_t - 1$, the proposed multiplier has a theoretical latency of $2k_t + 1$ cycles. This latency is the lowest one found in the literature for $GF(2^m)$ multipliers. Furthermore, the proposed multiplier is partially versatile in the sense that the datapath can be used for finite fields of lower degree, achieving the low latency if the constraint $m \geq 2k_t - 1$ is fulfilled.
The paper is organized as follows. Notation and mathematical background are given in Section II. The new low-latency multiplier is presented in Section III, where the conditions for the low latency are deduced, the architecture of the multiplier is presented, and complexity analysis and comparisons with other multipliers are given. Finally, Section IV concludes the paper.

II. NOTATION AND PRELIMINARIES

The new multiplier architecture presented in this paper is based on the transpositional method given in [7]. In this section, we review this method and give two examples that will be used to introduce the new multiplier. Let $f(x)$ be a monic irreducible polynomial of degree $m$, and consider a root of $f(x)$ over

1549-8328/$26.00 © 2010 IEEE


. Then, the set formed by the first $m$ powers of this root is referred to as the polynomial basis. In $GF(2^m)$, the elements are polynomials of degree at most $m - 1$ over $GF(2)$, and arithmetic is carried out modulo $f(x)$. An element can be represented in the polynomial basis as a sum of basis elements weighted by coefficients in $GF(2)$, which are its coordinates, and these coordinates can be collected in a coordinate vector. Consider two field elements and their product, together with their coordinate vectors in the polynomial basis. Using the transpositional multiplication method [7], [8], the coordinate vector of the product can be obtained as a matrix-vector product that involves the inverted coordinates of one of the operands. The coordinates of the product are given as sum-of-products over $GF(2)$. The product matrix (or Mastrovito matrix [19]) depends on the irreducible polynomial selected and on the coordinates of the other operand in the polynomial basis. The decomposition of the product matrix in a sum of matrices [7] determines groups of subexpressions that can be shared among the coordinates. The number and structure of the matrices obtained from this decomposition depend on the generating irreducible polynomial. For an irreducible polynomial, the product matrix can be decomposed in a sum of matrices whose number depends on the values of the exponents corresponding to the non-null coefficients of the selected polynomial. The decomposition of the product matrix in a sum of matrices is given by (1).
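To make the role of the product matrix more tangible, the following sketch builds a matrix in the textbook Mastrovito style and checks it against a known product. It assumes the common convention that column $j$ of the matrix holds the coordinates of the first operand multiplied by the $j$-th basis element and reduced modulo $f(x)$; this is not the decomposition (1) of the transpositional method, and the function names are hypothetical.

```python
def times_x(v: int, m: int, f: int) -> int:
    """Multiply a field element by x and reduce modulo f (degree m)."""
    v <<= 1
    if (v >> m) & 1:
        v ^= f
    return v


def product_matrix(a: int, m: int, f: int) -> list:
    """m x m binary matrix whose column j is the coordinate vector of a*x^j mod f."""
    columns = []
    v = a
    for _ in range(m):
        columns.append(v)
        v = times_x(v, m, f)
    # Entry [i][j] is coordinate i (coefficient of x^i) of column j.
    return [[(columns[j] >> i) & 1 for j in range(m)] for i in range(m)]


def mat_vec_gf2(matrix: list, b: int) -> int:
    """Matrix-vector product over GF(2); b and the result are bit-packed integers."""
    c = 0
    for i, row in enumerate(matrix):
        bit = 0
        for j, entry in enumerate(row):
            bit ^= entry & ((b >> j) & 1)
        c |= bit << i
    return c


M = product_matrix(0x57, 8, 0x11B)  # AES pentanomial x^8 + x^4 + x^3 + x + 1
print(hex(mat_vec_gf2(M, 0x83)))    # prints 0xc1
```

Each product coordinate is a sum-of-products of one matrix row with the second operand, which is the structure that the decomposition into shared subexpressions exploits.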

In (1), one of the matrices is common to any irreducible polynomial, so the subexpressions it provides can be used by different multipliers based on different polynomials. This matrix is given in (2), where its entries are coordinates of one of the operands in the polynomial basis [7].

(2) [matrix display not recoverable from the source text]

From (1), we have that for any polynomial there exists at least one matrix for each non-null coefficient, as given in (3). One column of this matrix (with columns numbered from 1 to $m$, from right to left) is null, and its position corresponds with the non-null coefficient of the pentanomial. The null column divides the matrix into two submatrices. The first one is made up of the columns located to the left of the null column and the second one is made up of the columns to the right of it, so the matrix can be represented as the left submatrix, the null column and the right submatrix placed side by side. The submatrices are constructed by the consecutive shift of their less significant columns one row down with zero insertion. The most significant column is made up of zeros in the upper rows and of non-null rows with coefficients, while the less significant column is made up of null rows and of rows with coefficients.

(3) [matrix display not recoverable from the source text]

The existence of the remaining matrices in (1) depends on the position of the null column. In general, it is verified that if a column presents coefficients in its lower rows, then a further matrix exists, whose two parts are constructed as follows. In the first case, one part consists of the first columns of the original matrix, while the remaining columns become the first columns of the other part, completed with null columns (in this case, there will be another matrix); in the second case, the first columns are the non-null columns completed with null columns, while the other part has all its columns null (in this case, no more matrices exist). The following matrices are constructed in the same manner: the null column and one submatrix have all their elements null, whereas the other submatrix is the previous submatrix with additional null columns on the left (if necessary).

Using (1), the product coordinates are computed by adding the vectors obtained from each matrix of the decomposition. Some of these vectors always appear in the summation, independently of the values of the selected polynomial. Moreover, the vector obtained from the matrix in (2) is common to any irreducible polynomial. The matrix defined in (2) consists of columns formed by successive rotations, while the remaining matrices are formed by successive shifts. The components of the polynomial-dependent vectors consist of sum-of-products that also appear in the common vector, so common subexpressions grouped in sum-of-products can be determined and extracted from it. A new notation for the computation of these vectors was presented in [7]. The vectors consist of sum-of-products given by inner products of the operand coordinates. An inner product can be represented by the permutation given by the subscripts of the coordinates of the two operands in the sum of products. From this permutation, 1-cycles $(i)$ and 2-cycles $(i, j)$ can be found and associated with the terms $a_i b_i$ and $a_i b_j + a_j b_i$, respectively. The 2-cycles are known in group theory as transpositions. For example, the sum of products $a_0 b_6 + a_1 b_5 + a_2 b_4 + a_3 b_3 + a_4 b_2 + a_5 b_1 + a_6 b_0$ can be represented by the cycles (0,6)(1,5)(2,4)(3).

TABLE I. FUNCTIONS $S_C$, $E_C$ AND $O_C$ FOR THE COMPUTATION OF THE 1-CYCLES AND 2-CYCLES

The functions $S_C$, $E_C$ and $O_C$ for the computation of the (1,2)-cycles are given in Table I. The function $S_C$ (for even and odd index) and the functions $E_C$ and $O_C$ correspond with functions given in [7]. These functions determine the grouping of subexpressions that is characteristic of the transpositional method. The grouping is performed by defining the functions $S$, $E$ and $O$, which carry out the sum of the terms represented by the cycles given by $S_C$, $E_C$ and $O_C$, respectively. The common vector is made from the sum of $S$, $E$ and $O$ terms. Moreover, the subexpressions given by $E$ and $O$ constitute the components of all the polynomial-dependent vectors. The common vector is the same for any polynomial, but the composition of the polynomial-dependent vectors depends on the selected polynomial. Using $S$, $E$ and $O$, the components of the common vector are given by (4), with separate cases for odd and even index.

The polynomial-dependent vectors consist exclusively of sums of $E$ and $O$ functions that are among the components of the common vector. The expressions of their components for even and odd values of the parameters are given in (5) and (6), where a dedicated symbol indicates that, for odd values, the $E$ and $O$ functions must be exchanged. When the two parameters have different parity, the expressions are given by (5); when both are odd or both are even, we have (6).

The recursive computation of the components of the remaining vectors is given by (7), where the computation starts with the knowledge of the first vector. The recursion stops when a zero vector is obtained. The components of the last non-zero vector are given by (8). Using (1) and (4)-(8), the coordinates of the product can be determined.

A. Multiplication Over $GF(2^8)$ for Two Pentanomials

The multiplications over $GF(2^8)$ for two generating pentanomials are considered in this subsection. Their analysis and comparison will lead in Section III to the presentation of a new low-latency multiplier. Consider the product of two elements from $GF(2^8)$ generated by the irreducible pentanomial $f(x) = x^8 + x^4 + x^3 + x + 1$. This pentanomial is especially important because it is used in the Advanced Encryption Standard (AES). The substitution of the parameter values obtained for this pentanomial in (1) gives the decomposition (9).

(9)

The decomposition is given in (10) using (9), where the elements of the matrices represent the coordinates of one of the operands, a dedicated symbol represents null elements, and the matrices are constructed as previously described.
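The cycle notation can be made concrete with a short sketch. It assumes the association described for the transpositional method, namely that a 1-cycle $(i)$ stands for $a_i b_i$ and a 2-cycle $(i, j)$ stands for $a_i b_j + a_j b_i$; the function names and the list-of-bits representation of the coordinates are hypothetical choices for this illustration.

```python
def cycles_for_sum(n: int) -> list:
    """Cycles representing the sum of all products a_i * b_j with i + j = n."""
    cycles = [(i, n - i) for i in range((n + 1) // 2)]  # 2-cycles (transpositions)
    if n % 2 == 0:
        cycles.append((n // 2,))                        # single 1-cycle
    return cycles


def eval_cycles(cycles: list, a: list, b: list) -> int:
    """Evaluate a product of cycles over GF(2) for coordinate lists a and b."""
    total = 0
    for cycle in cycles:
        if len(cycle) == 1:
            (i,) = cycle
            total ^= a[i] & b[i]                        # 1-cycle: a_i b_i
        else:
            i, j = cycle
            total ^= (a[i] & b[j]) ^ (a[j] & b[i])      # 2-cycle: a_i b_j + a_j b_i
    return total


def direct_sum(n: int, a: list, b: list) -> int:
    """Reference value: XOR of a_i * b_(n-i) for i = 0..n."""
    total = 0
    for i in range(n + 1):
        total ^= a[i] & b[n - i]
    return total


print(cycles_for_sum(6))  # prints [(0, 6), (1, 5), (2, 4), (3,)]
```

For the example sum of products with subscripts adding up to 6, the generated cycles coincide with (0,6)(1,5)(2,4)(3), and `eval_cycles` returns the same bit as `direct_sum` for any pair of coordinate vectors.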


TABLE II. CYCLES AND SUMS-OF-PRODUCTS FOR THE $S$, $E$ AND $O$ FUNCTIONS FOR $GF(2^8)$

TABLE III. COORDINATES OF THE PRODUCT FOR THE PENTANOMIALS $f(x) = x^8 + x^4 + x^3 + x + 1$ AND $f(x) = x^8 + x^5 + x^3 + x + 1$

(10)

Using Table I, the cycles $S_C$, $E_C$ and $O_C$ are given in the second column of Table II, together with the sums of terms that these cycles represent, that is, the functions $S$, $E$ and $O$, respectively. The product coordinates for this example are given in the left side of Table III, where each coordinate is the sum of the terms present in its row. The column headings of Table III stand for the vectors of the decomposition, and the corresponding columns give their components. Furthermore, the columns for the polynomial-dependent terms are distributed in three groups, one for each non-null intermediate power of the irreducible pentanomials. Table III is divided into three blocks of columns. The first block represents the vector that is common to any irreducible polynomial over $GF(2^8)$. The second block includes the terms corresponding to the irreducible pentanomial $f(x) = x^8 + x^4 + x^3 + x + 1$, and the third block corresponds to the irreducible pentanomial $f(x) = x^8 + x^5 + x^3 + x + 1$.

The sharing property can be noted in Table III, where each group of terms can be found in four of the product coordinates. From that, it can be observed in Table III that the dark-shadowed terms in the common block and at the upper left of the pentanomial block can be found in the three groups of columns. These terms are light-shadowed at the bottom of the corresponding pentanomial block. Furthermore, it can be observed that these terms start at rows determined by the values of the non-null powers of the pentanomial.

Similarly, the product of two elements in $GF(2^8)$ generated by the irreducible pentanomial $f(x) = x^8 + x^5 + x^3 + x + 1$ can be computed. The decomposition for this polynomial is given by (11), where several of the matrices match those given in (10). The decomposition in (11) is given as follows:


TABLE IV. COORDINATES $d$ OF THE PRODUCT FOR THE PENTANOMIAL $f(x) = x^8 + x^4 + x^3 + x + 1$. FIRST STEP OF MULTIPLICATION

(12)

The product coordinates for this pentanomial are given by the left and right blocks in Table III, where it can be observed that each group of terms is again found in several of the product coordinates. In this case, the dark-shadowed terms in the common block and at the upper left of the pentanomial block can be found in Table III in the three groups of columns. These terms are light-shadowed at the bottom of the corresponding pentanomial block. Furthermore, these terms start at rows determined by the values of the non-null powers of the pentanomial. Based on the above observations, a new low-latency polynomial basis multiplier for $GF(2^m)$ is introduced in Section III.

III. A NEW LOW-LATENCY MULTIPLIER

From the examples given in Section II, it can be observed that the product of two field elements in $GF(2^m)$ can be computed in three steps. The first one is the computation of the core dark-shadowed additions (in Table III). The second step is the shift of these previously computed additions through the coordinates of the product, where the shift is stopped at the coordinates that correspond with the non-null powers in the irreducible polynomial $f(x)$. The final step involves the addition of the corresponding remaining terms with the previously computed additions.

A. Pentanomial $f(x) = x^8 + x^4 + x^3 + x + 1$

It can be noted that the columns given for this pentanomial in Table III are shifts of the second column of the common block. Furthermore, the number and stopping coordinates of these shifts are given by the exponents of the non-null powers and by their differences with $m$ for any general irreducible polynomial. We can see this fact in Table IV, where the columns represent the coordinates of the product, and the numbers in the rows represent the terms. A dedicated symbol represents a null term. The first row corresponds with the terms of the common vector. The shadowed rows in the table represent the addition of the two rows above them. For this pentanomial, the non-null intermediate powers are $x^4$, $x^3$ and $x$, and therefore the differences between $m = 8$ and these exponents are 4, 5 and 7. The value 4 implies a 4-position left-shift of the first row, represented in Table IV by its own row. Similarly, the values 5 and 7 lead to a 5-left-shift and a 7-left-shift of the first row (also represented by their own rows in Table IV). It can be observed that the 7-left-shift does not require any row addition, therefore simplifying the computations. Furthermore, the positions where the row shifts are stopped match the expected coordinates. The above computations finish the first step of the multiplication, represented by the bottom dark-shadowed row in Table IV. The values given by this first step correspond with the core upper-left dark-shadowed additions (in Table III).

Once the core additions are computed, they must be right-shifted through the product coordinates, corresponding with the light-shadowed terms at the bottom of Table III. These shifts are represented in Table V. The number of right-shifts is given by the exponents of the non-null powers, in this case 1, 3 and 4, which also determine the positions where the row shifts are stopped. After each shift, an addition is performed and represented in shadowed rows. Finally, the bottom dark-shadowed row in Table V finishes the second step of the multiplication. The final step corresponds with the addition of the above coordinates with the corresponding terms given in the first column of the common block in Table III.

TABLE V. COORDINATES $d$ OF THE PRODUCT FOR THE PENTANOMIAL $f(x) = x^8 + x^4 + x^3 + x + 1$. SECOND STEP OF MULTIPLICATION

B. Pentanomial $f(x) = x^8 + x^5 + x^3 + x + 1$

As in the previous pentanomial, the columns for this pentanomial in Table III are shifts of the second column of the common block, and the number and stopping coordinates of these shifts are given by the same parameters. The first step for multiplication with this polynomial is computed in Table VI, given by the bottom dark-shadowed row that corresponds with the core upper-left dark-shadowed additions for this pentanomial in Table III. In this case, the non-null intermediate powers are $x^5$, $x^3$ and $x$, and therefore the differences with $m = 8$ are 3, 5 and 7. As in the previous pentanomial, the value 3 implies a 3-position left-shift of the first row, represented in Table VI by its own row. However, in this table, there is a second row of the same kind. This is because, apart from the previous 3-left-shift of the first row, another 3-left-shift can be done, represented with the second such row in Table VI. This second shift corresponds with the additional term appearing at the top of the corresponding column. This term causes the additional columns to appear in Table VI, which in turn correspond with the additional matrices in (11). The values 5 and 7 imply a 5-left-shift and a 7-left-shift of the first row (each with its own row in Table VI, the 7-left-shift requiring no addition). For the power with two shifts, there are two stopping positions, each given by its own expression. The above computations finish the first step of the multiplication, represented by the bottom dark-shadowed row in Table VI. The values given by this first step correspond with the core upper-left dark-shadowed additions (in Table III) for this pentanomial.

TABLE VI. COORDINATES $d$ OF THE PRODUCT FOR THE PENTANOMIAL $f(x) = x^8 + x^5 + x^3 + x + 1$. FIRST STEP OF MULTIPLICATION

Once the core additions are computed, they must be right-shifted through the product coordinates, corresponding with the light-shadowed terms at the bottom of Table III. These shifts are represented in Table VII. The number of right-shifts is given by the exponents of the non-null powers, in this case 1, 3 and 5, which also determine the positions where the row shifts are stopped. After each shift, an addition is performed and represented in shadowed rows. Finally, the bottom dark-shadowed row in Table VII finishes the second step of the multiplication. The final step corresponds with the addition of the above coordinates with the corresponding terms given in the first column of the common block in Table III.

TABLE VII. COORDINATES $d$ OF THE PRODUCT FOR THE PENTANOMIAL $f(x) = x^8 + x^5 + x^3 + x + 1$. SECOND STEP OF MULTIPLICATION

C. Heptanomial

The proposed multiplication method is valid for any general irreducible polynomial. In order to illustrate the method, an irreducible heptanomial is briefly considered. The first step for multiplication is computed in Table VIII, given by the bottom dark-shadowed row. In this table, the columns represent the coordinates of the product, and the hexadecimal numbers in the rows represent the terms. A dedicated symbol represents a null term. The first row corresponds with the terms given by the common matrix. As in the previous examples, the first non-null power implies a 6-position left-shift of the first row, which is represented in Table VIII by its own row. Similarly, the remaining non-null powers imply a 7-left-shift, 8-left-shift, 9-left-shift and 10-left-shift of the first row (each with its own row in Table VIII). The positions where the row shifts are stopped correspond with the same general expression used in the previous examples. The above computations finish the first step of the multiplication, represented by the bottom dark-shadowed row in Table VIII.

Once the core additions are computed, they must be right-shifted through the product coordinates. These shifts are represented in Table IX.
The number of right-shifts is given by the exponents of the non-null powers, which also determine the positions where the row shifts are stopped. After each shift, an addition is performed and represented in shadowed rows. Finally, the bottom dark-shadowed row in Table IX finishes the second step of the multiplication. The final step corresponds with the addition of the above coordinates with the corresponding terms, which can be computed using Table I.

TABLE VIII. COORDINATES $d$ OF THE PRODUCT FOR THE HEPTANOMIAL $f(x)$. FIRST STEP OF MULTIPLICATION

TABLE IX. COORDINATES $d$ OF THE PRODUCT FOR THE HEPTANOMIAL $f(x)$. SECOND STEP OF MULTIPLICATION

D. Conditions for the Low-Latency Multiplier

From the above examples, it can be noted that the computation of the product over $GF(2^m)$ with an irreducible polynomial can be mainly made with shifts and parallel additions. Moreover, the number of shifts needed for the computation of the core additions depends on the exponents of the non-null powers, as shown in the previous examples: for the first pentanomial, $x^8 + x^4 + x^3 + x + 1$, only one 4-left-shift is needed for $x^4$ and one 5-left-shift is needed for $x^3$, as given in Table IV; however, for the second pentanomial, $x^8 + x^5 + x^3 + x + 1$, two 3-left-shifts are needed for $x^5$ and one 5-left-shift is needed for $x^3$, as given in Table VI. The additional 3-left-shift in the second pentanomial is a consequence of the existence of the coordinate at the bottom-right of the matrix in (12). This coordinate corresponds with the additional term appearing at the top of the corresponding column in Table III.

From the construction of the matrices given in Section II, we can find that if a column presents coefficients in its lower rows and a further condition on the exponent holds, then the following matrix has non-null elements. This fact can be checked for the example pentanomials: for $x^8 + x^5 + x^3 + x + 1$, a coordinate appears at the bottom-right of the matrix in (12), that is, the matrix is not null, and this coordinate corresponds with the additional term in Table III; for the AES pentanomial $x^8 + x^4 + x^3 + x + 1$, the matrix has only null elements. Therefore, when the complementary condition holds, the computation can be simplified because only one left-shift will be needed for each non-null power. For the maximum non-null power $x^{k_t}$, the condition can be rewritten as $m \geq 2k_t - 1$. Furthermore, if the maximum non-null power satisfies this condition, then we can assure that the remaining powers also satisfy it. The condition $m \geq 2k_t - 1$ for the simplification of the multiplication is particularly important because the five binary irreducible polynomials recommended by NIST satisfy it, i.e., $x^{163} + x^7 + x^6 + x^3 + 1$, $x^{233} + x^{74} + 1$, $x^{283} + x^{12} + x^7 + x^5 + 1$, $x^{409} + x^{87} + 1$, and $x^{571} + x^{10} + x^5 + x^2 + 1$.

E. Architecture of the Low-Latency Multiplier

Based on the previous points and on the example given in Tables IV and V, the $GF(2^m)$ multiplication using a general irreducible polynomial with $m \geq 2k_t - 1$ can be computed with the following


Fig. 1. Architecture of the low-latency $GF(2^m)$ multiplier for $m \geq 2k_t - 1$.
algorithm:

1) Compute the functions of Table I for all the required indices. One set of functions is assigned to the coordinates of the product, and the remaining coordinate is initially assigned a null value.
2) For each non-null power of $f(x)$, in descending order, perform the corresponding additions on the coordinates. The coordinates computed in this step are kept as new coordinates.
3) Perform the assignment that completes the core additions.
4) For each non-null power of $f(x)$, in ascending order, perform the corresponding additions on the coordinates.
5) Add the resulting coordinates with the remaining functions, respectively.

It can be noted that the second step of the algorithm corresponds with the computation of the core additions (upper-left dark-shadowed additions in Table III), while the fourth step corresponds with the shift of the core additions through the coordinates of the product (light-shadowed terms at the bottom of Table III). The last step of the algorithm corresponds with the addition of the corresponding terms with the previously computed coordinates.

The previous algorithm can be implemented with the datapath given in Fig. 1, which contains three $m$-bit registers, one longer shift-register, two permutation blocks (re-wiring) and a block that computes the $S$, $E$ and $O$ functions. The operands are initially loaded into two of the registers, and the product will be finally stored into the third register.

From Tables I and II, it can be observed that the functions are sum-of-products implemented as inner products of the operands. Furthermore, the same hardware can be used for the implementation of both kinds of terms; the only difference is that one kind is computed with the operand vectors as stored, while the other kind is computed with the same vectors in reverse order. In Fig. 1, the function block is used for the computation of the first kind of terms using the contents of the operand registers; the same block is also used for computing the other kind of terms when the contents of the operand registers are inverted using the permutation blocks (Fig. 2). The operands for the function block are selected using multiplexers with control inputs generated by an additional control unit, which also controls the remaining functional units of the datapath.

Fig. 2. Permutation block.

There are two key points in the previous algorithm for the construction of a low-latency sequential architecture, whose datapath is given in Fig. 1. The first one is in the second step, where for each non-null power of $f(x)$ the coordinates are added with the corresponding shifted terms. These additions can be done with the $m$-bit register used to accumulate the coordinates and with the shift-register, which stores (in its lower bits) the functions initially computed with the function block, in such a way that the accumulation register is XOR-ed (in the XOR block) with the upper bits of the shift-register (initialized to 0). Then, starting with the lowest non-null power of $f(x)$ and continuing in ascending order, the shift-register is right-shifted (stopping at the coordinate given by the current power) and added (XOR-ed) with the accumulation register. This addition is also stored in the accumulation register, which finally contains the core additions (third step in the algorithm). It is important to note the total number of right-shifts needed to perform the above additions. These operations can be identified in Table IV, where the first row corresponds with the initial content of the accumulation register and the shifted rows correspond with the upper bits of the shift-register after the shifts. The partial additions (contents of the accumulation register) are represented with light-shadowed rows and the final result is given by the dark-shadowed row. For the example in Table IV, one of the non-null powers implies no action (shift), corresponding with the null symbols in its row in Table IV. An important observation is that the left-shifts for each power in Table IV correspond with right-shifts of the shift-register. This is a key point in order to reduce the latency of the multiplier.
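The overall flow (core additions on the upper half of the unreduced product, then shifting the result through the product coordinates, then adding the lower half) can be checked with a functional model. This is only a sketch under stated assumptions: the exact index ranges of the algorithm are not reproduced here, the mapping of the function sets to the lower and upper halves of the carry-less product is an interpretation, the model is not cycle-accurate, and the names `clmul` and `two_round_mul` are hypothetical. The model relies on the condition $m \geq 2k_t - 1$, under which a single left-shift per non-null power suffices.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less product of two binary polynomials stored as integers."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def two_round_mul(a: int, b: int, m: int, ks: list) -> int:
    """Product in GF(2^m) for f(x) = x^m + sum(x^k for k in ks) + 1.

    Assumes m >= 2*max(ks) - 1, so that two shift-and-XOR rounds are enough.
    """
    kt = max(ks)
    if m < 2 * kt - 1:
        raise ValueError("model requires m >= 2*k_t - 1")
    mask = (1 << m) - 1
    d = clmul(a, b)
    low = d & mask      # terms of degree 0 .. m-1
    high = d >> m       # terms of degree m .. 2m-2
    # Core additions: fold in the part of high that overflows again.
    core = high
    for k in sorted(ks, reverse=True):
        core ^= high >> (m - k)
    # Shift the core additions through the product coordinates.
    folded = core       # contribution of the constant term of f(x)
    for k in sorted(ks):
        folded ^= (core << k) & mask
    # Final addition with the lower half of the product.
    return low ^ folded


print(hex(two_round_mul(0x57, 0x83, 8, [4, 3, 1])))  # prints 0xc1
```

For the pentanomial $x^8 + x^5 + x^3 + x + 1$ the condition does not hold ($8 < 9$), so the model raises an error instead of returning a possibly incomplete reduction; this mirrors the extra left-shift that this polynomial requires in Table VI.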
The second key point is the fourth step in the algorithm, where for each non-null power of $f(x)$ the results of the second step of the algorithm (contents of the accumulation register) are added with the previously computed shifted values. These operations can be done using the shift-register again, as follows. The content of the accumulation register (result of the second step of the algorithm) is copied to the upper bits of the shift-register, while its remaining bits are initialized to 0. Then, starting with the lowest non-null power of $f(x)$ and continuing in ascending order, the shift-register is right-shifted (stopping at the coordinate given by the current power) and added (XOR-ed) in the XOR block with the accumulation register. This addition is also stored in the accumulation register. The total number of right-shifts needed to perform these additions must also be noted. These operations can be identified in Table V, where the first row corresponds with the content of the accumulation register computed in the second step of the algorithm, and the shifted rows correspond with the upper bits of the shift-register after the shifts. The partial additions (contents of the accumulation register) are represented with light-shadowed rows and the final result is given by the bottom dark-shadowed row.

The fifth step in the algorithm must add the coordinates previously computed and stored in the accumulation register with the remaining functions, respectively. This step can be performed by first computing these terms with the function block and then loading them in the upper bits of the shift-register. The final addition of the accumulation register and the upper bits of the shift-register, with the result being loaded again in the accumulation register, will finish the product computation.

The control signals (shift and load of registers, control of multiplexers, etc.) of the datapath given in Fig. 1 are provided by a control unit, which generates the signals depending on the non-null powers of the irreducible polynomial $f(x)$. Furthermore, the versatility of the proposed multiplier for finite fields of lower degree can be obtained by the control unit by filling with zeros the lower bits of the shift-register in the second step of the algorithm. The following bits of the shift-register will contain the functions computed by the function block. In this way, the correct additions are performed when the shift-register is right-shifted and XOR-ed with the accumulation register in order to finally compute the core additions. The upper bits of the registers will be filled with zeros for the lower-degree field.

F. Complexity Analysis and Comparisons

The function block implements the sum-of-products needed for the realization of the $S$, $E$ and $O$ terms. Therefore, the complexity of the block is given by the maximum complexity of the functions given in Table I. The theoretical complexities of these functions are the following [7]. The $O$ terms (odd index) are the sum of one 1-cycle term and several 2-cycle terms. Therefore, they need 2-input AND gates and a binary tree of 2-input XOR gates. The delay of these terms is determined by the depth of the XOR binary tree and is expressed in terms of the delays of 2-input AND and XOR gates.


TABLE X HARDWARE RESOURCES AND MULTIPLICATION TIME COMPARISONS

The terms (even ) are the sum of terms , so they will need AND gates and a binary tree of XOR gates with depth . Therefore, the delay of the terms will be . The term (whether even or odd) will have AND gates, XOR gates and a delay . The space complexity of the block will be computed by the addition of the AND and XOR complexities of the terms (for odd ), the terms (for even ) and the term. Therefore, the block will be implemented with AND gates, XOR gates, and with a maximum delay (corresponding to the term).

The multiplier proposed in Fig. 1 also presents three -bit registers and one -bit shift-register , therefore totaling 1-bit latches. It must be noted that the -bit shift-register presents an increased internal complexity in comparison with the -bit registers . Furthermore, two-to-one MUXs and additional XOR gates ( block) are needed for the implementation. Permutation blocks do not contribute any complexity, because they only perform re-wiring. From the architecture in Fig. 1, it can be observed that the critical path delay is , where is the delay of a two-to-one MUX.

However, the most important result of the proposed multiplier architecture corresponds to the number of clock cycles needed for the complete multiplication of two numbers in with a generating irreducible polynomial . One clock cycle is needed for the first step of the algorithm given in Section III-E, cycles for the right-shifts and additions needed in the second step of the algorithm, cycles for the right-shifts and additions in the fourth step, and an additional cycle for the fifth step of the algorithm. Therefore, the latency of the multiplier will be .
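The gate counts and delays discussed above for the sum-of-products terms follow the usual cost model for XOR-ing several 2-input AND products with a balanced binary tree. The following Python sketch is illustrative only: the function name `sum_of_products_cost` and the sample delay values are assumptions introduced here, and it does not reproduce the exact term counts of Table I.

```python
def sum_of_products_cost(n_products, t_and, t_xor):
    """Cost of XOR-ing n_products two-input AND terms with a balanced tree.

    Returns (and_gates, xor_gates, tree_depth, delay).
    t_and and t_xor are gate delays in any consistent unit (hypothetical inputs).
    """
    if n_products < 1:
        raise ValueError("at least one product term is required")
    and_gates = n_products                  # one 2-input AND per product
    xor_gates = n_products - 1              # a tree with n leaves has n-1 nodes
    # ceil(log2(n)) computed with integers to avoid floating-point rounding
    tree_depth = (n_products - 1).bit_length()
    delay = t_and + tree_depth * t_xor      # AND level followed by the XOR tree
    return and_gates, xor_gates, tree_depth, delay


if __name__ == "__main__":
    # Illustrative delays only (6 and 12 time units), not a claim about any design.
    for n in (1, 5, 8):
        print(n, sum_of_products_cost(n, t_and=6, t_xor=12))
```

The depth grows only logarithmically with the number of products, which is why the delay of the block is set by its largest term rather than by the total number of gates.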

In Table X, the theoretical complexity of our proposed multiplier is compared with several multipliers found in the literature. In this table, the number of 2-input AND and XOR gates, the number of 2:1 MUXs and 1-bit latches, critical path delay and latency are given. Furthermore, , , and represent the delay of an inverter, 2-input OR gate and 2-input NAND gate, respectively. In [6], a nonversatile least-significant bit-serial (LSB) multiplier was given. Versatile bit-serial multipliers were also given in [5], [12], and [3] (Montgomery multiplier). In Table X, the MUX column for [12] corresponds to the number of DMUXs used in the design, where additional OR gates must be added to its hardware complexity. A nonversatile bit-level-pipelined systolic multiplier for irreducible trinomials was also given in [20]. Nonversatile semisystolic designs were given in [23] and [24], while versatile semisystolic multipliers were presented in [9] and [3] (pipelined-semisystolic Montgomery). A nonversatile systolic multiplier/squarer was given in [13] and a bit-parallel systolic multiplier for irreducible trinomials was presented in [16]. A scalable and systolic Montgomery multiplier for irreducible trinomials was given in [15], where a bit-parallel systolic multiplier for trinomials was also given [15]. A bit-parallel systolic multiplier for trinomials was also presented in [14]. In [3], a versatile partially pipelined Montgomery multiplier was given (with pipeline stages). Digit-level systolic (with digit-size ) and super-systolic multiplier designs for irreducible trinomials were presented in [20] and [20], respectively. For [20] in Table X, the AND column corresponds to the number of NAND gates used in the design. A nonversatile least-significant digit-serial/parallel (LSD) multiplier (with digit-size ) was given in [22], and a nonversatile pipelined (with pipeline stages) digit-serial systolic multiplier was presented in [4].
TABLE XI HARDWARE RESOURCES AND MULTIPLICATION TIME COMPARISONS FOR NIST GF (2

Our partially versatile design is represented as Prop in Table X. In order to highlight the differences between the various multiplier approaches, specific results for the field generated by the NIST irreducible polynomial are given in Table XI, where , and the digit-size (for digit-level multipliers) has been selected to . It must be noted that some of the multipliers given in Table XI are optimized for irreducible trinomials. From the results, it can be observed that our proposed multiplier presents the lowest latency among all the considered multipliers. Bit-serial versatile multipliers ([3], [5], [12]) present reduced space complexity but need more clock cycles to finish one multiplication. Even [3] needs a considerable amount of 1-bit latches for performing multiplication with a very high latency. Versatile semisystolic multipliers ([3], [9]) present higher space complexities and latencies than our proposed multiplier, while the versatile partially-pipelined multiplier given in [3] has reduced space complexity but with a high latency. Nonversatile multipliers present in general more optimized designs than their versatile counterparts. Although the comparison with nonversatile multipliers cannot be considered fair, our proposed multiplier has very good space complexity (always with the lowest latency) in comparison with nonversatile designs. Moreover, from the comparison with optimized digit-serial multipliers ([20], [20], [4], [22]), it can be observed that our multiplier presents very interesting results. It must also be noted that the selection of the irreducible polynomial is fundamental in order to obtain the low latency. In a worst case scenario, such as for , the latency of our multiplier would be comparable to other designs. Therefore, irreducible polynomials with the lowest possible values of should be chosen. Table XII compares the area (transistor count) and delay estimate for the NIST multipliers given in Table XI. In order to do that, some STMicroelectronics real circuits are used to compare time complexities.
They are high-speed CMOS gates fabricated by silicon gate C²MOS technology, with balanced propagation delays, low power dissipation, and high speed. The typical propagation delay has been used to ensure a fair comparison. The circuits used have been M74HC08 (AND gate, 6 ns), M74HC86 (XOR gate, 12 ns), M74HC257 (MUX, 11 ns), M74HC32 (OR gate, 8 ns), M74HC04 (INV gate, 8 ns) and M74HC00 (NAND gate, 6 ns). For the estimate of the number of CMOS transistors used in the designs, the traditional counts have been used: six transistors for a 2-input AND gate, six for a 2-input XOR gate, six for a 2:1 MUX, eight for an SR-latch, six for a 2-input OR gate, and four transistors for a 2-input NAND gate. In Table XII, the multiplication time (given in nanoseconds) is computed as the product of the critical path by the latency, and the Area × Delay column is given in transistors × milliseconds. It must be noted that some of the results given in Table XII correspond to multipliers optimized for irreducible trinomials, not for general irreducible polynomials. Although the comparison of different multiplier architectures cannot be fair, it can be observed that our proposed multiplier has a moderate area consumption in comparison with other designs. The time delay of our multiplier is the third lowest among all the designs. Only the multipliers in [20] and [20] have lower time delay, but these designs are systolic and optimized for trinomials. With respect to the area × delay metric, our multiplier presents a very interesting result.

A comment could be made with respect to the theoretical throughput of our multiplier. The number of clock cycles between two consecutive multiplication results is , and the theoretical throughput is critical path latency . This value may not be a good result in comparison with some of the systolic or semisystolic designs (some of them achieving one result in every clock cycle); however, the comparison of systolic or semisystolic designs with nonsystolic or bit-serial approaches may not be fair and may not lead to valuable conclusions.
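The area and multiplication-time estimates described above can be sketched as follows. This is a minimal illustration under stated assumptions: the transistor counts per cell are the ones quoted in the text, while the function names and the sample gate counts, critical path and latency are hypothetical and do not correspond to any row of Table XII. Inverters are left out of the area sum because the text gives no transistor count for them.

```python
# Transistor counts per cell, as quoted in the text.
TRANSISTORS = {
    "and2": 6,
    "xor2": 6,
    "mux2": 6,
    "latch": 8,
    "or2": 6,
    "nand2": 4,
}


def area_transistors(gate_counts):
    """Total transistor count for a dict mapping cell name to number of cells."""
    return sum(TRANSISTORS[cell] * count for cell, count in gate_counts.items())


def area_delay_estimate(gate_counts, critical_path_ns, latency_cycles):
    """Return (area, multiplication time in ns, area x delay in transistors x ms)."""
    area = area_transistors(gate_counts)
    # Multiplication time = critical path x latency, as stated in the text.
    time_ns = critical_path_ns * latency_cycles
    # 1 ns = 1e-6 ms, so the product is converted to transistors x milliseconds.
    return area, time_ns, area * time_ns * 1e-6


if __name__ == "__main__":
    # Hypothetical design, used only to exercise the arithmetic.
    design = {"and2": 100, "xor2": 120, "mux2": 10, "latch": 50}
    print(area_delay_estimate(design, critical_path_ns=29, latency_cycles=9))
```

Keeping the cell library in a single dictionary makes it easy to rerun the estimate with a different set of transistor counts or to add a cell that the text does not cover.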


TABLE XII AREA-DELAY ESTIMATE FOR NIST

GF (2

IV. CONCLUSION

A new low latency parallel-in/parallel-out sequential polynomial basis multiplier over $GF(2^m)$ is presented. For irreducible polynomials with , the proposed multiplier has a theoretical latency of cycles. This latency is the lowest one found in the literature for $GF(2^m)$ multipliers. The condition is especially important because the five binary irreducible polynomials recommended by NIST for elliptic curve cryptography (ECC) implementation verify this condition. Furthermore, the proposed multiplier is partially versatile in the sense that the datapath can be used for finite fields , with , achieving the low latency if the constraint is fulfilled. The architecture and complexity analysis of the proposed multiplier are given. The comparison with other multipliers is presented, and results for the specific field are also given. From the comparison with other multipliers, it is observed that our approach presents the lowest latency with a moderate hardware complexity.

ACKNOWLEDGMENT

The author would like to thank the referees for their valuable comments and some important corrections.

REFERENCES
[1] T.-C. Chen, S.-W. Wei, and H.-J. Tsai, "Arithmetic unit for finite field GF(2 )," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 55, no. 3, pp. 828–837, Apr. 2008.
[2] H. Fan and M. A. Hasan, "Fast bit parallel-shifted polynomial basis multipliers in GF(2 )," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 53, no. 12, pp. 2606–2615, Dec. 2006.
[3] A. P. Fournaris and O. Koufopavlou, "Versatile multiplier architectures in GF(2 ) fields using the Montgomery multiplication algorithm," Integr. VLSI J., vol. 41, no. 3, pp. 371–384, May 2008.
[4] J.-H. Guo and C.-L. Wang, "Digit-serial systolic multiplier for finite fields GF(2 )," Proc. Inst. Electr. Eng. Comput. Digit. Tech., vol. 145, no. 2, pp. 143–148, 1998.
[5] M. A. Hasan and M. Ebtedaei, "Efficient architectures for computations over variable dimensional Galois fields," IEEE Trans. Circuits Syst. I, Fundam. Theory Appl., vol. 45, no. 11, pp. 1205–1211, Nov. 1998.
[6] D. Hankerson, A. Menezes, and S. Vanstone, Guide to Elliptic Curve Cryptography. New York: Springer, 2004.
[7] J. L. Imaña, J. M. Sánchez, and F. Tirado, "Bit-parallel finite field multipliers for irreducible trinomials," IEEE Trans. Comput., vol. 55, no. 5, pp. 520–533, May 2006.
[8] J. L. Imaña, R. Hermida, and F. Tirado, "Low complexity bit-parallel multipliers based on a class of irreducible pentanomials," IEEE Trans. VLSI Syst., vol. 14, no. 12, pp. 1388–1393, Dec. 2006.
[9] S. K. Jain, L. Song, and K. Parhi, "Efficient semisystolic architectures for finite-field arithmetic," IEEE Trans. VLSI Syst., vol. 6, no. 1, pp. 101–113, Mar. 1998.
[10] K. Kobayashi and N. Takagi, "A combined circuit for multiplication and inversion in GF(2 )," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 55, no. 11, pp. 1144–1148, Nov. 2008.
[11] K. Kobayashi and N. Takagi, "Fast hardware algorithm for division in GF(2 ) based on the extended Euclid's algorithm with parallelization of modular reductions," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 56, no. 8, pp. 644–648, Aug. 2009.
[12] P. Kitsos, G. Theodoridis, and O. Koufopavlou, "An efficient reconfigurable multiplier architecture for Galois field GF(2 )," Microelectron. J., vol. 34, no. 10, pp. 975–980, 2003.
[13] H. S. Kim and K. Y. Yoo, "Area efficient exponentiation using modular multiplier/squarer in GF(2 )," in LNCS 2108. New York: Springer, 2001, pp. 262–267.
[14] C.-Y. Lee, Y.-H. Chen, C.-W. Chiou, and J.-M. Lin, "Unified parallel systolic multiplier over GF(2 )," J. Comput. Sci. Technol., vol. 22, no. 1, pp. 28–38, Jan. 2007.
[15] C.-Y. Lee, C.-W. Chiou, J.-M. Lin, and C.-C. Chang, "Scalable and systolic Montgomery multiplier over GF(2 ) generated by trinomials," IET Circuits, Devices, Syst., vol. 1, no. 6, pp. 477–484, Dec. 2007.
[16] C.-Y. Lee, "Low complexity bit-parallel systolic multiplier over GF(2 ) using irreducible trinomials," Proc. Inst. Electr. Eng. Comput. Digit. Tech., vol. 150, no. 1, pp. 39–42, Jan. 2003.
[17] J. Lin, J. Sha, Z. Wang, and L. Li, "An efficient VLSI architecture for nonbinary LDPC decoders," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 57, no. 1, pp. 51–55, Jan. 2010.
[18] J. Lin, J. Sha, Z. Wang, and L. Li, "Efficient decoder design for nonbinary quasi-cyclic LDPC codes," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 57, no. 5, pp. 1071–1082, May 2010.
[19] E. D. Mastrovito, "VLSI architectures for computation in Galois fields," Ph.D. dissertation, Dept. Electr. Eng., Linköping Univ., Linköping, Sweden, 1991.
[20] P. K. Meher, "Systolic and super-systolic multipliers for finite field GF(2 ) based on irreducible trinomials," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 55, no. 4, pp. 1031–1040, May 2008.
[21] A. Reyhani-Masoleh and M. A. Hasan, "Low complexity bit parallel architectures for polynomial basis multiplication over GF(2 )," IEEE Trans. Comput., vol. 53, no. 8, pp. 945–959, Aug. 2004.
[22] L. Song and K. Parhi, "Low-energy digit-serial/parallel finite field multipliers," J. VLSI Signal Process., vol. 19, no. 2, pp. 149–166, 1998.
[23] W. C. Tsai and S.-J. Wang, "Two systolic architectures for multiplication in GF(2 )," Proc. Inst. Electr. Eng. Comput. Digit. Tech., vol. 147, no. 6, pp. 375–382, 2000.
[24] C. L. Wang and J. L. Lin, "Systolic array implementation of multipliers for finite fields GF(2 )," IEEE Trans. Circuits Syst., vol. 38, no. 7, pp. 796–800, Jul. 1991.


José Luis Imaña received the M.Sc. and Ph.D. degrees in physics from Complutense University, Madrid, Spain, in 1989 and 2003, respectively. He was an Electronic Design Engineer at the Madrid Institute of Technology and a Professor at the Computer Science College, Segovia, Spain. He is currently with the Department of Computer Architecture and Systems Engineering at the Complutense University, where he is an Associate Professor. He has been the promoter and co-founder of the International Workshop on the Arithmetic of Finite Fields (WAIFI). His research interests include algorithms and VLSI architectures for computations in finite fields, cryptography, computer arithmetic, and reconfigurable computing architectures.
