
INDEX

CHAPTER – 1 INTRODUCTION
1.1 General
1.2 Literature Review
1.3 Proposed System
1.3.1 Advantages of the Proposed System
1.4 Thesis Outline
CHAPTER – 2 CRYPTOGRAPHY
2.1 Public-Key Cryptosystems
2.2 Information Security
2.3 Public-Key Cryptography
2.4 How It Works
2.5 RSA Algorithm
CHAPTER – 3 ADDER
3.1 Adder
3.1.1 Half Adder
3.1.2 Full Adder
3.2 Ripple Carry Adder/Subtractor
3.3 Carry Lookahead Adder
3.4 Carry Save Adder
CHAPTER – 4 VLSI TECHNOLOGY
4.1 Introduction to VLSI
4.2 Why VLSI
4.2.1 Structured Design
4.3 What is VLSI
4.3.1 History of Scale Integration
4.3.2 System Design
4.3.3 MOS and Related VLSI Technology
4.4 Design of VLSI
4.4.1 Challenges
4.5 VLSI and Systems
4.6 Applications of VLSI
4.7 ASIC
4.8 ASIC Design Flow
4.9 CMOS Technology
4.9.1 MOS Transistor
4.9.2 Power Dissipation in CMOS ICs
4.9.3 CMOS Transmission Gate
4.10 Simple ASIC Design Flow
4.10.1 Register Transfer Logic
4.10.2 Optimization
4.11 FPGA Design Flow
4.11.1 Introduction to High-Capacity FPDs
4.11.2 Definitions of Relevant Terminology
4.12 Basic FPGA Architecture
4.12.1 FPGA Design & Programming
4.12.2 Advantages of HDLs for FPGA Devices
4.12.3 FPGA-to-ASIC Comparisons
4.13 VHDL & Verilog
CHAPTER – 5 SOFTWARE REQUIREMENTS
5.1 Synthesis Tool
5.1.1 Xilinx ISE 13.1
5.1.2 XST Design Constraints
5.1.3 Architectural Overview
5.1.4 History of the IEEE 1364 Verilog Standard
5.1.5 Modeling Enhancements
CHAPTER – 6 IMPLEMENTATION OF PROPOSED METHOD
6.1 Introduction
6.2 Algorithm for Montgomery Multiplier
6.2.1 Algorithm for Montgomery Multiplier for CSA
6.2.2 Algorithm for MMM Using CSA
CHAPTER – 7 SIMULATION USING VERILOG
7.1 General
CHAPTER – 8 RESULTS
8.1 General
8.2 Simulation Results
CHAPTER – 9 APPLICATIONS
9.1 General

FUTURE SCOPE

CONCLUSION

REFERENCES
CHAPTER-1

INTRODUCTION

1.1 GENERAL

Due to the rapid increase in internet services and data communication, such as electronic
commerce, fundamental security requirements for protecting sensitive data during electronic
dissemination have become an important concern. Many systems utilize public-key cryptography
to provide such security services, and Rivest, Shamir, and Adleman (RSA) is one of the most
widely adopted public-key algorithms at present. However, RSA requires repeated modular
multiplications to accomplish modular exponentiation, and the modulus is generally at least
1024 bits long for long-term security. High data throughput rates are therefore difficult to
achieve without hardware acceleration. Additionally, security requirements are increasingly
important for private data transmission through mobile devices with Internet access, such as
smartphones and notebook computers, which require an energy-efficient cryptosystem because of
their limited battery power. For such applications, it is necessary to develop efficient hardware
architectures that carry out fast modular multiplications with low power consumption.

A well-known approach to implementing modular multiplication in hardware is based on
the Montgomery modular multiplication algorithm, which replaces trial division by the modulus
with a series of addition and shift operations; its critical operation is therefore a three-operand
addition inside an iteration loop. Unfortunately, time-consuming carry propagation in the
addition of long operands seriously limits the performance of RSA cryptosystems. Accordingly,
several approaches to avoiding long carry propagation during the addition have been proposed
to achieve a significant speedup of Montgomery modular multiplication. These approaches can be
roughly classified into two categories according to the representation of the intermediate
results of modular multiplication in the exponentiation.
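The add-shift loop described above can be sketched in software. The following Python fragment is an illustrative radix-2 Montgomery multiplication (function and variable names are our own, not taken from any of the cited designs); each iteration performs the three-operand addition of the partial result, a conditional multiple of the multiplicand, and a conditional multiple of the modulus, followed by a one-bit shift:

```python
def mont_mul(a, b, n, k):
    """Radix-2 Montgomery multiplication: returns a*b*2^(-k) mod n.

    n must be odd and a, b < n; k is the operand bit length.
    The loop body is the critical three-operand addition plus shift."""
    s = 0
    for i in range(k):
        a_i = (a >> i) & 1                 # i-th multiplier bit
        q_i = (s + a_i * b) & 1            # makes s + a_i*b + q_i*n even
        s = (s + a_i * b + q_i * n) >> 1   # three-operand add, then shift
    return s - n if s >= n else s          # final conditional subtraction
```

Since each step divides by 2 modulo n, the result carries an extra factor of 2^(-k); a further Montgomery multiplication by 2^(2k) mod n converts it back to ordinary form.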

1.2 LITERATURE REVIEW

The ever-increasing density and functional complexity of digital integrated circuits demand
more efficient and power-saving arithmetic. A well-known approach to implementing modular
multiplication in hardware is based on the Montgomery modular multiplication algorithm
[2], [3], which replaces trial division by the modulus with a series of addition and shift
operations; its critical operation is therefore a three-operand addition inside an iteration
loop. Unfortunately, time-consuming carry propagation in the addition of long operands
seriously limits the performance of RSA cryptosystems. Accordingly, several approaches to
avoiding long carry propagation during the addition have been proposed to achieve a
significant speedup of Montgomery modular multiplication. These approaches can be roughly
classified into two categories according to the representation of the intermediate results of
modular multiplication in the exponentiation.

In the first category (e.g., [4]–[7]), the inputs and outputs of the Montgomery modular
multiplication are represented in binary form, but intermediate results of the modular
multiplication are kept in carry-save representation to avoid carry propagation. However, the
format conversion from the carry-save representation of the final product into its binary
representation must be performed at the end of each modular multiplication. This conversion can
simply be accomplished by adding the carry and sum terms of the carry-save representation, but
that addition still suffers from long carry propagation, so extra circuitry and time are needed
for these conversions. The second category of approaches (e.g., [8]–[11]) eliminates the
repeated interim output-to-input format conversions by maintaining all inputs and outputs of the
Montgomery modular multiplication in carry-save form, except in the final step for obtaining the
result of the modular exponentiation. However, this implies that the number of operands in the
modular multiplication increases, so additional registers are required to store these operands.

For example, the work in [8] proposed two variants of the Montgomery multiplication
algorithm, which use carry-save adders (CSAs) to accomplish the modular exponentiation. The
first of these variants is based on a five-to-two CSA to avoid the repeated interim
output-to-input format conversion. To further decrease the number of input operands from five to
four, three input operands are selected and combined into the corresponding carry-save form at
the beginning of each modular multiplication. However, extra multiplexers and select signals are
necessary to choose the desired input operands for the four-to-two CSA.

The internal behavior and structure of the barrel register full adder (BRFA) were also
exploited so that the gated-clock design technique could be applied to markedly decrease the
energy consumption of the BRFA. Experimental results show that 36% energy saving and a 19.7%
cycle reduction can be achieved for the 1024-bit Montgomery multiplier by bypassing the
superfluous operations. Additionally, applying clock gating to the registers and the proposed
technique to the BRFA of the 1024-bit Montgomery multiplier leads to 24% further energy
reduction.

1.3 PROPOSED SYSTEM

In previous work, the addition in the multiplier was performed with a Carry Save Adder
(CSA). To increase the speed of the multiplier, this research work designs a Montgomery modular
multiplier with a Carry Select Adder (CSLA).

1.3.1 ADVANTAGES OF THE PROPOSED SYSTEM

The proposed method does not use large-sized multiplexers. The goal is achieved by first
modifying the CSA-based Montgomery algorithm to bypass the iterations that perform superfluous
carry-save addition and register write operations in the add-shift loop. As a result, not only
the addition and shift operations but also the number of clock cycles required to complete the
Montgomery multiplication can be greatly decreased, leading to significant energy saving and
higher throughput.
1.4 THESIS OUTLINE

Chapter 1 introduces the previous techniques, their drawbacks, and the advantages of the
proposed method. Chapter 2 covers the basics of cryptography. Chapter 3 discusses the
implementation of the different adders used in the design. Chapter 4 covers the basics of VLSI
technology, the fabrication of MOS transistors, the ASIC design flow, the FPGA design flow, and
the hardware description languages used in VLSI. Chapter 5 discusses the synthesis tool Xilinx
ISE 13.1 and its XST design constraints. Chapter 6 covers the implementation of the proposed
method. Chapter 7 covers the design of the adder in Verilog code. Chapter 8 shows the function
of the CSA and the modified CSA using Xilinx ISE 13.1. Chapter 9 shows the applications of the
proposed method. The later sections present the future scope, conclusion, and references for
this document. The last section displays the international journal publication of this project.
CHAPTER – 2 CRYPTOGRAPHY

2.1 Public-Key Cryptosystems:

In a public-key cryptosystem each user places in a public file an encryption procedure E.
That is, the public file is a directory giving the encryption procedure of each user. The user
keeps secret the details of his corresponding decryption procedure D. These procedures have the
following four properties:

(a) Deciphering the enciphered form of a message M yields M. Formally, D(E(M)) = M. (1)
(b) Both E and D are easy to compute.
(c) By publicly revealing E the user does not reveal an easy way to compute D. This means that
in practice only he can decrypt messages encrypted with E, or compute D efficiently.
(d) If a message M is first deciphered and then enciphered, M is the result. Formally,
E(D(M)) = M. (2)

An encryption (or decryption) procedure typically consists of a general method and an
encryption key. The general method, under control of the key, enciphers a message M to obtain
the enciphered form of the message, called the ciphertext C. Everyone can use the same general
method; the security of a given procedure will rest on the security of the key. Revealing an
encryption algorithm then means revealing the key.

When the user reveals E he reveals a very inefficient method of computing D(C): testing
all possible messages M until one such that E(M) = C is found. If property (c) is satisfied, the
number of such messages to test will be so large that this approach is impractical. A function E
satisfying (a)-(c) is a "trap-door one-way function"; if it also satisfies (d) it is a
"trap-door one-way permutation." Diffie and Hellman [1] introduced the concept of trap-door
one-way functions but did not present any examples. These functions are called "one-way" because
they are easy to compute in one direction but (apparently) very difficult to compute in the
other direction. They are called "trap-door" functions since the inverse functions are in fact
easy to compute once certain private "trap-door" information is known. A trap-door one-way
function which also satisfies (d) must be a permutation: every message is the ciphertext for
some other message and every ciphertext is itself a permissible message. (The mapping is
one-to-one and onto.) Property (d) is needed only to implement "signatures." The reader is
encouraged to read Diffie and Hellman's excellent article [1] for further background, for
elaboration of the concept of a public-key cryptosystem, and for a discussion of other problems
in the area of cryptography. The ways in which a public-key cryptosystem can ensure privacy and
enable "signatures" are also due to Diffie and Hellman. For our scenarios we suppose that A and
B (also known as Alice and Bob) are two users of a public-key cryptosystem. We will distinguish
their encryption and decryption procedures with subscripts: EA, DA, EB, DB.
2.2 Information Security:

The concept of information will be taken to be an understood quantity. To introduce
cryptography, an understanding of issues related to information security in general is
necessary. Information security manifests itself in many ways according to the situation and
requirement. Regardless of who is involved, to one degree or another, all parties to a
transaction must have confidence that certain objectives associated with information security
have been met. Over the centuries, an elaborate set of protocols and mechanisms has been created
to deal with information security issues when the information is conveyed by physical documents.
Often the objectives of information security cannot be achieved through mathematical algorithms
and protocols alone; procedural techniques and abidance of laws are also required to achieve the
desired result.

For example, privacy of letters is provided by sealed envelopes delivered by an accepted
mail service. The physical security of the envelope is, for practical necessity, limited, and so
laws are enacted which make it a criminal offense to open mail for which one is not authorized.
It is sometimes the case that security is achieved not through the information itself but
through the physical document recording it.

For example, paper currency requires special inks and material to prevent counterfeiting.
Conceptually, the way information is recorded has not changed dramatically over time. Whereas
information was typically stored and transmitted on paper, much of it now resides on magnetic
media and is transmitted via telecommunications systems, some wireless. What has changed
dramatically is the ability to copy and alter information. One can make thousands of identical
copies of a piece of information stored electronically and each is indistinguishable from the
original. With information on paper, this is much more difficult.
What is needed then for a society where information is mostly stored and transmitted in
electronic form is a means to ensure information security which is independent of the physical
medium recording or conveying it and such that the objectives of information security rely solely
on digital information itself. One of the fundamental tools used in information security is the
signature. It is a building block for many other services such as non-repudiation, data origin
authentication, identification, and witnessing, to mention a few. Having learned the basics in
writing, an individual is taught how to produce a handwritten signature for the purpose of
identification. At contract age the signature evolves to take on a very integral part of the person’s
identity. This signature is intended to be unique to the individual and serve as a means to identify,
authorize, and validate.
With electronic information the concept of a signature needs to be redressed; it cannot
simply be something unique to the signer and independent of the information signed. Electronic
replication of it is so simple that appending a signature to a document not signed by the originator
of the signature is almost a triviality. Analogues of the “paper protocols” currently in use are
required. Hopefully these new electronic based protocols are at least as good as those they
replace. There is a unique opportunity for society to introduce new and more efficient ways of
ensuring information security. Much can be learned from the evolution of the paper based system,
mimicking those aspects which have served us well and removing the inefficiencies. Achieving
information security in an electronic society requires a vast array of technical and legal skills.
There is, however, no guarantee that all of the information security objectives deemed necessary
can be adequately met. The technical means is provided through cryptography.

2.3 Public-Key Cryptography:

Fig 2.1 Public Key Cryptography

In an asymmetric-key encryption scheme, anyone can encrypt messages using the public key,
but only the holder of the paired private key can decrypt. Security depends on the secrecy of
that private key.

In some related signature schemes, the private key is used to sign a message; anyone can
check the signature using the public key. Validity depends on the security of the private key.
In the Diffie–Hellman key exchange scheme, each party generates a public/private key pair and
distributes the public key. After obtaining an authentic copy of each other's public keys, Alice
and Bob can compute a shared secret offline. The shared secret can be used, for instance, as the
key for a symmetric cipher.

Public-key cryptography refers to a cryptographic system requiring two separate keys,
one to lock or encrypt the plaintext, and one to unlock or decrypt the ciphertext. Neither key
can perform both functions. One of these keys is published or public and the other is kept
private. If the lock/encryption key is the one published, then the system enables private
communication from the public to the unlocking key's owner. If the unlock/decryption key is the
one published, then the system serves as a signature verifier of documents locked by the owner
of the private key. In this latter case, since encrypting the entire message is relatively
expensive computationally, in practice just a hash of the message is encrypted for signature
verification purposes. This cryptographic approach uses asymmetric-key algorithms such as RSA,
hence the more general name of "asymmetric-key cryptography". Some of these algorithms have the
public-key/private-key property that neither key is derivable from knowledge of the other; not
all asymmetric-key algorithms do. Those with this property are particularly useful and have been
widely deployed, and are the source of the commonly used name. Although distinct, the two keys
of the pair are mathematically linked. The public key is used to transform a message into an
unreadable form, decryptable only by using the (different but matching) private key. By
publishing the public key, the key producer empowers anyone who gets a copy of it to produce
messages only he or she can read, because only the key producer has a copy of the private key
(required for decryption). When someone wants to send a secure message to the creator of those
keys, the sender encrypts it (i.e., transforms it into an unreadable form) using the intended
recipient's public key; to decrypt the message, the recipient uses the private key. No one else,
including the sender, can do so. Thus, unlike symmetric-key algorithms, a public-key algorithm
does not require a secure initial exchange of one or more secret keys between the sender and
receiver. These algorithms work in such a way that, while it is easy for the intended recipient
to generate the public and private keys and to decrypt the message using the private key, and
while it is easy for the sender to encrypt the message using the public key, it is extremely
difficult for anyone to figure out the private key based on knowledge of the public key. They
are based on mathematical relationships (the most notable being the integer factorization and
discrete logarithm problems) that have no known efficient solution.

The use of these algorithms also allows the authenticity of a message to be checked by
creating a digital signature of the message using the private key, which can then be verified
using the public key. Public-key cryptography is a fundamental and widely used technology. It is
an approach used by many cryptographic algorithms and cryptosystems. It underpins such Internet
standards as Transport Layer Security (TLS) (the successor to SSL), PGP, and GPG.

2.4 How It Works

The distinguishing technique used in public key cryptography is the use of asymmetric
key algorithms, where the key used to encrypt a message is not the same as the key used to
decrypt it. Each user has a pair of cryptographic keys — a public encryption key and a private
decryption key. The publicly available encrypting-key is widely distributed, while the private
decrypting-key is known only to the recipient. Messages are encrypted with the recipient's public
key and can be decrypted only with the corresponding private key. The keys are related
mathematically, but parameters are chosen so that determining the private key from the public
key is prohibitively expensive. The discovery of algorithms that could produce public/private key
pairs revolutionized the practice of cryptography beginning in the mid-1970s.

In contrast, symmetric-key algorithms, variations of which have been used for thousands
of years, use a single secret key, which must be shared and kept private by both sender and
receiver, for both encryption and decryption. To use a symmetric encryption scheme, the sender
and receiver must securely share a key in advance. Because symmetric-key algorithms are nearly
always much less computationally intensive, it is common to exchange a key using a key-exchange
algorithm and then transmit data using that key and a symmetric-key algorithm. PGP and the
SSL/TLS family of schemes do this, for instance, and are thus called hybrid cryptosystems.

2.4.1 Description

The two main branches of public key cryptography are:

• Public-key encryption: a message encrypted with a recipient's public key cannot be
decrypted by anyone except a possessor of the matching private key; it is presumed that this
will be the owner of that key and the person associated with the public key used. This is used
for confidentiality.
• Digital signatures: a message signed with a sender's private key can be verified by anyone
who has access to the sender's public key, thereby proving that the sender had access to the
private key (and therefore is likely to be the person associated with the public key used) and
that the message has not been tampered with. On the question of authenticity, see also message
digest.

An analogy to public-key encryption is that of a locked mailbox with a mail slot. The
mail slot is exposed and accessible to the public; its location (the street address) is in
essence the public key. Anyone knowing the street address can go to the door and drop a written
message through the slot; however, only the person who possesses the key can open the mailbox
and read the message. An analogy for digital signatures is the sealing of an envelope with a
personal wax seal. The message can be opened by anyone, but the presence of the seal
authenticates the sender. A central problem for the use of public-key cryptography is confidence
(ideally proof) that a public key is correct, belongs to the person or entity claimed (i.e., is
'authentic'), and has not been tampered with or replaced by a malicious third party. The usual
approach to this problem is to use a public-key infrastructure (PKI), in which one or more third
parties, known as certificate authorities, certify ownership of key pairs. PGP, in addition to a
certificate authority structure, has used a scheme generally called the "web of trust", which
decentralizes such authentication of public keys, substituting individual endorsements of the
link between user and public key for a central mechanism. No fully satisfactory solution to the
public-key authentication problem is known.

2.5 RSA Algorithm:

This algorithm is based on the difficulty of factorizing large numbers that have two and
only two factors (prime numbers). The system works on a public- and private-key system. The
public key is made available to everyone. With this key a user can encrypt data but cannot
decrypt it; the only person who can decrypt it is the one who possesses the private key. It is
theoretically possible but extremely difficult to generate the private key from the public key;
this makes the RSA algorithm a very popular choice in data encryption.

Algorithm:

First of all, two large distinct prime numbers p and q must be generated. The product of
these, which we call n, is a component of the public key. It must be large enough that the
numbers p and q cannot be extracted from it: 512 bits at least, i.e., numbers greater than
10^154. We then generate the encryption key e, which must be co-prime to the number
m = φ(n) = (p − 1)(q − 1). We then create the decryption key d such that de mod m = 1. We now
have both the public and private keys.
Encryption:
We let y = E(x) be the encryption function, where x is an integer and y is the encrypted form
of x:
y = x^e mod n
Decryption:
We let X = D(y) be the decryption function, where y is an encrypted integer and X is the
decrypted form of y:
X = y^d mod n
2.5.1 Simple Example
1. We start by selecting primes p = 3 and q = 11.
2. n = pq = 33
   m = (p − 1)(q − 1) = (2)(10) = 20.
3. Try e = 3:
   gcd(3, 20) = 1
   ⇒ e is co-prime to m.
4. Find d such that de ≡ 1 (mod m), i.e., 1 = Km + de.
   Using the extended Euclidean algorithm we see that 1 = −1(20) + 7(3)
   ⇒ d = 7.
5. Now let us say that we want to encrypt the number x = 9.
   We use the encryption function y = x^e mod n:
   y = 9^3 mod 33 = 729 mod 33 = 3
   ⇒ y = 3.
6. To decrypt y we use the function X = y^d mod n:
   X = 3^7 mod 33 = 2187 mod 33 = 9
   ⇒ X = 9 = x
   ⇒ It works!
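The key generation, encryption, and decryption steps above translate directly into a few lines of Python. This is a toy sketch of the textbook algorithm (the function names are ours); real implementations use primes of 512 bits or more together with padding schemes.

```python
from math import gcd

def rsa_keygen(p, q, e):
    """Toy RSA key generation from two given primes (illustration only)."""
    n = p * q
    m = (p - 1) * (q - 1)            # m = phi(n) for n = p*q
    assert gcd(e, m) == 1, "e must be co-prime to m"
    d = pow(e, -1, m)                # modular inverse: d*e ≡ 1 (mod m)
    return (e, n), (d, n)            # public key, private key

def rsa_encrypt(x, pub):
    e, n = pub
    return pow(x, e, n)              # y = x^e mod n

def rsa_decrypt(y, priv):
    d, n = priv
    return pow(y, d, n)              # X = y^d mod n

# The worked example: p = 3, q = 11, e = 3 gives n = 33 and d = 7.
pub, priv = rsa_keygen(3, 11, 3)
```

With these keys, encrypting x = 9 reproduces the ciphertext y = 3 from the worked example, and decrypting y = 3 recovers 9.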
CHAPTER – 3 ADDER

3.1 Adder:

In electronics, an adder or summer is a digital circuit that performs addition of numbers.
In many computers and other kinds of processors, adders are used not only in the arithmetic
logic unit(s), but also in other parts of the processor, where they are used to calculate
addresses, table indices, and the like. Although adders can be constructed for many numerical
representations, such as binary-coded decimal or excess-3, the most common adders operate on
binary numbers. In cases where two's complement or ones' complement is being used to represent
negative numbers, it is trivial to modify an adder into an adder–subtractor. Other signed number
representations require a more complex adder.

3.1.1 Half adder:

Fig 3.1 Half Adder logic diagram

The half adder adds two one-bit binary numbers A and B. It has two outputs, S and C (the value
theoretically carried on to the next addition); the final sum is 2C + S. The simplest half adder
design, shown in Fig 3.1, incorporates an XOR gate for S and an AND gate for C. With the
addition of an OR gate to combine their carry outputs, two half adders can be combined to make a
full adder.
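As a quick sanity check, the two gates can be modeled directly on single bits (an illustrative Python sketch; the function name is ours):

```python
def half_adder(a, b):
    """Half adder on single bits: XOR gives the sum, AND gives the carry."""
    s = a ^ b        # S = A XOR B
    c = a & b        # C = A AND B
    return s, c      # the numeric value of the result is 2*C + S
```

For example, `half_adder(1, 1)` returns `(0, 1)`, i.e., 1 + 1 = binary 10.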

3.1.2 Full adder:

Fig 3.2 Full adder block

The schematic symbol for a 1-bit full adder is drawn with Cin and Cout on the sides of the block
to emphasize their use in a multi-bit adder. A full adder adds binary numbers and accounts for
values carried in as well as out. A one-bit full adder adds three one-bit numbers, often written
as A, B, and Cin; A and B are the operands, and Cin is a bit carried in from the next less
significant stage [2]. The full adder is usually a component in a cascade of adders, which add
8-, 16-, 32-bit, etc. binary numbers. The circuit produces a two-bit output sum typically
represented by the signals Cout and S.

Fig 3.3 Full adder (using gates)

A full adder can be implemented in many different ways, such as with a custom
transistor-level circuit or composed of other gates. In one implementation, the final OR gate
before the carry-out output may be replaced by an XOR gate without altering the resulting logic.
Using only two types of gates is convenient if the circuit is being implemented using simple IC
chips which contain only one gate type per chip. In this light, Cout can be implemented as
(A AND B) OR (Cin AND (A XOR B)).

A full adder can be
constructed from two half adders by connecting A and B to the input of one half adder,
connecting the sum from that to an input of the second half adder, connecting Ci to the other
input, and ORing the two carry outputs. Equivalently, S could be made the three-input XOR of A,
B, and Ci, and Cout could be made the three-input majority function of A, B, and Ci.
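The two-half-adder construction just described can be modeled bit for bit (an illustrative Python sketch; the names are ours):

```python
def full_adder(a, b, cin):
    """Full adder built from two half adders and an OR of their carries."""
    s1, c1 = a ^ b, a & b            # first half adder: inputs A, B
    s, c2 = s1 ^ cin, s1 & cin       # second half adder: intermediate sum, Cin
    cout = c1 | c2                   # OR combines the two carry outputs
    return s, cout
```

Equivalently, `s == a ^ b ^ cin` and `cout` is the majority of the three input bits, matching the alternative formulation above.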

3.2 Ripple Carry Adder/Subtractor:

A simple ripple carry adder is a digital circuit that produces the arithmetic sum of two binary
numbers. It can be constructed with full adders connected in cascade, with the carry output from
each full adder connected to the carry input of the next full adder in the chain. Fig 3.4 shows
the interconnection of four full adder (FA) circuits to provide a 4-bit ripple carry adder.
Notice from the figure that the input is from the right side because the first cell traditionally
represents the least significant bit (LSB). Bits a0 and b0 in the figure represent the least
significant bits of the numbers to be added. The sum output is represented by the bits s0–s3.
The main problem with this type of adder is the delay needed to produce the carry-out signal and
the most significant bits. This delay increases with the number of bits to be added.

Fig 3.4 4-bit ripple carry adder
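Chaining full adders as described gives the ripple behavior below (a Python sketch with the least significant bit first; the function name is ours):

```python
def ripple_carry_add(a_bits, b_bits, carry=0):
    """Ripple carry adder: each stage's carry-out feeds the next stage."""
    sum_bits = []
    for a, b in zip(a_bits, b_bits):          # LSB first, as in the figure
        sum_bits.append(a ^ b ^ carry)        # full-adder sum
        carry = (a & b) | (carry & (a ^ b))   # full-adder carry-out
    return sum_bits, carry                    # carry is the final carry-out
```

Adding 11 and 6 (bit lists [1, 1, 0, 1] and [0, 1, 1, 0], LSB first) yields sum bits [1, 0, 0, 0] with carry-out 1, i.e., 10001 = 17; note how every stage had to wait for the previous carry.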


3.3 Carry Lookahead Adder (CLA):

The carry lookahead adder (CLA) solves the carry delay problem by calculating the carry signals
in advance, based on the input signals. It is based on the fact that a carry signal will be
generated in two cases: (1) when both bits ai and bi are 1, or (2) when one of the two bits is 1
and the carry-in is 1. Thus, one can write c(i+1) = gi + pi·ci, where gi = ai·bi and
pi = ai ⊕ bi.

Fig 3.5 Carry lookahead adder


gi and pi are called the carry generate and carry propagate terms, respectively. Notice that the
generate and propagate terms depend only on the input bits and thus will be valid after one and
two gate delays, respectively. If one uses the above expression to calculate the carry signals,
one does not need to wait for the carry to ripple through all the previous stages to find its
proper value. Let us apply this to a 4-bit adder to make it clear.

Fig 3.6 Carry lookahead adder (using gates)
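In software, the generate and propagate terms and the resulting carries for a 4-bit adder look like this (an illustrative sketch with our own names; in hardware the recurrence is expanded so that every carry is computed directly from the g and p terms in parallel rather than sequentially as the loop suggests):

```python
def cla_add4(a_bits, b_bits, c0=0):
    """4-bit carry lookahead adder (LSB first)."""
    g = [a & b for a, b in zip(a_bits, b_bits)]   # generate: g_i = a_i AND b_i
    p = [a ^ b for a, b in zip(a_bits, b_bits)]   # propagate: p_i = a_i XOR b_i
    c = [c0]
    for i in range(4):
        c.append(g[i] | (p[i] & c[i]))            # c_{i+1} = g_i + p_i*c_i
    s = [p[i] ^ c[i] for i in range(4)]           # s_i = p_i XOR c_i
    return s, c[4]
```

Because every g and p term is available after at most two gate delays, the expanded carry equations let all four carries settle without waiting for a ripple.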


The design of area-efficient, high-speed data path logic systems is one of the most
important areas of research in VLSI. In digital adders, the speed of addition is limited by the
time required to propagate a carry through the adder. The sum for each bit position in an
elementary adder is generated sequentially only after the previous bit position has been summed
and a carry propagated into the next position. Bedriji proposed that the problem of carry
propagation delay be overcome by independently generating multiple radix carries and using these
carries to select between simultaneously generated sums. Akhilash Tyagi introduced a scheme to
generate carry bits with block carry-in 1 from the carries of a block with block carry-in 0.
Chang and Hsiao proposed that, instead of using dual RCAs, a CSLA scheme use an add-one circuit
to replace one RCA. Youngioon Kim and Lee Sup Kim introduced a multiplexer-based add-one circuit
to reduce the area with negligible speed penalty. Yajuan He et al. proposed an area-efficient
square-root CSLA (SQRT CSLA) scheme based on a new first-zero detection logic. Ramkumar and
Harish proposed the Binary to Excess-1 Converter (BEC) technique, a simple and efficient
gate-level modification that significantly reduces the area of the SQRT CSLA. Padma Devi et al.
proposed a modified CSLA designed in different stages which reduces the area. The CSLA is used
in many computational systems to relieve the problem of carry propagation delay by independently
generating multiple carries and then selecting a carry to generate the sum. However, the CSLA is
not area-efficient because it uses multiple pairs of RCAs to generate partial sums and carries by
considering carry-in 0 and carry-in 1; the final sum and carry are then selected by multiplexers
(MUX). The basic idea of this work is to use a BEC instead of the RCA with carry-in 1 in the
regular CSLA to achieve lower area. The main benefit of the BEC comes from its using fewer logic
gates than the n-bit full adder (FA).
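The carry-select idea, computing both possibilities and letting the real carry pick one, can be sketched as follows (a pure-Python illustration of a single block; the names are ours, and the BEC variant discussed above would replace the carry-in-1 adder with an add-one circuit):

```python
def carry_select_block(a_bits, b_bits, cin):
    """One carry-select block: precompute sums for carry-in 0 and carry-in 1,
    then a 'multiplexer' picks the pair matching the actual carry-in."""
    def rca(c):                                     # ripple adder for one guess
        out = []
        for a, b in zip(a_bits, b_bits):
            out.append(a ^ b ^ c)
            c = (a & b) | (c & (a ^ b))
        return out, c
    sum0, cout0 = rca(0)                            # guess carry-in = 0
    sum1, cout1 = rca(1)                            # guess carry-in = 1
    return (sum1, cout1) if cin else (sum0, cout0)  # select with real carry-in
```

In hardware both ripple adders run in parallel, so the block's latency is one adder delay plus a multiplexer instead of waiting for the incoming carry before starting.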

3.4 Carry-save adder:


A carry-save adder is a type of digital adder, used in computer microarchitecture to
compute the sum of three or more n-bit numbers in binary. It differs from other digital adders in
that it outputs two numbers of the same dimensions as the inputs: one is a sequence of partial
sum bits and the other is a sequence of carry bits.

Consider the sum is

87644322

+12355678

=100000000

Using ordinary arithmetic addition, we work from right to left: 2+8 = 0 carry 1,
2+7+1 = 0 carry 1, and so on to the end of the sum. Adding two n-digit numbers this way takes
time proportional to n.
In electronic terms, using binary bits, this means that even if we have n one-bit adders at our
disposal, we still have to allow a time proportional to n to allow a possible carry to propagate
from one end of the number to the other. Until we have done this,

1. We do not know the result of the addition.

2. We do not know whether the result of the addition is larger or smaller than a given number (for
instance, we do not know whether it is positive or negative).

A carry look-ahead adder can reduce the delay. In principle the delay can be reduced so
that it is proportional to log n, but for large numbers this is no longer the case, because even when
carry look-ahead is implemented, the distances that signals have to travel on the chip increase in
proportion to n, and propagation delays increase at the same rate. Once we get to the 512-bit to
2048-bit number sizes that are required in public-key cryptography, carry look-ahead is not of
much help.

3.4.1 Basic concept:

Binary sum:
10111010101011011111000000001101
+ 11011110101011011011111011101111.

Carry-save arithmetic works by abandoning the binary notation while still working to base 2.
10111010101011011111000000001101
+11011110101011011011111011101111
= 21122120202022022122111011102212.

The notation is unconventional but the result is still unambiguous. Moreover, given n
adders (here, n = 32 full adders), the result can be calculated in a single tick of the clock, since
each digit result does not depend on any of the others. If the adder is required to add two numbers
and produce a result, carry-save addition is useless, since the result still has to be converted back
into binary and this still means that carries have to propagate from right to left. But in large-
integer arithmetic, addition is a very rare operation, and adders are mostly used to accumulate
partial sums in a multiplication.
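The digit-wise addition above can be sketched as one layer of full adders operating in parallel. The following Python model (an illustration, not the thesis's hardware implementation) produces the partial-sum and carry words for the two 32-bit operands shown earlier:

```python
def carry_save_add(a, b, c, n=32):
    """One layer of carry-save addition: three n-bit numbers in, a
    partial-sum word and a carry word out. Each bit position is an
    independent full adder, so no carry propagates across positions."""
    ps = a ^ b ^ c                    # full-adder sum bits
    sc = (a & b) | (b & c) | (a & c)  # full-adder carry bits
    mask = (1 << n) - 1
    return ps & mask, sc & mask

a = 0b10111010101011011111000000001101
b = 0b11011110101011011011111011101111
ps, sc = carry_save_add(a, b, 0)
# The redundant pair (ps, sc) represents the sum: ps + 2*sc == a + b.
assert ps + 2 * sc == a + b
```

Each digit of the base-2 "2112..." notation in the text is exactly the pair (ps bit, sc bit) at that position.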

3.4.2 Carry-save accumulators:

Supposing that we have two bits of storage per digit, we can use a redundant binary
representation, storing the values 0, 1, 2, or 3 in each digit position. It is therefore obvious that
one more binary number can be added to our carry-save result without overflowing our storage
capacity.

The key to success is that at the moment of each partial addition we add three bits:

 0 or 1, from the number we are adding.


 0 if the digit in our store is 0 or 2, or 1 if it is 1 or 3.
 0 if the digit to its right is 0 or 1, or 1 if it is 2 or 3.
To put it another way, we are taking a carry digit from the position on our right, and
passing a carry digit to the left, just as in conventional addition; but the carry digit we pass to the
left is the result of the previous calculation and not the current one. In each clock cycle, carries
only have to move one step along, and not n steps as in conventional addition. Because signals
don't have to move as far, the clock can tick much faster. There is still a need to convert the result
to binary at the end of a calculation, which effectively just means letting the carries travel all the
way through the number just as in a conventional adder. But if we have done 512 additions in the
process of performing a 512-bit multiplication, the cost of that final conversion is effectively split
across those 512 additions, so each addition bears 1/512 of the cost of that final "conventional"
addition.
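The accumulation scheme described above can be sketched as follows (a behavioral Python model under the stated assumptions; real hardware would keep ps and sc in registers and perform one carry-save layer per clock cycle):

```python
import random

def csa(a, b, c):
    """One full-adder layer: (a, b, c) -> (sum word, carry word)."""
    return a ^ b ^ c, (a & b) | (b & c) | (a & c)

def accumulate(addends):
    """Keep the running total in redundant (ps, sc) form; carries move
    only one position per step, so each addition has the depth of a
    single full adder. One conventional, carry-propagating addition at
    the very end converts the result back to binary."""
    ps, sc = 0, 0
    for x in addends:
        ps, sc = csa(ps, sc << 1, x)
    return ps + (sc << 1)  # single final "conventional" addition

# 512 additions of 512-bit numbers: the final conversion's cost is
# shared across all of them.
nums = [random.getrandbits(512) for _ in range(512)]
assert accumulate(nums) == sum(nums)
```

The invariant maintained by the loop is that the true running total always equals ps + 2*sc, which is why the single final addition recovers the exact sum.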

3.4.3 Drawbacks:
At each stage of a carry-save addition, we know the result of the addition at once. We
still do not know whether the result of the addition is larger or smaller than a given number (for
instance, we do not know whether it is positive or negative). This latter point is a drawback when
using carry-save adders to implement modular multiplication (multiplication followed by
division, keeping the remainder only): if we cannot know whether the intermediate result is
greater or less than the modulus, how can we know whether to subtract the modulus or not?
Montgomery multiplication, which depends on the rightmost digit of the result, is one solution;
though, rather like carry-save addition itself, it carries a fixed overhead, so that a sequence of
Montgomery multiplications saves time but a single one does not. Fortunately, exponentiation,
which is effectively a sequence of multiplications, is the most common operation in public-key
cryptography.

The entire sum can then be computed by:

1. Shifting the carry sequence sc left by one place.

2. Appending a 0 to the front (most significant bit) of the partial sum sequence ps.

3. Using a ripple-carry adder to add these two together and produce the resulting (n+1)-bit value.

When adding together three or more numbers, using a carry-save adder followed by a ripple-carry
adder is faster than using two ripple-carry adders. This is because a ripple-carry adder cannot
compute a sum bit without waiting for the previous carry bit to be produced, and thus has a delay
equal to that of n full adders. A carry-save adder, however, produces all of its output values in
parallel, and thus has the same delay as a single full adder. Thus the total computation time (in
units of full-adder delay time) for a carry-save adder plus a ripple-carry adder is n + 1, whereas
for two ripple-carry adders it would be 2n.
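The three-step conversion and the n + 1 versus 2n delay comparison can be illustrated with a small model (a hypothetical Python sketch; fa models a single full adder, and the operand values are arbitrary):

```python
def fa(a, b, cin):
    """One full adder: returns (sum bit, carry-out bit)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def csa_to_binary(ps, sc, n):
    """Convert the redundant (partial sum, carry) pair to ordinary
    binary: shift sc left one place, zero-extend ps to n+1 bits, then
    ripple-add the two (n+1)-bit words."""
    a = [(ps >> i) & 1 for i in range(n)] + [0]  # ps with a 0 appended at the MSB
    b = [0] + [(sc >> i) & 1 for i in range(n)]  # sc shifted left one place
    out, carry = [], 0
    for x, y in zip(a, b):
        s, carry = fa(x, y, carry)
        out.append(s)
    return sum(bit << i for i, bit in enumerate(out))

# Three-number addition: one carry-save layer (1 FA delay), then one
# ripple add (n FA delays) -- n + 1 total, versus 2n for two chained
# ripple-carry adders.
x, y, z, n = 0xDEAD, 0xBEEF, 0xCAFE, 17
ps = x ^ y ^ z
sc = (x & y) | (y & z) | (x & z)
assert csa_to_binary(ps, sc, n) == x + y + z
```

Here n is chosen large enough that the (n+1)-bit result holds the full three-operand sum.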
4.1 INTRODUCTION:

VLSI Design presents state-of-the-art papers in VLSI design, computer-aided design,


design analysis, design implementation, simulation and testing. Its scope also includes papers that
address technical trends, pressing issues, and educational aspects in VLSI Design. The
development of microelectronics spans a time even shorter than the average human life
expectancy, and yet it has seen as many as four generations. The early 60s saw the low-density
fabrication processes classified under Small Scale Integration (SSI), in which the transistor
count was limited to about 10. This rapidly gave way to Medium Scale Integration in the late 60s
when around 100 transistors could be placed on a single chip. It was the time when the cost of
research began to decline and private firms started entering the competition in contrast to the
earlier years where the main burden was borne by the military. Transistor-Transistor logic (TTL)
offering higher integration densities outlasted other IC families like ECL and became the basis of
the first integrated circuit revolution. It was the production of this family that gave impetus to
semiconductor giants like Texas Instruments, Fairchild and National Semiconductors. Early
seventies marked the growth of transistor count to about 1000 per chip called the Large Scale
Integration.

By the mid eighties, the transistor count on a single chip had grown into the tens of
thousands, and hence came the age of Very Large Scale Integration, or VLSI. Though many
improvements have been made and the transistor count is still rising, further generation names
like ULSI are generally avoided. It was during this time that TTL lost the battle to the MOS
family owing to the same problems that had pushed vacuum tubes into obsolescence: power
dissipation and the limit it imposed on the number of gates that could be placed on a single die.
The second age of the integrated circuit revolution started with the introduction of the first
microprocessor, the 4004, by Intel in 1971, followed by the 8080 in 1974. Today many
companies like Texas Instruments, Infineon, Alliance
Semiconductors, Cadence, Synopsys, Celox Networks, Cisco, Micron Tech, National
Semiconductors, ST Microelectronics, Qualcomm, Lucent, Mentor Graphics, Analog Devices,
Intel, Philips, Motorola and many other firms have been established and are dedicated to the
various fields in "VLSI" like Programmable Logic Devices, Hardware Descriptive Languages,
Design tools, Embedded Systems etc.

The SSI/MSI/LSI taxonomy is a hold-over from the 1980s and is now outdated; it was
evidently influenced by the radio frequency bands (HF, VHF, UHF). Sources disagree on what is
measured (gates or transistors):
 SSI – Small-Scale Integration (up to 10^2)
 MSI – Medium-Scale Integration (10^2 - 10^3)
 LSI – Large-Scale Integration (10^3 - 10^5)
 VLSI – Very Large-Scale Integration (10^5 - 10^7)
 ULSI – Ultra Large-Scale Integration (>= 10^7)
VLSI Technology, Inc was a company which designed and manufactured custom and
semi-custom ICs. The company was based in Silicon Valley, with headquarters at 1109 McKay
Drive in San Jose, California. Along with LSI Logic, VLSI Technology defined the leading edge
of the application-specific integrated circuit (ASIC) business, which accelerated the push of
powerful embedded systems into affordable products. The company was founded in 1979 by a
trio from Fairchild Semiconductor by way of Synertek - Jack Balletto, Dan Floyd, Gunnar
Wetlesen - and by Doug Fairbairn of Xerox PARC and Lambda (later VLSI Design) magazine.
The first semiconductor chips held two transistors each. Subsequent advances added
more and more transistors, and, as a consequence, more individual functions or systems were
integrated over time. The first integrated circuits held only a few devices, perhaps as many as ten
diodes, transistors, resistors and capacitors, making it possible to fabricate one or more logic
gates on a single device. Now known retrospectively as small-scale integration (SSI),
improvements in technique led to devices with hundreds of logic gates, known as medium-scale
integration (MSI). Further improvements led to large-scale integration (LSI), i.e. systems with at
least a thousand logic gates. Current technology has moved far past this mark and today's
microprocessors have many millions of gates and billions of individual transistors.
At one time, there was an effort to name and calibrate various levels of large-scale
integration above VLSI. Terms like ultra-large-scale integration (ULSI) were used. But the huge
number of gates and transistors available on common devices has rendered such fine distinctions
moot. Terms suggesting greater than VLSI levels of integration are no longer in widespread use.
As of early 2008, billion-transistor processors are commercially available. This is
expected to become more commonplace as semiconductor fabrication moves from the current
generation of 65 nm processes to the next 45 nm generations (while experiencing new challenges
such as increased variation across process corners). A notable example is Nvidia's 280 series
GPU. This GPU is unique in the fact that almost all of its 1.4 billion transistors are used for logic,
in contrast to the Itanium, whose large transistor count is largely due to its 24 MB L3 cache.
Certain high-performance logic blocks like the SRAM (Static Random Access Memory) cell,
however, are still designed by hand to ensure the highest efficiency (sometimes by bending or
breaking established design rules to obtain the last bit of performance by trading stability). VLSI
technology is moving towards radical miniaturization with the introduction of NEMS
technology. A lot of problems need to be sorted out before the transition is actually made.

4.2 WHY VLSI?

The course will cover basic theory and techniques of digital VLSI design in CMOS
technology. Topics include: CMOS devices and circuits, fabrication processes, static and dynamic
logic structures, chip layout, simulation and testing, low power techniques, design tools and
methodologies, VLSI architecture. We use full-custom techniques to design basic cells and
regular structures such as data-path and memory. There is an emphasis on modern design issues
in interconnect and clocking. We will also use several case-studies to explore recent real-world
VLSI designs (e.g. Pentium, Alpha, PowerPC, StrongARM, etc.) and papers from the recent
research literature. On-campus students will design small test circuits using various CAD tools.
Circuits will be verified and analyzed for performance with various simulators. Some final project
designs will be fabricated and returned to students the following semester for testing. Very-large-
scale integration (VLSI) is the process of creating integrated circuits by combining thousands of
transistor-based circuits into a single chip. VLSI began in the 1970s when complex
semiconductor and communication technologies were being developed. The microprocessor is a
VLSI device. The term is no longer as common as it once was, as chips have increased in
complexity into the hundreds of millions of transistors.

Even VLSI is now somewhat quaint, given the common assumption that all microprocessors are
VLSI or better. As of early 2008, billion-transistor processors are commercially available, an
example of which is Intel's Montecito Itanium chip. This is expected to become more
commonplace as semiconductor fabrication moves from the current generation of 65 nm
processes to the next 45 nm generations (while experiencing new challenges such as increased
variation across process corners). Another notable example is NVIDIA’s 280 series GPU.

This microprocessor is unique in the fact that its 1.4 Billion transistor count, capable of a
teraflop of performance, is almost entirely dedicated to logic (Itanium's transistor count is largely
due to the 24MB L3 cache). Current designs, as opposed to the earliest devices, use extensive
design automation and automated logic synthesis to lay out the transistors, enabling higher levels
of complexity in the resulting logic functionality. Certain high-performance logic blocks like the
SRAM cell, however, are still designed by hand to ensure the highest efficiency (sometimes by
bending or breaking established design rules to obtain the last bit of performance by trading
stability). Thanks to its Caltech and UC Berkeley students, VLSI was an important pioneer in the
electronic design automation industry. It offered a sophisticated package of tools, originally based
on the 'lambda-based' design style advocated by Carver Mead and Lynn Conway. VLSI became
an early vendor of standard cell (cell-based technology) to the merchant market in the early 80s
where the other ASIC-focused company, LSI Logic, was a leader in gate arrays. Prior to VLSI's
cell-based offering, the technology had been primarily available only within large vertically
integrated companies with semiconductor units such as AT&T and IBM.

VLSI's design tools eventually included not only design entry and simulation but
eventually cell-based routing (chip compiler), a data path compiler, SRAM and ROM compilers,
and a state machine compiler. The tools were an integrated design solution for IC design and not
just point tools, or more general purpose system tools. Characterization tools were integrated to
generate FrameMaker data sheets for libraries. VLSI eventually spun off the CAD and library
operation into Compass Design Automation, but it never reached an IPO before it was purchased
by Avant! Corporation.

VLSI's physical design tools were critical not only to its ASIC business, but also in
setting the bar for the commercial EDA industry. When VLSI and its main ASIC competitor, LSI
Logic, were establishing the ASIC industry, commercially-available tools could not deliver the
productivity necessary to support the physical design of hundreds of ASIC designs each year
without the deployment of a substantial number of layout engineers. The EDA industry finally
caught up in the late 1980s when Tangent Systems released its TanCell and TanGate products.

VLSI had not been timely in developing a 1.0 µm manufacturing process as the rest of
the industry moved to that geometry in the late 80s. VLSI entered a long-term technology
partnership with Hitachi and finally released a 1.0 µm process and cell library (actually more of a
1.2 µm library with a 1.0 µm gate).As VLSI struggled to gain parity with the rest of the industry
in semiconductor technology, the design flow was moving rapidly to a Verilog HDL and
synthesis flow. Cadence acquired Gateway, the leader in Verilog hardware design language
(HDL) and Synopsys was dominating the exploding field of design synthesis. As VLSI's tools
were being eclipsed, VLSI waited too long to open the tools up to other fabs and Compass Design
Automation was never a viable competitor to industry leaders.

Scientists and innovations from the 'design technology' part of VLSI found their way to
Cadence Design Systems (by way of Redwood Design Automation). Compass Design
Automation (VLSI's CAD and Library spin-off) was sold to Avant! Corporation, which itself was
acquired by Synopsys.

4.2.1 Structured design

Structured VLSI design is a modular methodology originated by Carver Mead and Lynn
Conway for saving microchip area by minimizing the interconnect fabrics area. This is obtained
by repetitive arrangement of rectangular macro blocks which can be interconnected using wiring
by abutment. An example is partitioning the layout of an adder into a row of equal bit slices cells.
In complex designs this structuring may be achieved by hierarchical nesting.Structured VLSI
design had been popular in the early 1980s, but lost its popularity later because of the advent of
placement and routing tools wasting a lot of area by routing, which is tolerated because of the
progress of Moore's Law. When introducing the hardware description language KARL in the mid'
1970s, Reiner Hartenstein coined the term "structured VLSI design" (originally as "structured LSI
design"), echoing Edsger Dijkstra's structured programming approach by procedure nesting to
avoid chaotic spaghetti-structured programs.
4.3 WHAT IS VLSI?

VLSI stands for "Very Large Scale Integration". This is the field which involves packing
more and more logic devices into smaller and smaller areas.

 Simply put, an integrated circuit is many transistors on one chip.


 Design/manufacturing of extremely small, complex circuitry using modified
semiconductor material
 Integrated circuit (IC) may contain millions of transistors, each a few µm or less in size
 Applications wide ranging: most electronic logic devices

4.3.1 History of Scale Integration

 Late 40s: Transistor invented at Bell Labs
 Late 50s: First IC (Jack Kilby at TI)
 Early 60s: Small Scale Integration (SSI) - 10s of transistors on a chip
 Late 60s: Medium Scale Integration (MSI) - 100s of transistors on a chip
 Early 70s: Large Scale Integration (LSI) - 1000s of transistors on a chip
 Early 80s: VLSI - 10,000s of transistors on a chip (later 100,000s and now 1,000,000s)
 Ultra LSI is sometimes used for 1,000,000s
4.3.2 System Design
 Create a high-level (Behavioral) representation of your system
 Tools: Verilog, VHDL, System C
 Synthesizable (PLD’s and/or ASIC)
 Non-synthesizable

4.3.3 Metal-oxide-semiconductor (MOS) and related VLSI technology

PMOS, NMOS, CMOS, BiCMOS, and GaAs are technologies widely used for IC
fabrication. Basic MOS transistor topics include the minimum line width, the transistor
cross-section, the charge-inversion channel, the source connection to the substrate, and
enhancement- versus depletion-mode devices. PMOS devices are about 2.5 times slower than
NMOS devices because of the difference between electron and hole mobilities.
Fabrication Technology

Silicon of extremely high purity is chemically purified and then grown into large
crystals. The crystals are sliced into wafers; wafer diameter is currently 150 mm, 200 mm, or
300 mm, wafer thickness is less than 1 mm, and the surface is polished to optical smoothness.
The wafer is then ready for processing. Each wafer will yield many chips, and the chip die size
varies from about 5 mm x 5 mm to 15 mm x 15 mm. A whole wafer is processed at a time.
Different parts of each die are made P-type or N-type by intentionally introducing small amounts
of other atoms (doping/implantation). Interconnections are made with metal; the insulation used
is typically SiO2, and SiN is also used. New materials (low-k dielectrics) are being investigated.

In CMOS fabrication, the p-well process, the n-well process, and the twin-tub process are
used. All the devices on the wafer are made at the same time. After the circuitry has been placed
on the chip, the chip is overglassed (with a passivation layer) to protect it; only those areas which
connect to the outside world are left uncovered (the pads). The wafer finally passes to a test
station, where test probes send test signal patterns to the chip and monitor the outputs of the
chip. The yield of a process is the percentage of dies which pass this testing. The wafer is then
scribed and separated into individual chips. These are then packaged, and the chips are 'binned'
according to their performance.

4.4 DESIGN OF VLSI

The complexity of VLSIs being designed and used today makes the manual approach to
design impractical. Design automation is the order of the day. With the rapid technological
developments in the last two decades, the status of VLSI technology is characterized by the
following.

1. A steady increase in the size and hence the functionality of the ICs.
2. A steady reduction in feature size and hence increase in the speed of operation as well
as gate or transistor density.
3. A steady improvement in the predictability of circuit behavior.
4. A steady increase in the variety and size of software tools for VLSI design.

The above developments have resulted in a proliferation of approaches to VLSI design.

We briefly describe the procedure of the automated design flow; the aim is mainly to bring out
the role of a Hardware Description Language (HDL) in the design process. An abstraction-based
model is the basis of automated design. The model divides the whole design cycle into various
domains. With such an abstraction through a division process, the design is carried out in
different layers. The designer at one layer can function without bothering about the layers above
or below. The thick horizontal lines separating the layers in the figure signify the
compartmentalization. As an example, let us consider design at the gate level.

The circuit to be designed would be described in terms of truth tables and state tables.
With these as the available inputs, the designer has to express them as Boolean logic equations
and realize them in terms of gates and flip-flops. Compartmentalization of the approach to design in the
manner described here is the essence of abstraction; it is the basis for development and use of
CAD tools in VLSI design at various levels. The design methods at different levels use the
respective aids such as Boolean equations, truth tables, state transition table, etc. But the aids play
only a small role in the process. To complete a design, one may have to switch from one tool to
another, raising the issues of tool compatibility and learning new environments.
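As a small illustration of the gate-level step just described (an illustrative Python sketch, with a full adder's sum output as a hypothetical example circuit), a truth table can be turned into a sum-of-products realization, i.e. an OR of AND terms:

```python
from itertools import product

# Truth table for the sum output of a 1-bit full adder (hypothetical
# example circuit): output = a XOR b XOR c.
truth = {(a, b, c): a ^ b ^ c for a, b, c in product((0, 1), repeat=3)}

# Sum-of-products realization: keep the input rows (minterms) for which
# the output is 1; each minterm becomes an AND of (possibly inverted)
# inputs, and the outputs of those AND gates are ORed together.
minterms = [row for row, out in truth.items() if out]

def sop(a, b, c):
    def term(row):  # behaviorally, an AND of literals matching this row
        return all(v == x for v, x in zip((a, b, c), row))
    return int(any(term(row) for row in minterms))

# The gate-level realization reproduces the original truth table.
assert all(sop(*row) == out for row, out in truth.items())
```

In practice the designer would also minimize the expression (e.g. with Karnaugh maps or a logic synthesizer) before mapping it to gates.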

4.4.1 CHALLENGES:

As microprocessors become more complex due to technology scaling, microprocessor


designers have encountered several challenges which force them to think beyond the design
plane, and look ahead to post-silicon:

• Power usage/Heat dissipation – As threshold voltages have ceased to scale with advancing
process technology, dynamic power dissipation has not scaled proportionally. Maintaining logic
complexity when scaling the design down only means that the power dissipation per area will go
up. This has given rise to techniques such as dynamic voltage and frequency scaling (DVFS) to
minimize overall power.

4.5 VLSI AND SYSTEMS

The following are advantages of integrated circuits at the system level:

a) Smaller physical size

Smallness is often an advantage in itself; consider portable televisions or handheld cellular
telephones.
b) Lower power consumption

Replacing a handful of standard parts with a single chip reduces total power consumption.
Reducing power consumption has a ripple effect on the rest of the system: a smaller, cheaper
power supply can be used; since less power consumption means less heat, a fan may no longer be
necessary; a simpler cabinet with less electromagnetic shielding may be feasible,
too.

c) Reduced cost

Reducing the number of components, the power supply requirements, cabinet costs, and
so on, will inevitably reduce system cost. The ripple effect of integration is such that the cost of a
system built from custom ICs can be less, even though the individual ICs cost more than the
standard parts they replace.

4.6 APPLICATIONS OF VLSI

 Electronic system in cars.

 Digital electronics control VCRs

 Transaction processing system, ATM

 Personal computers and Workstations

 Medical electronic systems.

Electronic systems now perform a wide variety of tasks in daily life. Electronic systems
in some cases have replaced mechanisms that operated mechanically, hydraulically, or by other
means; electronics are usually smaller, more flexible, and easier to service. In other cases
electronic systems have created totally new applications. Electronic systems perform a variety of
tasks, some of them visible, some more hidden:

Electronic systems in cars operate stereo systems and displays; they also control fuel
injection systems, adjust suspensions to varying terrain, and perform the control functions
required for anti-lock braking systems. Digital electronics compress and decompress video, even
at high-definition data rates, on-the-fly in consumer electronics. Low-cost terminals for Web
browsing still require sophisticated electronics, despite their dedicated function. Personal
computers and workstations provide word-processing, financial analysis, and games. Computers
include both central processing units and special-purpose hardware for disk access, faster screen
display, etc.
The growing sophistication of applications continually pushes the design and
manufacturing of integrated circuits and electronic systems to new levels of complexity. And
perhaps the most amazing characteristic of this collection of systems is its variety: as systems
become more complex, we build not a few general-purpose computers but an ever wider range of
special-purpose systems.

4.7 ASIC

An Application-Specific Integrated Circuit (ASIC) is an integrated circuit (IC)


customized for a particular use, rather than intended for general-purpose use. For example, a chip
designed solely to run a cell phone is an ASIC. Intermediate between ASICs and industry
standard integrated circuits, like the 7400 or the 4000 series, are application specific standard
products (ASSPs).

As feature sizes have shrunk and design tools improved over the years, the maximum
complexity (and hence functionality) possible in an ASIC has grown from 5,000 gates to over 100
million. Modern ASICs often include entire 32-bit processors, memory blocks including ROM,
RAM, EEPROM, Flash and other large building blocks. Such an ASIC is often termed a SOC
(system-on-a-chip). Designers of digital ASICs use a hardware description language (HDL), such
as Verilog or VHDL, to describe the functionality of ASICs.

Field-programmable gate arrays (FPGA) are the modern-day technology for building a
breadboard or prototype from standard parts; programmable logic blocks and programmable
interconnects allow the same FPGA to be used in many different applications. For smaller
designs and/or lower production volumes, FPGAs may be more cost effective than an ASIC
design even in production.

4.8 ASIC DESIGN FLOW

As with any other technical activity, development of an ASIC starts with an idea and
takes tangible shape through the stages of development. The first step in the process is to expand
the idea in terms of behavior of the target circuit. Through stages of programming, the same is
fully developed into a design description in terms of well defined standard constructs and
conventions.

The design is tested through a simulation process; it is to check, verify, and ensure that
what is wanted is what is described. Simulation is carried out through dedicated tools. With every
simulation run, the simulation results are studied to identify errors in the design description. The
errors are corrected and another simulation run carried out. Simulation and changes to design
description together form a cyclic iterative process, repeated until an error-free design is evolved.

Fig 4.1 ASIC Design Flow

Design description is an activity independent of the target technology or manufacturer. It


results in a description of the digital circuit. To translate it into a tangible circuit, one goes
through the physical design process. The same constitutes a set of activities closely linked to the
manufacturer and the target technology.

4.9 CMOS TECHNOLOGY

In the present decade the chips being designed are made from CMOS technology. CMOS
is Complementary Metal Oxide Semiconductor. It consists of both NMOS and PMOS transistors.
To understand CMOS better, we first need to know about the MOS transistor.
4.9.1 MOS Transistor

MOS stands for metal-oxide-semiconductor. The MOS field-effect transistor is the basic
element in the design of a large-scale integrated circuit. It is a voltage-controlled device. These
transistors are formed as a "sandwich" consisting of a semiconductor layer, usually a slice, or
wafer, from a single crystal of silicon; a layer of silicon dioxide (the oxide); and a layer of metal.
These layers are patterned in a manner which permits transistors to be formed in the
semiconductor material (the "substrate"). The MOS transistor has three regions: source, drain,
and gate. The source and drain regions are quite similar and are labeled depending on what they
are connected to. The source is the terminal, or node, which acts as the source of charge carriers;
charge carriers leave the source and travel to the drain. In the case of an N-channel MOSFET
(NMOS), the source is the more negative of the terminals; in the case of a P-channel device
(PMOS), it is the more positive of the terminals. The area under the gate oxide is called the
"channel". Below is a figure of a MOS transistor.

Fig 4.2 MOS Transistor

CMOS technology is made up of both NMOS and PMOS transistors. Complementary


Metal-Oxide Semiconductors (CMOS) logic devices are the most common devices used today in
the high density, large number transistor count circuits found in everything from complex
microprocessor integrated circuits to signal processing and communication circuits. The CMOS
structure is popular because of its inherent lower power requirements, high operating clock speed,
and ease of implementation at the transistor level. The complementary p-channel and n-channel
transistor networks are used to connect the output of the logic device to either the VDD or
VSS power supply rails for a given input logic state. The MOSFET transistors can be treated as
simple switches. The switch must be on (conducting) to allow current to flow between the source
and drain terminals.

In CMOS, each net has only one driver, but a gate output can drive many gate inputs. In CMOS technology, an output always drives other CMOS gate inputs. The charge carriers in PMOS transistors are holes, and the charge carriers in NMOS transistors are electrons. The mobility of electrons is roughly twice that of holes, so the output rise and fall times differ. To make them equal, the W/L ratio of the PMOS transistor is made about twice that of the NMOS transistor; this gives the PMOS and NMOS transistors the same drive strength. The on-resistance is proportional to L/W, so increasing the width decreases the resistance.
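The structure described above can be sketched at the switch level with Verilog's built-in MOS primitives. This is an illustrative model only; transistor sizing, such as the 2:1 PMOS/NMOS width ratio discussed above, cannot be expressed at this level of abstraction.

```verilog
// Switch-level sketch of a CMOS inverter using Verilog's nmos/pmos
// primitives (illustrative; W/L sizing is not representable here).
module cmos_inverter (output out, input in);
  supply1 vdd;              // VDD rail
  supply0 gnd;              // VSS/ground rail
  pmos p1 (out, vdd, in);   // PMOS conducts when in = 0, pulling out high
  nmos n1 (out, gnd, in);   // NMOS conducts when in = 1, pulling out low
endmodule
```

For any input value, exactly one of the two networks conducts, which is why a static CMOS gate draws essentially no current between transitions.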

4.9.2 Power Dissipation in CMOS IC’s

Most of the power dissipation in CMOS ICs is due to the charging and discharging of capacitors, and reducing power dissipation is the central issue in low-power CMOS IC design. The main sources of power dissipation are:

1. Dynamic Switching Power - Due to charging and discharging of circuit capacitances. A low-to-high output transition draws energy from the power supply; a high-to-low transition dissipates the energy stored on the load capacitance.

2. Short Circuit Current - It occurs when the rise/fall time at the input of a gate is longer than the output rise/fall time, so that both the pull-up and pull-down networks conduct briefly.

3. Leakage Current Power - It has two main causes: a. Reverse-bias diode leakage on transistor drains, which happens in CMOS designs when one transistor is off and the active transistor charges the drain up or down with respect to the off transistor's bulk potential; b. subthreshold leakage, the small current that flows through a transistor channel even when the device is nominally off.
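The dynamic switching term above is commonly estimated with the standard textbook approximation (not specific to any particular design):

```latex
P_{dyn} = \alpha \, C_L \, V_{DD}^{2} \, f
```

where alpha is the switching activity factor, C_L the switched load capacitance, V_DD the supply voltage, and f the clock frequency. The quadratic dependence on V_DD is why supply-voltage reduction is the most effective low-power technique.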

4.9.3 CMOS Transmission Gate

A PMOS transistor is connected in parallel with an NMOS transistor to form a transmission gate. When enabled, the transmission gate simply passes the value at its input to its output. It uses both an NMOS and a PMOS device because a PMOS transistor transmits a strong '1' while an NMOS transistor transmits a strong '0'.
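The parallel NMOS/PMOS structure can be sketched at the switch level as follows (Verilog also provides a built-in `cmos` primitive for exactly this structure; the explicit form below is for illustration):

```verilog
// Switch-level sketch of a CMOS transmission gate: an NMOS and a PMOS
// in parallel, driven by complementary controls. The NMOS passes a
// strong '0', the PMOS a strong '1', so together they pass both levels.
module transmission_gate (output out, input in, input ctrl);
  wire ctrl_n;
  not  (ctrl_n, ctrl);        // complementary control for the PMOS
  nmos n1 (out, in, ctrl);    // conducts when ctrl = 1
  pmos p1 (out, in, ctrl_n);  // conducts when ctrl = 1 (its gate is 0)
endmodule
```

When ctrl = 1 both devices conduct and the input value passes to the output; when ctrl = 0 the output is left floating (high impedance).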
4.10 SIMPLE ASIC DESIGN FLOW

For any design to work at a specific speed, timing analysis has to be performed: we need to check whether the design meets the speed requirement stated in the specification. This is done with a static timing analysis tool such as PrimeTime.

Fig 4.3 Implementation of Chip Design

4.10.1 Register Transfer Logic

RTL is expressed in Verilog or VHDL; this document covers the basics of Verilog. Verilog is a Hardware Description Language (HDL): a language used to describe a digital system, for example latches, flip-flops, and combinational and sequential elements. Essentially, Verilog can describe any kind of digital system, and a system can be designed in Verilog at any level of abstraction. The most important levels are:
 Behavior Level

This level describes a system by concurrent algorithms (behavioral). Each algorithm is itself sequential, meaning it consists of a set of instructions that are executed one after the other. There is no regard to the structural realization of the design.

 Register Transfer Level (RTL)

Designs at the Register-Transfer Level specify the characteristics of a circuit in terms of transfers of data between registers, together with the functionality, for example finite state machines. An explicit clock is used, and RTL designs carry exact timing information: data transfers are scheduled to occur at specific times.

 Gate level

The system is described in terms of gates (AND, OR, NOT, NAND, etc.). Signals can take only four logic states ('0', '1', 'X', 'Z'). Gate-level design is normally not written by hand, because the gate-level netlist is the output of logic synthesis.
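To make the levels concrete, here is the same 2-to-1 multiplexer described at two of the abstraction levels discussed above (module names are illustrative):

```verilog
// Behavioral / RTL style: describes what the circuit does.
module mux2_rtl (output reg y, input a, b, sel);
  always @(*) begin
    if (sel) y = b;   // select input b when sel = 1
    else     y = a;   // otherwise pass input a
  end
endmodule

// Gate level: explicit primitives, as a synthesizer might emit.
module mux2_gate (output y, input a, b, sel);
  wire sel_n, w0, w1;
  not (sel_n, sel);
  and (w0, a, sel_n);
  and (w1, b, sel);
  or  (y, w0, w1);
endmodule
```

Both modules implement identical logic; the RTL version is what a designer writes, while the gate-level version resembles what logic synthesis produces.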

4.10.2 Optimization

The circuit at the gate level - in terms of gates and flip-flops - can be redundant in nature, and can be minimized with the help of minimization tools. The minimized logic design is then converted to a circuit in terms of switch-level cells from standard libraries provided by the foundries. The cell-based design generated by the tool is the last step in the logical design process.

4.11 FPGA DESIGN FLOW

The designer facing a design problem must go through a series of steps between initial
ideas and final hardware. This series of steps is commonly referred to as the ‘design flow’. First,
after all the requirements have been spelled out, a proper digital design phase must be carried out.
It should be stressed that the tools supplied by the different FPGA vendors to target their chips do
not help the designer in this phase. They only enter the scene once the designer is ready to
translate a given design into working hardware.

The most common flow nowadays used in the design of FPGAs involves the following
subsequent phases:
Design entry - This step consists of transforming the design ideas into some form of computerized representation, most commonly using Hardware Description Languages (HDLs). The two most popular HDLs are Verilog and the Very High Speed Integrated Circuit HDL (VHDL) [2]. It should be noted that an HDL, as its name implies, is only a tool to describe a design that pre-existed in the mind, notes, and sketches of a designer; it is not a tool to
design electronic circuits. Another point to note is that HDLs differ from conventional software
programming languages in the sense that they don’t support the concept of sequential execution
of statements in the code. This is easy to understand if one considers the alternative schematic
representation of an HDL file: what one sees in the upper part of the schematic cannot be said to
happen before or after what one sees in the lower part.

Synthesis - The synthesis tool receives HDL and a choice of FPGA vendor and model. From
these two pieces of information, it generates a netlist which uses the primitives proposed by the
vendor in order to satisfy the logic behaviour specified in the HDL files. Most synthesis tools go
through additional steps such as logic optimization, register load balancing, and other techniques
to enhance timing performance, so the resulting netlist can be regarded as a very efficient
implementation of the HDL design.

Place and route - The placer takes the synthesized netlist and chooses a place for each of the
primitives inside the chip. The router’s task is then to interconnect all these primitives together
satisfying the timing constraints. The most obvious constraint for a design is the frequency of the
system clock, but there are more involved constraints one can impose on a design using the
software packages supported by the vendors.

Bit stream generation - FPGAs are typically configured at power-up from some sort of external permanent storage device, typically a flash memory. Once the place-and-route process is finished, the resulting choices for the configuration of each programmable element in the FPGA chip, be it logic or interconnect, must be stored in a file used to program the flash.

Of these four phases, only the first one is human-labour intensive. Somebody has to type in the HDL code, which can be tedious and error-prone for complicated designs involving, for example, lots of digital signal processing. This is the reason for the appearance, in recent years, of alternative flows which include a preliminary phase in which the user can draw blocks at a higher level of abstraction and rely on the software tool to generate the HDL.
The standard FPGA design flow starts with design entry using schematics or a hardware description language (HDL), such as Verilog HDL or VHDL. In this step, you create the digital circuit that is implemented inside the FPGA. The flow then proceeds through compilation, simulation, programming, and verification in the FPGA hardware. Before describing the recent evolution of FPDs, we first define the relevant terminology in the field. The three main categories of FPDs are delineated: Simple PLDs (SPLDs), Complex PLDs (CPLDs), and Field-Programmable Gate Arrays (FPGAs).

4.11.1 Introduction to High-Capacity FPDs

Prompted by the development of new types of sophisticated field-programmable devices, the process of designing digital hardware has changed dramatically over the past few years. Unlike previous generations of technology, in which board-level designs included large numbers of SSI chips containing basic gates, virtually every digital design produced today consists mostly of high-density devices. This applies not only to custom devices like processors and memory, but also to logic circuits such as state machine controllers, counters, registers, and decoders. When such circuits are destined for high-volume systems, they have traditionally been integrated into high-density gate arrays. However, gate array NRE costs are often too high, and gate arrays take too long to manufacture, to be viable for prototyping or other low-volume scenarios. For these reasons, most prototypes, and also many production designs, are now built using field-programmable devices. The most compelling advantages of field-programmable devices are instant manufacturing turnaround, low start-up costs, low financial risk and (since programming is done by the end user) ease of design changes.

The market for FPDs has grown dramatically over the past decade to the point where
there is now a wide assortment of devices to choose from. A designer today faces a daunting task
to research the different types of chips, understand what they can best be used for, choose a
particular manufacturer’s product, learn the intricacies of vendor-specific software and then
design the hardware. Confusion for designers is exacerbated by not only the sheer number of
field-programmable devices available, but also by the complexity of the more sophisticated
devices. The purpose of this paper is to provide an overview of the architecture of the various
types of field-programmable devices. The emphasis is on devices with relatively high logic
capacity. Before proceeding, we provide definitions of the terminology in this field. This is
necessary because the technical jargon has become somewhat inconsistent over the past few years
as companies have attempted to compare and contrast their products in literature.
4.11.2 Definitions of Relevant Terminology

The most important terminology used in this paper is defined below.

• Field-Programmable Device (FPD): A general term that refers to any type of integrated circuit
used for implementing digital hardware, where the chip can be configured by the end user to
realize different designs. Programming of such a device often involves placing the chip into a
special programming unit, but some chips can also be configured “in-system”. Another name for
FPDs is programmable logic devices (PLDs); although PLDs encompass the same types of chips
as FPDs, we prefer the term FPD.

• PLA: A Programmable Logic Array (PLA) is a relatively small FPD that contains two levels of logic, an AND-plane and an OR-plane, where both levels are programmable.

• PAL: Programmable Array Logic (PAL) is a relatively small FPD that has a programmable
AND-plane followed by a fixed OR-plane.

• SPLD: It refers to any type of Simple PLD, usually either a PLA or PAL.

• CPLD: A more Complex PLD that consists of an arrangement of multiple SPLD-like blocks on
a single chip. Alternative names (that will not be used in this paper) sometimes adopted for this
style of chip are Enhanced PLD (EPLD), Super PAL, Mega PAL, and others.

• FPGA: A Field-Programmable Gate Array is an FPD featuring a general structure that allows very high logic capacity. Whereas CPLDs feature logic resources with a wide number of inputs (AND planes), FPGAs offer narrower logic resources. FPGAs also offer a higher ratio of flip-flops to logic resources than do CPLDs.

• Interconnect: The wiring resources in an FPD.

• Programmable Switch: A user-programmable switch that can connect a logic element to an interconnect wire, or one interconnect wire to another.

• Logic Block: A relatively small circuit block that is replicated in an array in an FPD. When a
circuit is implemented in an FPD, it is first decomposed into smaller sub-circuits that can each be
mapped into a logic block. The term logic block is mostly used in the context of FPGAs, but it
could also refer to a block of circuitry in a CPLD.
• Logic Capacity: The amount of digital logic that can be mapped into a single FPD. This is
usually measured in units of “equivalent number of gates in a traditional gate array”. In other
words, the capacity of an FPD is measured by the size of gate array that it is comparable to. In
simpler terms, logic capacity can be thought of as “number of 2-input NAND gates”.

• Logic Density: The amount of logic per unit area in an FPD.

• Speed-Performance: Measures the maximum operating speed of a circuit when implemented in an FPD. For combinational circuits, it is set by the longest delay through any path; for sequential circuits it is the maximum clock frequency at which the circuit functions properly.

4.12 BASIC FPGA ARCHITECTURE

The most common FPGA architecture consists of an array of configurable logic blocks
(CLBs), I/O pads, and routing channels. Generally, all the routing channels have the same width
(number of wires). Multiple I/O pads may fit into the height of one row or the width of one
column in the array.

An application circuit must be mapped into an FPGA with adequate resources. While the
number of CLBs and I/Os required is easily determined from the design, the number of routing
tracks needed may vary considerably even among designs with the same amount of logic. (For
example, a crossbar switch requires much more routing than a systolic array with the same gate
count.) Since unused routing tracks increase the cost (and decrease the performance) of the part
without providing any benefit, FPGA manufacturers try to provide just enough tracks so that most
designs that will fit in terms of LUTs and IOs can be routed. This is determined by estimates such
as those derived from Rent's rule or by experiments with existing designs.

4.12.1 FPGA DESIGN AND PROGRAMMING

To define the behavior of the FPGA, the user provides a hardware description language
(HDL) or a schematic design. The HDL form might be easier to work with when handling large
structures because it's possible to just specify them numerically rather than having to draw every
piece by hand. On the other hand, schematic entry can allow for easier visualization of a design.

Then, using an electronic design automation tool, a technology-mapped netlist is generated. The netlist can then be fitted to the actual FPGA architecture using a process called place-and-route, usually performed by the FPGA company's proprietary place-and-route software. The user validates the map and place-and-route results via timing analysis, simulation, and other verification methodologies. Once the design and validation process is complete, the binary file generated (also using the FPGA company's proprietary software) is used to (re)configure the FPGA.

To go from schematic/HDL source files to an actual configuration, the source files are fed to a software suite from the FPGA/CPLD vendor that, through several steps, produces a configuration file. This file is then transferred to the FPGA/CPLD via a serial interface or to an external memory device such as an EEPROM.

The most common HDLs are VHDL and Verilog, although in an attempt to reduce the
complexity of designing in HDLs, which have been compared to the equivalent of assembly
languages, there are moves to raise the abstraction level through the introduction of alternative
languages.

4.12.2 Advantages of Using Hardware Description Languages (HDLs) to Design FPGA Devices

Using Hardware Description Languages (HDLs) to design high-density FPGA devices has the following advantages:
• Top-Down Approach for Large Projects: Designers use HDLs to create complex designs. The top-down approach to system design works well for large HDL projects that require many designers working together. After the design team determines the overall design plan, individual designers can work independently on separate code sections.

• Functional Simulation Early in the Design Flow: You can verify design functionality early in the design flow by simulating the HDL description. Testing your design decisions before the design is implemented at the Register Transfer Level (RTL) or gate level allows you to make any necessary changes early on.

• Synthesis of HDL Code to Gates: Synthesizing your hardware description to target the FPGA implementation decreases design time by allowing a higher-level design specification, rather than specifying the design from the FPGA base elements.

• Reduces the errors that can occur during a manual translation of a hardware description to a
schematic design.
• Allows you to apply the automation techniques used by the synthesis tool (such as machine
encoding styles and automatic I/O insertion) during optimization to the original HDL code. This
results in greater optimization and efficiency.

• Early Testing of Various Design Implementations: HDLs allow you to test different design implementations early in the design flow. Use the synthesis tool to perform the logic synthesis and optimization into gates. Additionally, Xilinx FPGA devices allow you to implement your design at your computer. Since the synthesis time is short, you have more time to explore different architectural possibilities at the Register Transfer Level (RTL). You can reprogram Xilinx FPGA devices to test several design implementations.
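The "functional simulation early in the flow" point above is typically exercised with a small testbench. The following self-contained sketch (module names are illustrative) drives a trivial design unit and checks its output in simulation, before any synthesis or implementation step:

```verilog
// A simple design under test (DUT).
module and2 (output y, input a, b);
  assign y = a & b;
endmodule

// Minimal testbench: applies stimulus and checks the response.
module and2_tb;
  reg a, b;
  wire y;
  and2 dut (.y(y), .a(a), .b(b));
  initial begin
    a = 0; b = 0; #10;
    a = 1; b = 0; #10;
    a = 1; b = 1; #10;
    if (y !== 1'b1) $display("FAIL: expected y = 1");
    else            $display("PASS");
    $finish;
  end
endmodule
```

Errors caught at this stage cost minutes to fix; the same errors found after place-and-route cost a full implementation iteration.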

4.12.3 FPGA TO ASIC COMPARISONS

There have been a small number of past attempts to quantify the gap between FPGAs and
ASICs which we will review here. One of the earliest statements quantifying the gap between
FPGAs and pre-fabricated media was by Brown. That work reported the logic density gap
between FPGAs and Mask-programmable Gate Arrays (MPGAs) to be between 8 to 12 times,
and the circuit performance gap to be approximately a factor of 3.

The basis for these numbers was a cursory comparison of the largest available gate counts in each technology, and anecdotal reports of the approximate operating frequencies in the two technologies at the time. While the latter may have been reasonable, the former suffered from optimistic gate counting in FPGAs, and from seeking to measure the gap against standard-cell implementations rather than the less common MPGA.

Aside from the reliance on anecdotal evidence, that analysis is dated because it does not include the impact of hard dedicated circuit structures, such as multipliers and block memories, that are now common. In this work, we address this issue by explicitly considering the incremental impact of such blocks. More recently, a detailed comparison of FPGA and ASIC implementations was performed.

They found that the delay of an FPGA lookup table (LUT) was approximately 12 to 14
times the delay of an ASIC gate. ASIC gate density was found to be approximately 45 times
greater than that possible in FPGAs when measured in terms of kilo-gates per square micron.
Finally, the dynamic power consumption of a LUT was found to be over 500 times greater than
the power of an ASIC gate. Both the density and the power consumption exhibited variability
across process generations but the cause of such variability was unclear. The main issue with this
work is that it also depends on the number of gates that can be implemented by a LUT.

Another study examined the area differences between FPGA and standard-cell designs. The authors implemented multiple circuits from eight different application domains, including areas such as radar and image processing, on the Xilinx Virtex-II FPGA. Since the Xilinx Virtex-II is fabricated in 0.15 µm CMOS technology, the area results were scaled up to allow direct comparison with 0.18 µm CMOS. Using this approach, they found that the FPGA implementation is only 7.2 times larger on average than a standard-cell implementation.

The measurements of the gaps between FPGAs and ASICs described in the previous section were generally based on simple estimates or single-point comparisons. To provide a more reliable measurement, our approach is to implement a range of benchmark circuits in both FPGAs and standard cells, designed using the same IC fabrication process geometry. This comparison was performed using 90 nm CMOS technologies to implement a large set of benchmarks; the FPGA device used is fabricated in TSMC's Nexsys 90 nm process.

4.13 VHDL & VERILOG

Both VHDL and Verilog are well established hardware description languages. They have
the advantage that the user can define high-level algorithms and low-level optimizations (gate-
level and switch-level) in the same language. A basic example of VHDL code, the evaluation of
the Fibonacci series, is shown below, and it is a good example of the points made above. The
code itself is reasonably straightforward for a software programmer to understand, provided that
he/she understands that this is a truly parallel language and all lines are executing “at once”.

It is also straightforward to simulate a simple design of this nature. However, it is surprisingly difficult to implement it in hardware, and this difficulty is a direct result of I/O issues. As noted above, for a design to work in hardware, access is required to resources that are external to the FPGA, such as memory, and an FPGA is, by its very nature, unaware of the components to which it is connected. If you want to retrieve a value from main memory and use it on the FPGA, then you need to instantiate a memory controller. Our early experiences with VHDL have indicated that it should only be used for FPGA development if you are in a position to work closely with experienced hardware designers throughout the development process.
5.1 SYNTHESIS TOOL

5.1.1 XILINX ISE 13.1

Xilinx, Inc. is the world's largest supplier of programmable logic devices, the inventor of
the field programmable gate array (FPGA) and the first semiconductor company with a fabless
manufacturing model.

Xilinx designs, develops and markets programmable logic products including integrated
circuits (ICs), software design tools, predefined system functions delivered as intellectual
property (IP) cores, design services, customer training, field engineering and technical support.
Xilinx sells both FPGAs and CPLDs programmable logic devices for electronic equipment
manufacturers in end markets such as communications, industrial, consumer, automotive and data
processing. Xilinx's FPGAs have even been used for the ALICE (A Large Ion Collider
Experiment) at the CERN European laboratory on the French-Swiss border to map and
disentangle the trajectories of thousands of subatomic particles.

The Virtex-II Pro, Virtex-4, Virtex-5, and Virtex-6 FPGA families are particularly
focused on system-on-chip (SoC) designers because they include up to two embedded IBM
PowerPC cores.

The ISE Design Suite is the central electronic design automation (EDA) product family sold by Xilinx. Its features include design entry and synthesis supporting Verilog or VHDL, place-and-route (PAR), complete verification and debug using the ChipScope Pro tools, and creation of the bit files used to configure the chip. XST (Xilinx Synthesis Technology) performs device-specific synthesis for the CoolRunner XPLA3/-II and XC9500/XL/XV families and generates an NGC file ready for the CPLD fitter.

The general flow of XST for CPLD synthesis is the following:

1. HDL synthesis of VHDL/Verilog designs
2. Macro inference
3. Module optimization
4. NGC file generation

Global CPLD Synthesis Options

This section describes the supported CPLD families and lists the XST options related to CPLD synthesis that can only be set from the Process Properties dialog box within the Project Navigator.

Families
Five families are supported by XST for CPLD synthesis:
1. CoolRunner XPLA3
2. CoolRunner -II
3. XC9500
4. XC9500XL
5. XC9500XV
The synthesis for the CoolRunner, XC9500XL, and XC9500XV families includes clock enable processing; you can allow or invalidate the clock enable signal (when invalidated, it is replaced by equivalent logic). Also, the selection of the macros which use the clock enable (counters, for instance) depends on the family type. A counter with clock enable will be accepted for the CoolRunner and XC9500XL/XV families, but rejected (replaced by equivalent logic) for XC9500 devices.

Implementation Details for Macro Generation

XST processes the following macros:

1. Adders
2. Subtractors
3. Adders/subtractors
4. Multipliers
5. Comparators
6. Multiplexers
7. Counters
8. Logical shifters
9. Registers (flip-flops and latches)
10. XORs
The macro generation is decided by the Macro Preserve option, which can take two
values: yes - macro generation is allowed or no - macro generation is inhibited. The general
macro generation flow is the following:
1. HDL infers macros and submits them to the low-level synthesizer.
2. Low-level synthesizer accepts or rejects the macros depending on the resources required for
the macro implementations.

An accepted macro becomes a hierarchical block. For a rejected macro, two cases are possible:
1. If the hierarchy is kept (Keep Hierarchy: Yes), the macro becomes a hierarchical block.
2. If the hierarchy is not kept (Keep Hierarchy: No), the macro is merged with the surrounding logic.

A rejected macro is replaced by equivalent logic generated by the HDL synthesizer. A rejected macro may also be decomposed by the HDL synthesizer into component blocks, so that one component may be a new macro requiring fewer resources than the initial one, and the other, smaller macro may be accepted by XST. For instance, a flip-flop macro with clock enable (CE) cannot be accepted when mapping onto the XC9500. In this case the HDL synthesizer will submit two new macros:
1. A flip-flop macro without the clock enable signal.
2. A MUX macro implementing the clock enable function.

Very small macros (2-bit adders, 4-bit multiplexers, shifters with a shift distance of less than 2) are always merged with the surrounding logic, independently of the Macro Preserve or Keep Hierarchy options, because the optimization process gives better results for larger components.
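The clock-enable decomposition described above for XC9500 devices can be sketched in Verilog (a hedged illustration of the transformation, not XST's literal output):

```verilog
// CE flip-flop as written by the designer.
module dff_ce (output reg q, input d, ce, clk);
  always @(posedge clk)
    if (ce) q <= d;          // update only when clock enable is high
endmodule

// Equivalent logic after decomposition: a MUX implementing the enable
// function feeds a plain flip-flop without clock enable.
module dff_ce_decomposed (output reg q, input d, ce, clk);
  wire d_mux = ce ? d : q;   // MUX macro: new data or recirculated q
  always @(posedge clk)
    q <= d_mux;              // flip-flop macro without clock enable
endmodule
```

Both modules have identical behavior; the second form uses only resources (a mux and a plain flip-flop) that the target family can accept.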

Improving Results

XST produces optimized netlists for the CPLD fitter, which fits them into the specified devices and creates the downloadable programming files. The CPLD low-level optimization in XST consists of logic minimization, subfunction collapsing, logic factorization, and logic decomposition. The result of the optimization process is an NGC netlist corresponding to Boolean equations, which will be reassembled by the CPLD fitter to make the best use of the macrocell capacities.

5.1.2 XST Design Constraints

Constraints are essential to help you meet your design goals or obtain the best
implementation of your circuit. Constraints are available in XST to control various aspects of the
synthesis process itself, as well as placement and routing. Synthesis algorithms and heuristics
have been tuned to automatically provide optimal results in most situations. In some cases,
however, synthesis may fail to initially achieve optimal results; some of the available constraints
allow you to explore different synthesis alternatives to meet your specific needs.

The following mechanisms are available to specify constraints:

1. Options provide global control on most synthesis aspects. They can be set either from within the Process Properties dialog box in the Project Navigator or from the command line.
2. VHDL attributes can be directly inserted into your VHDL code and attached to individual elements of the design to control both synthesis and placement and routing.
3. Constraints can be added as Verilog meta comments in your Verilog code.
4. Constraints can be specified in a separate constraint file.
Typically, global synthesis settings are defined within the Process Properties dialog box
in Project Navigator or with command line arguments, while VHDL attributes or Verilog meta
comments can be inserted in your source code to specify different choices for individual parts of
the design. Note that the local specification of a constraint overrides its global setting. Similarly,
if a constraint is set both on a node (or an instance) and on the enclosing design unit, the former
takes precedence for the considered node.
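As a hedged illustration of specifying a constraint locally in the source (mechanism 3 above), a synthesis attribute can be attached to a single module using the Verilog-2001 attribute syntax. The attribute name below is an example of the kind XST recognizes; exact names and spellings vary by tool version, so the XST user guide should be consulted before relying on them.

```verilog
// Local constraint sketch: ask the synthesizer to preserve this
// module as a hierarchical block, overriding any global setting.
(* keep_hierarchy = "yes" *)      // Verilog-2001 attribute form
module my_block (output reg q, input d, clk);
  always @(posedge clk)
    q <= d;
endmodule
```

Because the attribute is attached to one design unit, it overrides the global setting only for that unit, exactly as described in the paragraph above.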

5.1.3 ARCHITECTURAL OVERVIEW

The Spartan-3 family architecture consists of five fundamental programmable functional elements:

Configurable Logic Blocks (CLBs) contain RAM-based Look-Up Tables (LUTs) to implement logic and storage elements that can be used as flip-flops or latches. CLBs can be programmed to perform a wide variety of logical functions as well as to store data. Input/Output Blocks (IOBs) control the flow of data between the I/O pins and the internal logic of the device. Each IOB supports bidirectional data flow plus 3-state operation. Double Data-Rate (DDR) registers are included. The Digitally Controlled Impedance (DCI) feature provides automatic on-chip terminations, simplifying board designs.

Digital Clock Manager (DCM) blocks provide self-calibrating, fully digital solutions for
distributing, delaying, multiplying, dividing, and phase shifting clock signals. These elements are
organized as shown in Figure 5.1. A ring of IOBs surrounds a regular array of CLBs.

The XC3S50 has a single column of block RAM embedded in the array. Those devices
ranging from the XC3S200 to the XC3S2000 have two columns of block RAM. The XC3S4000
and XC3S5000 devices have four RAM columns. Each column is made up of several 18-Kbit
RAM blocks; each block is associated with a dedicated multiplier. The DCMs are positioned at
the ends of the outer block RAM columns. The Spartan-3 family features a rich network of traces
and switches that interconnect all five functional elements, transmitting signals among them.
Each functional element has an associated switch matrix that permits multiple connections to the
routing.

Configuration

Spartan-3 FPGAs are programmed by loading configuration data into robust, reprogrammable, static CMOS configuration latches (CCLs) that collectively control all functional elements and routing resources. Before powering on the FPGA, the configuration data is stored externally in a PROM or some other nonvolatile medium, either on or off the board.

Fig 5.1 SPARTAN-3 Family Architecture

After applying power, the configuration data is written to the FPGA using any of five different modes: Master Parallel, Slave Parallel, Master Serial, Slave Serial, and Boundary-Scan (JTAG). The Master and Slave Parallel modes use an 8-bit-wide SelectMAP port.

The recommended memory for storing the configuration data is the low-cost Xilinx
Platform Flash PROM family, which includes the XCF00S PROMs for serial configuration and
the higher density XCF00P PROMs for parallel or serial configuration.

I/O Capabilities

The SelectIO feature of Spartan-3 devices supports 18 single-ended standards and 8 differential standards. Many standards support the DCI feature, which uses integrated terminations to eliminate unwanted signal reflections.

Verilog HDL is one of the two most common Hardware Description Languages (HDLs) used by integrated circuit (IC) designers; the other is VHDL. HDLs allow a design to be simulated earlier in the design cycle in order to correct errors or experiment with different architectures. Designs described in HDL are technology-independent, easy to design and debug, and are usually more readable than schematics, particularly for large circuits.

Verilog can be used to describe designs at four levels of abstraction:

 Algorithmic level (much like c code with if, case and loop statements).
 Register transfer level (RTL uses registers connected by Boolean equations).
 Gate level (interconnected AND, NOR etc.)
 Switch level (the switches are MOS transistors inside gates).

The language also defines constructs that can be used to control the input and output of simulation. More recently, Verilog has been used as an input for synthesis programs which generate a gate-level description (a netlist) for the circuit. Some Verilog constructs are not synthesizable, and the way the code is written greatly affects the size and speed of the synthesized circuit. Most readers will want to synthesize their circuits, so non-synthesizable constructs should be used only for test benches - program modules used to generate the I/O needed to simulate the rest of the design. The words "not synthesizable" will be used, as needed, for examples and constructs that do not synthesize.

Verilog-2001 (released as IEEE 1364-2001) adds many significant enhancements to the Verilog language: greater support for configurable IP modeling, deep-submicron accuracy, and design management. Other enhancements make Verilog easier to use. These changes affect everyone who uses the Verilog language, as well as those who implement Verilog software tools. This section reviews and highlights the main features added to the Verilog standard in the IEEE 1364-2001 update. The focus is on new simulation and synthesis constructs. Where possible, the status of Synopsys support for the new features is also noted.
5.1.4 History of the IEEE 1364 Verilog standard

The Verilog Hardware Description Language was first introduced in 1984 as a proprietary language from Gateway Design Automation. The original Verilog language was designed to be used with a single product, the Gateway Verilog-XL digital logic simulator. In 1989, Gateway Design Automation was acquired by Cadence Design Systems. In 1990, Cadence released the Verilog Hardware Description Language and the Verilog Programming Language Interface (PLI) to the public domain. Open Verilog International (OVI) was formed to control the public-domain Verilog and to promote its usage. Cadence turned over to OVI the FrameMaker source files containing most, but not all, of the Cadence Verilog-XL user’s manual. This document became OVI’s Verilog 1.0 Reference Manual. In 1993, OVI released its Verilog 2.0 Reference Manual, which was submitted to the IEEE as the basis for a standard.

The IEEE formed a standards working group to create the standard, and, in 1995, IEEE
1364-1995 became the official Verilog standard. It is important to note that for Verilog-1995, the
IEEE standards working group did not consider any enhancements to the Verilog language. The
goal was to standardize the Verilog language the way it was being used at that time. The IEEE
working group also decided not to create an entirely new document for the IEEE 1364 standard.
Instead, the OVI FrameMaker files were used to create the IEEE standard. Since the origin of the
OVI manual was Gateway’s Verilog-XL user’s manual, the IEEE 1364-1995 and IEEE 1364-
2001 Verilog language reference manuals are still organized somewhat like a user’s guide.

Goals for the Verilog-2001 Standard

Work on the IEEE 1364-2001 Verilog standard began in January 1997. Three major goals were established:

 Enhance the Verilog language to help with today’s deep-submicron and intellectual-property modeling issues.
 Ensure that all enhancements were both useful and practical, and that simulator and synthesis vendors would implement Verilog-2001 in their products.
 Correct any errata or ambiguities in the IEEE 1364-1995 Verilog Language Reference Manual.

5.1.5 MODELING ENHANCEMENTS

Many enhancements improve the ease and accuracy of writing synthesizable RTL models. Other enhancements allow models to be more scalable and re-usable. With the exception of the following paragraph, only changes which add new functionality or syntax are listed here. Verilog-2001 also contains many clarifications to Verilog-1995 which do not add new functionality. Notes are added to the sub-sections indicating Synopsys support with Presto and VCS at the time this paper was completed. Since the inception of Verilog in 1984, the term “register” has been used to describe the group of variable data types in the Verilog language. “Register” is not a keyword; it is simply a name for a class of data types, namely: reg, integer, time, real, and realtime. The use of the term “register” is often a source of confusion for new users of Verilog, who sometimes assume that the term implies a hardware register (flip-flops).
6.1 IMPLEMENTATION OF PROPOSED METHOD

In the first category, the inputs and outputs of the Montgomery modular multiplication are represented in binary form, but intermediate results of the modular multiplication are kept in carry-save representation to avoid carry propagation. However, the format conversion from the carry-save representation of the final product into its binary representation must be performed at the end of each modular multiplication. This conversion can be accomplished simply by adding the carry and sum terms of the carry-save representation, but that addition still suffers from long carry propagation, and extra circuitry and time are probably needed for these conversions. The second category of approaches eliminates the repeated interim output-to-input format conversions by maintaining all inputs and outputs of the Montgomery modular multiplication in carry-save form, except in the final step for obtaining the result of the modular exponentiation. However, this implies that the number of operands in the modular multiplication must be increased, so that additional registers are required to store these operands. For example, prior work proposed two variants of the Montgomery multiplication algorithm which use carry-save adders (CSAs) to accomplish the modular exponentiation.
The first of these variants is based on a five-to-two CSA to avoid the repeated interim output-to-input format conversion. To further decrease the number of input operands from five to four, three input operands are selected and combined into the corresponding carry-save form at the beginning of each modular multiplication. However, extra multiplexers and select signals are necessary to choose the desired input operands for the four-to-two CSA. Moreover, additional registers are also required to store the combined input operands. Manochehri et al. proposed a Montgomery multiplication algorithm using pipelined carry-save addition to shorten the critical path delay of the five-to-two CSA. Although a significant reduction in the hardware requirement and the critical path can be achieved, the increased number of iterations probably results in lower throughput when compared to previous approaches. Later work introduced a simple and fast algorithm for radix-2 Montgomery multiplication; although an extra clock cycle is required, the performance and throughput of the five-to-two CSA can be appreciably improved. Shieh et al. presented an efficient modular multiplication/exponentiation algorithm employing the CSA and designed a new modular exponentiation architecture with a unified modular multiplication/square module to speed up the computation and reduce the hardware complexity. They also proposed a new Montgomery modular multiplication algorithm for high-speed hardware design. The corresponding Montgomery multiplier performs the partial-product accumulation and modular reduction in a pipelined fashion, so that the critical path delay is reduced from four-to-two to three-to-two carry-save addition at the expense of additional pipeline registers to store the intermediate values.
The CSA structure can be combined with other techniques and architectures to further improve the performance of Montgomery multipliers. However, these designs probably cause a large increase in hardware complexity and power consumption, which is undesirable for mobile devices. In addition, the previously mentioned CSA-based Montgomery multipliers did not consider the energy issue. Consequently, this work focuses on reducing the energy consumption and enhancing the performance of CSA-based Montgomery multipliers with only a slight area overhead. Several previous works have developed techniques to reduce the power/energy consumption of Montgomery multipliers. One work designed a low-power Montgomery multiplier composed of ripple-carry adders by employing custom CMOS designs of several basic building blocks, including logic gates, the full adder, and the D flip-flop. In another, latches named glitch blockers are placed at the outputs of some circuit modules to reduce the spurious transitions and the expected switching activities of high fan-out signals in a radix-4 scalable Montgomery multiplier. In this work, we attempt to reduce the energy consumption of the CSAs and registers in CSA-based Montgomery multipliers via techniques different from those above.
The goal is achieved by first modifying the CSA-based Montgomery algorithm to bypass the iterations that perform superfluous carry-save addition and register write operations in the add-shift loop. As a result, not only the addition and shift operations but also the number of clock cycles required to complete the Montgomery multiplication can be largely decreased, leading to significant energy saving and higher throughput. On the other hand, the well-known clock gating technique is also employed to reduce the energy consumption of most registers in the CSA-based Montgomery multiplier, except for those registers (e.g., the registers in the BRFA) that must be right-shifted at each clock cycle. To achieve further energy reduction, we adjust the internal behaviour and structure of the BRFA so that the gated-clock design technique can be applied to markedly decrease the energy consumption of the BRFA. Experimental results show that a 36% energy saving and a 19.7% cycle reduction can be achieved for the 1024-bit Montgomery multiplier by bypassing the superfluous operations. Additionally, applying clock gating to the registers and the proposed technique to the BRFA of the 1024-bit Montgomery multiplier leads to a further 24% energy reduction.
6.2 Algorithm for Montgomery Multiplier:
Algorithm MM: Radix-2 Montgomery Multiplication
Inputs : A, B, N (modulus)
Output : S[k]
1. S[0] = 0;
2. for i = 0 to k − 1 {
3. qi = ( S[i]0 + Ai × B0) mod 2;
4. S[i+1] = ( S[i] + Ai× B + qi × N ) / 2;
5. }
6. if ( S[k] ≥ N ) S[k] = S[k] − N;
7. return S[k];

Let the modulus N be a k-bit odd number and an extra factor R be defined as 2^k mod N, where 2^(k−1) ≤ N < 2^k. Given two integers a and b, where a, b < N, the N-residues of a and b with respect to R can be defined as A = a × R (mod N), B = b × R (mod N). (1) Based on (1), the Montgomery modular product Y of A and B can be obtained as Y = A × B × R^(−1) (mod N), where R^(−1) is the inverse of R modulo N, i.e., R × R^(−1) = 1 (mod N). The radix-2 version of the Montgomery modular multiplication algorithm, denoted Algorithm MM, which calculates the Montgomery modular product of A and B, is shown above. Note that the notation X_i denotes the i-th bit of X in binary representation, and the notation X_{i:j} indicates a segment of X from the i-th bit to the j-th bit.
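As a concrete illustration of Algorithm MM, the recurrence can be sketched in a few lines of Python (the function name mont_mul and the test values below are illustrative, not part of the hardware design):

```python
def mont_mul(A, B, N, k):
    """Radix-2 Montgomery multiplication: returns A*B*2^(-k) mod N."""
    S = 0
    for i in range(k):
        ai = (A >> i) & 1               # i-th bit of the multiplier A
        qi = (S + ai * B) & 1           # quotient bit that makes the sum even
        S = (S + ai * B + qi * N) >> 1  # exact division by 2
    if S >= N:                          # final conditional subtraction
        S -= N
    return S

# Example: k = 4, N = 13 (odd, 2^3 <= 13 < 2^4), A = 7, B = 9
# mont_mul returns S satisfying S * 2^k = A * B (mod N)
```

Because q_i is chosen so that the parenthesized sum is always even, the right shift never discards information, which is exactly why no division circuit is needed in hardware.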

6.2.1 Algorithm for Montgomery Multiplication using CSA:


Algorithm 2(a) Algorithm MM52: 5-to-2 CSA Montgomery Multiplication
Inputs : A1, A2, B1, B2, N (modulus)
Outputs : S1[k], S2[k]
1. S1[0] = 0;
2. S2[0] = 0;
3. for i = 0 to k − 1 {
4. qi = (S1[i]0 + S2[i]0 + Ai × (B10 + B20)) mod 2;
5. (S1[i+1], S2[i+1]) = (S1[i] + S2[i] + Ai × (B1 + B2)
+ qi × N) / 2;
6. }
7. return S1[k], S2[k];
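Algorithm MM52 can be mimicked in software by cascading three 3-to-2 carry-save stages. In the sketch below (the names csa and mm52 are illustrative), the running sum is kept as the pair (S1, S2), and for simplicity the multiplier A is taken in plain binary form:

```python
def csa(x, y, z):
    """3-to-2 carry-save stage: returns (sum word, shifted carry word)."""
    return x ^ y ^ z, ((x & y) | (x & z) | (y & z)) << 1

def mm52(A, B1, B2, N, k):
    """5-to-2 CSA Montgomery multiplication; B is held as B1 + B2."""
    S1 = S2 = 0
    for i in range(k):
        ai = (A >> i) & 1
        qi = (S1 + S2 + ai * (B1 + B2)) & 1  # Q_Logic: parity of the sum
        t1, t2 = csa(S1, S2, ai * B1)        # three cascaded 3-to-2 stages
        u1, u2 = csa(t1, t2, ai * B2)        # realize the 5-to-2 addition
        v1, v2 = csa(u1, u2, qi * N)
        S1, S2 = v1 >> 1, v2 >> 1            # both words are even here
    return S1, S2
```

Note that the result stays in carry-save form: S1 + S2 is congruent to A × (B1 + B2) × 2^(−k) mod N, avoiding any carry-propagating addition inside the loop.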

Algorithm 2(b) Algorithm MM42: 4-to-2 CSA


Montgomery Multiplication
Inputs : A1, A2, B1, B2, N (modulus)
Outputs : S1[k], S2[k]
1. (D1, D2) = B1 + B2 + N + 0;
2. S1[0] = 0; S2[0] = 0;
3. for i = 0 to k − 1 {
4. qi = (S1[i]0 + S2[i]0 + Ai × (B10 + B20)) mod 2;
5. if (Ai = 0 and qi = 0)
6. (S1[i+1], S2[i+1]) = (S1[i]+S2[i]+0+0) / 2;
7. else if (Ai = 0 and qi = 1)
8. (S1[i+1], S2[i+1]) = (S1[i]+S2[i]+N+0) / 2;
9. else if (Ai = 1 and qi = 0)
10. (S1[i+1], S2[i+1]) = (S1[i]+S2[i]+B1+B2) / 2;
11. else if (Ai = 1 and qi = 1)
12. (S1[i+1], S2[i+1]) = (S1[i]+S2[i]+D1+D2) / 2;
13. }
14. return S1[k], S2[k];

Figure 6.1: Montgomery Modular multiplier 4-2 using CSA

Fig. 6.1 illustrates the corresponding architecture of Algorithm MM42 (denoted the MM42 multiplier), where the dashed line denotes a 1-bit signal. When compared to the MM52 multiplier, the MM42 multiplier needs two extra registers (i.e., RD1 and RD2) and two 4-to-1 multiplexers addressed by the two control signals Ai and qi to shorten the overall computational delay. The signals Ai and qi are generated by the BRFA and the Q_Logic circuit, respectively.
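The four-way operand selection of Algorithm MM42 can likewise be sketched in Python (the names csa and mm42 are illustrative; the precomputed pair (D1, D2) stands in for the on-the-fly addition of B1 + B2 + N):

```python
def csa(x, y, z):
    """3-to-2 carry-save stage: returns (sum word, shifted carry word)."""
    return x ^ y ^ z, ((x & y) | (x & z) | (y & z)) << 1

def mm42(A, B1, B2, N, k):
    """4-to-2 CSA Montgomery multiplication with precomputed D1, D2."""
    D1, D2 = csa(B1, B2, N)        # D1 + D2 = B1 + B2 + N (step 1)
    S1 = S2 = 0
    for i in range(k):
        ai = (A >> i) & 1
        qi = (S1 + S2 + ai * (B1 + B2)) & 1
        # 4-to-1 operand selection driven by the pair (ai, qi)
        x, y = [(0, 0), (N, 0), (B1, B2), (D1, D2)][2 * ai + qi]
        t1, t2 = csa(S1, S2, x)    # two 3-to-2 stages form the 4-to-2 CSA
        u1, u2 = csa(t1, t2, y)
        S1, S2 = u1 >> 1, u2 >> 1
    return S1, S2
```

Only two CSA stages sit on the critical path each cycle, which is the delay advantage over the 5-to-2 version; the price is the multiplexers and the RD1/RD2 registers noted above.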

Fig.6.2 BRFA

Fig. 6.3. 4 to 2 CSA

6.2.2 Algorithm for Modified Montgomery Multiplication (MMM) using CSA:


Algorithm 4 Algorithm MMM42: 4-to-2 CSA Modified
Montgomery Multiplication
Inputs : A1, A2, B1, B2, N′ = N + 1 (new modulus)
Outputs : S1[k+3], S2[k+3]
1. ( B1, B2 ) = 0 + 0 + 2B1 + 2B2;
2. (D1, D2) = B1 + B2 + N′ + 0;
3. S1[−1] = 0; S2[−1] = 0;
4. q~ = 0; A~ = 0; i = −1;
5. while ( i ≤ k + 2 ) {
6. if (A~ = 0 and q~ = 0)
7. (S1[i+1], S2[i+1]) = (S1[i]+S2[i]+0+0) / 2;
8. else if (A~ = 0 and q~ = 1)
9. (S1[i+1], S2[i+1]) = (S1[i]+S2[i]+N′+0) / 2;
10. else if (A~ = 1 and q~ = 0)
11. (S1[i+1], S2[i+1]) = (S1[i]+S2[i]+B1+B2) / 2;
12. else if (A~ = 1 and q~ = 1)
13. (S1[i+1], S2[i+1]) = (S1[i]+S2[i]+D1+D2) / 2;
14. compute qi+1, qi+2, Ai+1, Ai+2, and bypassi+1;
15. if (bypassi+1 = 1) {
16. q~ = qi+2; A~ = Ai+2; i = i + 2;
17. S1[i+1] = S1[i+1]/2; S2[i+1] = S2[i+1]/2;
18. }
19. else {
20. q~ = qi+1; A~ = Ai+1; i = i + 1;
21. }
22. }
23. return S1[k+3], S2[k+3].
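The effect of steps 14–18 is easiest to see on the plain radix-2 recurrence: whenever the incoming multiplier bit and quotient bit are both zero, the iteration degenerates to a pure right shift, so the carry-save addition and the register writes can be skipped. The Python sketch below (the function name is illustrative) merely counts such bypassable iterations rather than modeling the two-bit-at-a-time control of Algorithm MMM42:

```python
def mont_mul_bypass_count(A, B, N, k):
    """Radix-2 Montgomery multiplication, counting shift-only iterations."""
    S, bypassable = 0, 0
    for i in range(k):
        ai = (A >> i) & 1
        qi = (S + ai * B) & 1
        if ai == 0 and qi == 0:
            bypassable += 1            # no add, no register write: S >>= 1
        S = (S + ai * B + qi * N) >> 1
    return S, bypassable
```

Every bypassed iteration saves one carry-save addition and the associated register writes; collapsing such iterations into their neighbors is what yields the cycle and energy reductions reported for the proposed multiplier.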
Existing System Block Diagram:

Fig. 6.4 MBRFA


Fig. 6.5 MBRFA (gates)

Proposed MMM 42 Multiplier:

Fig.6.6 Proposed MMM 42 multiplier


7.1 SIMULATION USING VERILOG

7.1.1 GENERAL

VHDL and Verilog are frequently used for two different goals: simulation of electronic designs and synthesis of such designs. Synthesis is a process in which a VHDL or Verilog description is compiled and mapped into an implementation technology such as an FPGA or an ASIC. Many FPGA vendors offer free tools to synthesize VHDL and Verilog for use with their chips, whereas ASIC tools are often very expensive. Not all constructs in VHDL and Verilog are suitable for synthesis. For example, most constructs that explicitly deal with timing are not synthesizable, despite being valid for simulation. While different synthesis tools have different capabilities, there exists a common synthesizable subset of VHDL and Verilog that defines what language constructs and idioms map into common hardware for many synthesis tools.

8.1 GENERAL

A snapshot captures the state of the application at a given moment while it is running. The snapshots below give a clear, elaborated view of the application and help a new user understand the subsequent steps.

8.2 SIMULATION RESULTS:

Existing Mod multiplier:

Fig. 8.1 Existing output


Modified Mod Multiplier:

Fig. 8.2 Modified modular output

Fig. 8.3 BRFA
Fig. 8.4 MBRFA

Fig.8.5 MM42
Fig.8.6 MMM42 CSA

Fig.8.7 MMM42

Total delay for MM42:


Delay: 106.046ns

Total 106.046ns (37.270ns logic, 68.776ns route)

(35.1% logic, 64.9% route)

For MMM42:

Delay: 17.915ns

Total 17.915ns (8.680ns logic, 9.235ns route)

(48.5% logic, 51.5% route)

Table 2. COMPARISONS OF DIFFERENT MODULAR EXPONENTIATION DESIGNS


Table 3: COMPARISONS OF DIFFERENT MONTGOMERY MULTIPLIERS WITH 512- AND 1024-BIT KEY SIZES

9.1 GENERAL

The proposed design is useful in data communication networks, where it reduces power consumption, and in any setting where security is necessary to protect sensitive data, such as smartphones, notebook computers with Internet access, and official websites.
FUTURE SCOPE

We synthesized these multipliers using the Synopsys Design Compiler and then performed power simulation using Synopsys PrimePower with random input patterns. The implementation results, including the hardware area (Area), the critical path delay (Delay), the power consumption (Power), the number of clock cycles required to complete the operations (Cycle), the energy consumption (Energy), and the throughput rate of these modular multipliers, are given in Table 3, where the throughput rate is formulated as the bit length multiplied by the frequency (the reciprocal of the delay) and then divided by the clock cycle number. Furthermore, P(−) and E(−) denote the power and energy decrements when compared with the MM42 multiplier. The results show that the proposed approach can also effectively reduce the energy consumption and enhance the throughput of modular exponentiation.
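The throughput formula just stated is easy to check numerically; the figures below are hypothetical placeholders, not measured values from the comparison tables:

```python
def throughput_bps(bit_len, delay_s, cycles):
    """Throughput = bit length x frequency / cycle count (bits per second)."""
    return bit_len * (1.0 / delay_s) / cycles

# Hypothetical 1024-bit multiplier: 5 ns critical path, 1028 clock cycles
rate = throughput_bps(1024, 5e-9, 1028)   # roughly 2e8 bits per second
```

Since the frequency is the reciprocal of the critical path delay, shortening the delay or cutting the cycle count (as the bypass technique does) both raise the throughput directly.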

CONCLUSION

High-speed Montgomery modular multipliers, which speed up the decryption/encryption process by maintaining all inputs and outputs of the modular multiplication in a redundant carry-save format, introduce more registers and higher energy consumption. This thesis presented an efficient algorithm and its corresponding architecture to reduce the energy consumption and enhance the throughput of Montgomery modular multipliers simultaneously. Moreover, we modified the structure of the BRFA and adopted the gated-clock design technique to further reduce the energy consumption of Montgomery modular multipliers. Experimental results showed that the proposed approaches are indeed capable of reducing the energy consumption of Montgomery multipliers. In the future, we will try to raise the probability of bypassing superfluous operations, to further reduce the energy consumption and enhance the throughput of modular multiplication.

REFERENCES

[1] R. L. Rivest, A. Shamir, and L. Adleman, “A method for obtaining digital signature and public-key
cryptosystems,” Commun. ACM, vol. 21, no. 2,
pp.120–126, Feb. 1978.
[2] P. L. Montgomery, “Modular multiplication without trial division,” Math. Comput., vol. 44, no. 170,
pp. 519–521, Apr. 1985.
[3] C. K. Koc, T. Acar, and B. S. Kaliski, “Analyzing and comparing Montgomery multiplication
algorithms,” IEEE Micro, vol. 16, no. 3,
pp.26–33, Jun. 1996.
[4] Y. S. Kim, W. S. Kang, and J. R. Choi, “Implementation of 1024-bit modular processor for RSA
cryptosystem,” in Proc. IEEE Asia-Pacific Conf., Aug. 2000, pp. 187–190.
[5] V. Bunimov, M. Schimmler, and B. Tolg, “A complexity-effective version of Montgomery’s
algorithm,” in Proc. Workshop Complexity Effect. Designs, May 2002, pp. 1–7.
[6] A. Cilardo, A. Mazzeo, N. Mazzocca, and L. Romano, “A novel unified architecture for public-key
cryptography,” in Proc. Design, Autom. Test Eur. Conf. Exhibit., Mar. 2005, pp. 52–57.
[7] Z. B. Hu, R. M. A. Shboul, and V. P. Shirochin, “An efficient architecture of 1024-bits
Cryptoprocessor for RSA cryptosystem based on modified Montgomery’s algorithm,” in Proc. 4th
IEEE Int. Workshop Intell Data Acquisit. Adv. Comput. Syst., Sep. 2007, pp. 643–646.
[8] C. McIvor, M. McLoone, and J. V. McCanny, “Modified Montgomery modular multiplication and
RSA exponentiation techniques,” IEE Proc.-Comput. Digit. Tech., vol. 151, no. 6, pp. 402–408, Nov.
2004.
[9] K. Manochehri and S. Pourmozafari, “Fast Montgomery modular mul-tiplication by pipelined CSA
architecture,” in Proc. IEEE Int. Conf. Microelectron., Dec. 2004, pp. 144–147.
[10] K. Manochehri and S. Pourmozafari, “Modified radix-2 Montgomery modular multiplication to make
it faster and simpler,” in Proc. IEEE Int. Conf. Inf. Technol., vol. 1. Apr. 2005, pp. 598–602.
[11] M.-D. Shieh, J.-H. Chen, H.-H. Wu, and W.-C. Lin, “A new modular exponentiation architecture for
efficient design of RSA cryptosystem,”
IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 9, pp. 1151–1161, Sep. 2008.
[12] M.-D. Shieh, J.-H. Chen, W.-C. Lin, and H.-H. Wu, “A new algorithm for high-speed modular
multiplication design,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 56, no. 9, pp. 2009–2019, Sep.
2009.
[13] F. Gang, “Design of modular multiplier based on improved Montgomery algorithm and systolic array,”
in Proc. 1st Int. Multi-Symp. Comput. Comput. Sci., vol. 2. 2006, pp. 356–359.
[14] S. S. Ghoreishi, M. A. Pourmina, H. Bozorgi, and M. Dousti, “High speed RSA implementation based
on modified Booth’s technique and Montgomery’s multiplication for FPGA platform,” in Proc. 2nd
Int. Conf. Adv. Circuits, Electron. Micro-Electron., 2009, pp. 86–93.
[15] G. Sassaw, C. J. Jimenez, and M. Valencia, “High radix implementation of Montgomery multipliers
with CSA,” in Proc. Int. Conf. Microelec-tron., 2010, pp. 315–318.
[16] J. C. Neto, A. F. Tenca, and W. V. Ruggiero, “A parallel k-partition method to perform Montgomery
multiplication,” in Proc. IEEE Int. Conf. Appl.-Specif. Syst., Arch. Process., Sep. 2011, pp. 251–254.
[17] A. Cilardo, A. Mazzeo, L. Romano, and G. P. Saggese, “Exploring the design-space for FPGA-based
implementation of RSA,” Micro process. Microsyst., vol. 28, no. 4, pp. 183–191, May 2004.
[18] D. Bayhan, S. B. Ors, and G. Saldamli, “Analyzing and comparing the Montgomery multiplication
algorithms for their power consumption,” in
Proc. Int. Conf. Comput. Eng. Syst., Nov. 2010, pp. 257–261.
[19] X. Wang, P. Noel, and T. Kwasniewski, “Low power design techniques for a Montgomery modular
multiplier,” in Proc. Int. Symp. Intell. Signal Process. Commun. Syst., 2005, pp. 449–452.
[20] H.-K. Son and S.-G. Oh, “Design and implementation of scalable low-power Montgomery multiplier,”
in Proc. IEEE Int. Conf. Comput. Design, 2004, pp. 524–531.
[21] R. Bhutada and Y. Manoli, “Complex clock gating with integrated clock gating cell,” in Proc. Int.
Conf. Design Technol. Integr. Syst. Nanoscale Era, Sep. 2007, pp. 164–169.
[22] D. R. Sulaiman, “Using clock gating technology for energy reduction in portable computers,” in Proc.
Int. Conf. Comput. Commun. Eng., May 2008, pp. 839–842.
[23] J. Chao, Y. Zhao, Z. Wang, S. Mai, and C. Zhang, “Low-power implementations of DSP through
operand isolation and clock gating,” in Proc. Int. Conf. ASIC, Oct. 2007, pp. 229–232.
[24] C. D. Walter, “Montgomery exponentiation needs no final subtractions,” Electron. Lett., vol. 35, no.
21, pp. 1831–1832, Oct. 1999.
[25] J. Ohban, V. G. Moshnyaga, and K. Inoue, “Multiplier energy reduc-tion through bypassing of partial
products,” in Proc. Asia-Pacif. Conf. Circuits Syst., vol. 2. Oct. 2002, pp. 13–17.
[26] J. C. Neto, A. F. Tenca, and W. V. Ruggiero, “Toward an efficient implementation of sequential
Montgomery Multiplication,” in Proc. Asilomar Conf. Signals, Syst. Comput., Nov. 2010, pp. 1680–
1684.
[27] Y.-Y. Zhang, Z. Li, L. Yang, and S.-W. Zhang, “An efficient CSA architecture for Montgomery
modular multiplication,” Microprocess. Microsyst., vol. 31, no. 7, pp. 456–459, Nov. 2007.
[28] A. P. Fournaris and O. Koufopavlou, “A new RSA encryption archi-tecture and hardware
implementation based on optimized Montgomery multiplication,” in Proc. IEEE Int. Symp. Circuits
Syst., May 2005, pp. 4645–4648.
[29] TSMC 0.13-μm (CL013G) Process 1.2-Volt SAGE-XTM Standard Cell Library Databook, Artisan
Components, Sunnyvale, CA, Jan. 2004.
[30] CIC Referenced Flow for Cell-Based IC Design, National Chip Implementation Center, Hsinchu,
Taiwan, 2008.

[31] G. Ramanjaneya Reddy and P. Harinatha Reddy, “Low power & efficient multiplier for RSA cryptosystems,” International Journal of Scientific Engineering & Technology Research, vol. 3, no. 20, pp. 4333–4339, Sep. 2014.
