CONTENTS
CHAPTER-1 INTRODUCTION
1.1 General
1.2 Literature Review
1.3 Proposed System
1.3.1 Advantages of the Proposed System
1.4 Thesis Outline
CHAPTER-2 CRYPTOGRAPHY
2.1 Public-Key Cryptosystems
2.2 Information Security
2.3 Public-Key Cryptography
2.4 How It Works
2.5 RSA Algorithm
CHAPTER-3 ADDER
3.1 Adder
3.1.1 Half Adder
3.1.2 Full Adder
3.2 Ripple Carry Adder/Subtractor
3.3 Carry Look-Ahead Adder
3.4 Carry Save Adder
CHAPTER-4 VLSI TECHNOLOGY
4.1 Introduction to VLSI
4.2 Why VLSI
4.2.1 Structured Design
4.3 What is VLSI
4.3.1 History of Scale Integration
4.3.2 System Design
4.3.3 MOS and Related VLSI Technology
4.4 Design of VLSI
4.4.1 Challenges
4.5 VLSI and Systems
4.6 Applications of VLSI
4.7 ASIC
4.8 ASIC Design Flow
4.9 CMOS Technology
4.9.1 MOS Transistor
4.9.2 Power Dissipation in CMOS ICs
4.9.3 CMOS Transmission Gate
4.10 Simple ASIC Design Flow
4.10.1 Register Transfer Logic
4.10.2 Optimization
4.11 FPGA Design Flow
4.11.1 Introduction to High-Capacity FPDs
4.11.2 Definitions of Relevant Terminology
4.12 Basic FPGA Architecture
4.12.1 FPGA Design & Programming
4.12.2 Advantages of HDLs for FPGA Devices
4.12.3 FPGA-to-ASIC Comparisons
4.13 VHDL & Verilog
CHAPTER-5 SOFTWARE REQUIREMENTS
5.1 Synthesis Tool
5.1.1 Xilinx ISE 13.1
5.1.2 XST Design Constraints
5.1.3 Architectural Overview
5.1.4 History of the IEEE 1364 Verilog Standard
5.1.5 Modeling Enhancements
CHAPTER-6 IMPLEMENTATION OF PROPOSED METHOD
6.1 Introduction
6.2 Algorithm for Montgomery Multiplier
6.2.1 Algorithm for Montgomery Multiplier for CSA
6.2.2 Algorithm for MMM Using CSA
CHAPTER-7 SIMULATION USING VERILOG
7.1 General
CHAPTER-8 RESULTS
8.1 General
8.2 Simulation Results
CHAPTER-9 APPLICATIONS
9.1 General
FUTURE SCOPE
CONCLUSION
REFERENCES
CHAPTER-1
INTRODUCTION
1.1 GENERAL
Due to the rapid increase in internet services and data communication, such as electronic
commerce, fundamental security requirements for protecting sensitive data during electronic
dissemination have become an important concern. Many systems utilize public-key cryptography
to provide such security services, and Rivest, Shamir, and Adleman (RSA) is one of the most
widely adopted public key algorithms at present. However, RSA requires repeated modular
multiplications to accomplish the computation of modular exponentiation, and the size of
modulus is generally at least 1024 bits for long-term security. Therefore, high data throughput
rates are difficult to achieve without hardware acceleration. Additionally, security
requirements are increasingly important for private data transmission through mobile devices
with Internet access, such as smart phones and notebook computers, which require an energy-
efficient cryptosystem due to their limitation of battery power. For this kind of application, it is
necessary to develop efficient hardware architectures to carry out fast modular multiplications
with low power consumption.
Previous hardware designs for this purpose fall into two broad categories. In the first category (e.g., [4]–[7]), the inputs and outputs of the Montgomery modular
multiplication are represented in binary form, but intermediate results of modular multiplication
are kept in carry-save representation to avoid the carry propagation. However, the format
conversion from the carry-save representation of the final product into its binary representation
must be performed at the end of each modular multiplication. This conversion can be simply
accomplished by adding the carry and sum terms of carry-save representation. But the addition
still suffers from long carry propagation, and extra circuit and time are probably needed for these
conversions. The second category of approaches (e.g., [8]–[11]) eliminates repeated interim
output-to-input format conversions through maintaining all inputs and outputs of the Montgomery
modular multiplication in carry-save form except the final step for getting the result of modular
exponentiation. However, this implies that the number of operands in modular multiplication
must be increased so that additional registers to store these operands are required.
For example, the work in [8] proposed two variants of Montgomery multiplication
algorithm, which use carry-save adder (CSA) to accomplish the modular exponentiation. The first
of these variants is based on a five-to-two CSA to avoid the repeated interim output-to-input
format conversion. To further decrease the number of input operands from five to four, three
input operands are selected and combined into the corresponding carry-save form at the
beginning of each modular multiplication. However, extra multiplexers and select signals are
necessary to choose the desired input operands for four-to-two CSA.
The internal behavior and structure of the BRFA are analyzed so that the gated clock design
technique can be applied to significantly decrease the energy consumption of the BRFA.
Experimental results show that a 36% energy saving and a 19.7% cycle reduction can be achieved
for the 1024-bit Montgomery multiplier by bypassing the superfluous operations. Additionally,
applying clock gating to registers and the proposed technique to the BRFA of the 1024-bit
Montgomery multiplier leads to 24% more energy reduction.
In previous work, the Carry Save Adder (CSA) was used to perform the addition in the
multiplier. To increase the speed of the multiplier, this research work designs a Montgomery
modular multiplier with a Carry Select Adder (CSLA). The proposed method does not use
large-sized multiplexers. The goal is achieved by first modifying the CSA-based Montgomery
algorithm to bypass the iterations that perform superfluous carry-save addition and register write
operations in the add-shift loop. As a result, not only the addition and shift operations but also
the number of clock cycles required to complete the Montgomery multiplication can be largely
decreased, leading to significant energy saving and higher throughput.
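As context for the add-shift loop discussed above, the classical radix-2 Montgomery multiplication can be sketched in Python. This is a minimal behavioral model for illustration only, not the referenced hardware designs or the thesis's Verilog; it inspects one bit of the operand per iteration and uses the parity of the partial sum to decide when to add the modulus.

```python
def montgomery_multiply(a, b, m, n):
    """Radix-2 Montgomery multiplication: returns a*b*2^(-n) mod m,
    for an odd modulus m with a, b < m < 2^n."""
    s = 0
    for _ in range(n):
        ai = a & 1                       # current bit of operand a
        qi = (s + ai * b) & 1            # parity decides whether to add m
        s = (s + ai * b + qi * m) >> 1   # adding qi*m makes the sum even
        a >>= 1
    return s - m if s >= m else s        # one final conditional subtraction
```

The result carries a factor of 2^(-n); in a modular exponentiation this factor is amortized across the whole sequence of multiplications, which is why the format-conversion cost discussed above matters only at the boundaries.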
1.4 THESIS OUTLINE
Chapter 1 describes previous techniques, their drawbacks, and the advantages of the
proposed method. Chapter 2 covers the basics of cryptography. Chapter 3 discusses the
implementation of the different adders used in the design. Chapter 4 covers the basics of VLSI
technology, the fabrication of MOS transistors, the ASIC design flow, the FPGA design flow,
and the hardware description languages used in VLSI. Chapter 5 discusses the synthesis tool
Xilinx ISE 13.1 and its XST design constraints. Chapter 6 covers the implementation of the
proposed method. Chapter 7 covers the design of the adder using Verilog code. Chapter 8 shows
the function of the CSA and the modified CSA using Xilinx ISE 13.1. Chapter 9 shows the
applications of the proposed method. The later sections present the future scope, conclusion, and
references for this document. The last section displays the international journal publication of
this project.
CHAPTER-2
CRYPTOGRAPHY
2.1 Public-Key Cryptosystems
In a public-key cryptosystem each user places in a public file an encryption procedure E.
That is, the public file is a directory giving the encryption procedure of each user. The user keeps
secret the details of his corresponding decryption procedure D. These procedures have the
following four properties:
(a) Deciphering the enciphered form of a message M yields M. Formally, D(E(M)) = M. (1)
(b) Both E and D are easy to compute.
(c) By publicly revealing E the user does not reveal an easy way to compute D. This means that in
practice only he can decrypt messages encrypted with E, or compute D efficiently.
(d) If a message M is first deciphered and then enciphered, M is the result. Formally, E(D(M)) = M. (2)
An encryption (or decryption) procedure typically consists of a general method and an encryption
key. The general method, under control of the key, enciphers a message M to obtain the
enciphered form of the message, called the ciphertext C. Everyone can use the same general
method; the security of a given procedure will rest on the security of the key. Revealing an
encryption algorithm then means revealing the key.
When the user reveals E he reveals a very inefficient method of computing D(C): testing all
possible messages M until one such that E(M) = C is found. If property (c) is satisfied, the number
of such messages to test will be so large that this approach is impractical. A function E satisfying
(a)-(c) is a "trap-door one-way function"; if it also satisfies (d) it is a "trap-door one-way
permutation". Diffie and Hellman [1] introduced the concept of trap-door one-way functions but
did not present any examples. These functions are called "one-way" because they are easy to
compute in one direction but (apparently) very difficult to compute in the other direction. They
are called "trap-door" functions since the inverse functions are in fact easy to compute once certain
private "trap-door" information is known. A trap-door one-way function which also satisfies (d)
must be a permutation: every message is the ciphertext for some other message and every
ciphertext is itself a permissible message. (The mapping is "one-to-one" and "onto".) Property (d) is
needed only to implement "signatures". The reader is encouraged to read Diffie and Hellman's
excellent article [1] for further background, for elaboration of the concept of a public-key
cryptosystem, and for a discussion of other problems in the area of cryptography. The ways in
which a public-key cryptosystem can ensure privacy and enable "signatures" are also due to Diffie
and Hellman. For our scenarios we suppose that A and B (also known as Alice and Bob) are two
users of a public-key cryptosystem. We will distinguish their encryption and decryption
procedures with subscripts: EA, DA, EB, DB.
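Properties (a) and (d) can be checked concretely with a toy RSA instance in Python. The numbers below are textbook-sized for illustration only; as Chapter 1 notes, real moduli are at least 1024 bits.

```python
p, q = 61, 53                  # toy primes (real RSA uses much larger ones)
n = p * q                      # public modulus, 3233
phi = (p - 1) * (q - 1)        # 3120
e = 17                         # public exponent, coprime to phi
d = pow(e, -1, phi)            # private exponent (Python 3.8+ modular inverse)

def E(M):                      # public encryption procedure
    return pow(M, e, n)

def D(C):                      # private decryption procedure
    return pow(C, d, n)

M = 65
assert D(E(M)) == M            # property (a): D(E(M)) = M
assert E(D(M)) == M            # property (d): E(D(M)) = M
```

Both properties hold because E and D are inverse permutations of the residues modulo n.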
2.2 Information Security:
For example, paper currency requires special inks and material to prevent counterfeiting.
Conceptually, the way information is recorded has not changed dramatically over time. Whereas
information was typically stored and transmitted on paper, much of it now resides on magnetic
media and is transmitted via telecommunications systems, some wireless. What has changed
dramatically is the ability to copy and alter information. One can make thousands of identical
copies of a piece of information stored electronically and each is indistinguishable from the
original. With information on paper, this is much more difficult.
What is needed then for a society where information is mostly stored and transmitted in
electronic form is a means to ensure information security which is independent of the physical
medium recording or conveying it and such that the objectives of information security rely solely
on digital information itself. One of the fundamental tools used in information security is the
signature. It is a building block for many other services such as non-repudiation, data origin
authentication, identification, and witnessing, to mention a few. Having learned the basics in
writing, an individual is taught how to produce a handwritten signature for the purpose of
identification. At contract age the signature evolves to take on a very integral part of the person’s
identity. This signature is intended to be unique to the individual and serve as a means to identify,
authorize, and validate.
With electronic information the concept of a signature needs to be redressed; it cannot
simply be something unique to the signer and independent of the information signed. Electronic
replication of it is so simple that appending a signature to a document not signed by the originator
of the signature is almost a triviality. Analogues of the “paper protocols” currently in use are
required. Hopefully these new electronic based protocols are at least as good as those they
replace. There is a unique opportunity for society to introduce new and more efficient ways of
ensuring information security. Much can be learned from the evolution of the paper based system,
mimicking those aspects which have served us well and removing the inefficiencies. Achieving
information security in an electronic society requires a vast array of technical and legal skills.
There is, however, no guarantee that all of the information security objectives deemed necessary
can be adequately met. The technical means is provided through cryptography.
In an asymmetric key encryption scheme, anyone can encrypt messages using the public key, but
only the holder of the paired private key can decrypt. Security depends on the secrecy of that
private key.
In some related signature schemes, the private key is used to sign a message; anyone can
check the signature using the public key. Validity depends on the security of the private key. In the
Diffie–Hellman key exchange scheme, each party generates a public/private key pair and
distributes the public key... After obtaining an authentic copy of each other's public keys, Alice
and Bob can compute a shared secret offline. The shared secret can be used, for instance, as the
key for a symmetric cipher.
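The offline shared-secret computation described above can be demonstrated with toy parameters in Python (illustrative sizes only; real deployments use far larger numbers).

```python
p, g = 23, 5                   # public prime modulus and generator (toy sizes)
a_priv, b_priv = 6, 15         # Alice's and Bob's private values
A = pow(g, a_priv, p)          # Alice's public key, sent to Bob
B = pow(g, b_priv, p)          # Bob's public key, sent to Alice

# Each side combines its own private value with the other's public key:
shared_alice = pow(B, a_priv, p)
shared_bob = pow(A, b_priv, p)
assert shared_alice == shared_bob   # same secret, computed offline
```

The equality holds because both sides compute g^(a_priv * b_priv) mod p; the shared value can then seed a symmetric cipher.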
The distinguishing technique used in public key cryptography is the use of asymmetric
key algorithms, where the key used to encrypt a message is not the same as the key used to
decrypt it. Each user has a pair of cryptographic keys — a public encryption key and a private
decryption key. The publicly available encrypting-key is widely distributed, while the private
decrypting-key is known only to the recipient. Messages are encrypted with the recipient's public
key and can be decrypted only with the corresponding private key. The keys are related
mathematically, but parameters are chosen so that determining the private key from the public
key is prohibitively expensive. The discovery of algorithms that could produce public/private key
pairs revolutionized the practice of cryptography beginning in the mid-1970s.
In contrast, symmetric-key algorithms, variations of which have been used for thousands
of years, use a single secret key which must be shared and kept private by both sender and
receiver for both encryption and decryption. To use a symmetric encryption scheme, the sender
and receiver must securely share a key in advance. Because symmetric-key algorithms are nearly
always much less computationally intensive, it is common to exchange a key using a key-
exchange algorithm and transmit data using that key and a symmetric key algorithm. PGP and the
SSL/TLS family of schemes do this, for instance, and are thus called hybrid cryptosystems.
2.4.1 Description
Public key encryption: a message encrypted with a recipient's public key cannot be
decrypted by anyone except a possessor of the matching private key, it is presumed that
this will be the owner of that key and the person associated with the public key used. This
is used for confidentiality.
Digital signatures: a message signed with a sender's private key can be verified by
anyone who has access to the sender's public key, thereby proving that the sender had
access to the private key (and therefore is likely to be the person associated with the
public key used), and that the signed part of the message has not been tampered with. On
the question of authenticity, see also message digest.
An analogy to public-key encryption is that of a locked mail box with a mail slot. The
mail slot is exposed and accessible to the public; its location (the street address) is in essence the
public key. Anyone knowing the street address can go to the door and drop a written message
through the slot; however, only the person who possesses the key can open the mailbox and read
the message. An analogy for digital signatures is the sealing of an envelope with a personal wax
seal. The message can be opened by anyone, but the presence of the seal authenticates the
sender.
A central problem for the use of public-key cryptography is confidence (ideally proof) that a
public key is correct, belongs to the person or entity claimed (i.e., is 'authentic'), and has not been
tampered with or replaced by a malicious third party. The usual approach to this problem is to use
a public-key infrastructure (PKI), in which one or more third parties, known as certificate
authorities, certify ownership of key pairs. PGP, in addition to a certificate authority structure, has
used a scheme generally called the "web of trust", which decentralizes such authentication of
public keys, substituting individual endorsements of the link between user and public key for a
central mechanism. No fully satisfactory solution to the public-key authentication problem is
known.
CHAPTER-3
ADDER
3.1.1 Half Adder
The half adder adds two one-bit binary numbers A and B. It has two outputs, S and C (the value
theoretically carried on to the next addition); the final sum is 2C + S. The simplest half-adder
design, pictured on the right, incorporates an XOR gate for S and an AND gate for C. With the
addition of an OR gate to combine their carry outputs, two half adders can be combined to make a
full adder.
Schematic symbol for a 1-bit full adder, with Cin and Cout drawn on the sides of the block to
emphasize their use in a multi-bit adder.
A full adder adds binary numbers and accounts for values carried in as well as out. A one-bit full
adder adds three one-bit numbers, often written as A, B, and Cin; A and B are the operands, and
Cin is a bit carried in from the previous, less-significant stage.[2] The full adder is usually a
component in a cascade of adders, which add 8-, 16-, 32-bit, etc. binary numbers. The circuit
produces a two-bit output sum typically represented by the signals Cout and S.
A full adder can be implemented in many different ways, such as with a custom transistor-
level circuit or composed of other gates. In one such implementation, the final OR gate before the
carry-out output may be replaced by an XOR gate without altering the resulting logic, since the
two carry terms can never both be 1 at once. Using only two types of gates is convenient if the
circuit is being implemented using simple IC chips which contain only one gate type per chip. In
this light, Cout can be implemented as Cout = (A AND B) OR (Cin AND (A XOR B)).
A full adder can be
constructed from two half adders by connecting A and B to the inputs of one half adder,
connecting the sum from that to an input of the second half adder, connecting Ci to the other
input, and ORing the two carry outputs. Equivalently, S could be made the three-bit XOR of A, B,
and Ci, and Cout could be made the three-bit majority function of A, B, and Ci.
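The gate equations above can be modeled behaviorally in Python (an illustrative sketch only; the thesis itself implements the adders in Verilog): S = A XOR B and C = A AND B for the half adder, and a full adder built from two half adders with an OR combining the carries.

```python
def half_adder(a, b):
    """1-bit half adder: returns (sum, carry) = (a XOR b, a AND b)."""
    return a ^ b, a & b

def full_adder(a, b, cin):
    """1-bit full adder built from two half adders plus an OR gate
    combining their carry outputs."""
    s1, c1 = half_adder(a, b)      # first half adder: a + b
    s, c2 = half_adder(s1, cin)    # second half adder: partial sum + carry-in
    return s, c1 | c2              # the two carries can never both be 1
```

An exhaustive check of 2*Cout + S against a + b + Cin over all eight input combinations confirms that the final sum is 2C + S, as stated above.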
A simple ripple carry adder is a digital circuit that produces the arithmetic sum of two binary
numbers. It can be constructed with full adders connected in cascade, with the carry output from
each full adder connected to the carry input of the next full adder in the chain. Figure 2 shows the
interconnection of four full adder (FA) circuits to provide a 4-bit ripple carry adder. Notice from
Figure 2 that the input is from the right side because the first cell traditionally represents the least
significant bit (LSB). Bits a0 and b0 in the figure represent the least significant bits of the
numbers to be added. The sum output is represented by the bits s0–s3. The main problem with
this type of adder is the delays needed to produce the carry out signal and the most significant
bits. These delays increase with the increase in the number of bits to be added.
87644322
+12355678
=100000000
Using ordinary arithmetic addition, we work from right to left: 2 + 8 = 10, so we write 0 and
carry 1; 2 + 7 + 1 = 10, so we write 0 and carry 1; and so on up to the end of the sum. The carry
ripples through every digit, so adding two n-digit numbers this way takes time proportional to n.
In electronic terms, using binary bits, this means that even if we have n one-bit adders at our
disposal, we still have to allow a time proportional to n to allow a possible carry to propagate
from one end of the number to the other. Until we have done this: (1) we do not know the final
result of the addition; and (2) we do not know whether the result of the addition is larger or
smaller than a given number (for instance, we do not know whether it is positive or negative).
A carry look-ahead adder can reduce the delay. In principle the delay can be reduced so
that it is proportional to log n, but for large numbers this is no longer the case, because even when
carry look-ahead is implemented, the distances that signals have to travel on the chip increase in
proportion to n, and propagation delays increase at the same rate. Once we get to the 512-bit to
2048-bit number sizes that are required in public-key cryptography, carry look-ahead is not of
much help.
Binary sum:
10111010101011011111000000001101
+ 11011110101011011011111011101111.
Carry-save arithmetic works by abandoning the binary notation while still working to base 2.
10111010101011011111000000001101
+11011110101011011011111011101111
= 21122120202022022122111011102212.
The notation is unconventional but the result is still unambiguous. Moreover, given n
adders (here, n=32 full adders), the result can be calculated in a single tick of the clock, since
each digit result does not depend on any of the others. If the adder is required to add two numbers
and produce a result, carry-save addition is useless, since the result still has to be converted back
into binary and this still means that carries have to propagate from right to left. But in large-
integer arithmetic, addition is a very rare operation, and adders are mostly used to accumulate
partial sums in a multiplication.
Supposing that we have two bits of storage per digit, we can use a redundant binary
representation, storing the values 0, 1, 2, or 3 in each digit position. It is therefore obvious that
one more binary number can be added to our carry-save result without overflowing our storage
capacity.
The key to success is that at the moment of each partial addition we add three bits: 0 or 1 from
the number we are adding; 0 if the digit already in our store is 0 or 2, or 1 if it is 1 or 3; and 0 if
the digit to its right is 0 or 1, or 1 if it is 2 or 3.
3.4.3 Drawbacks:
At each stage of a carry-save addition, we know the result of the addition at once. However,
we still do not know whether the result of the addition is larger or smaller than a given number
(for instance, we do not know whether it is positive or negative). This latter point is a drawback
when using carry-save adders to implement modular multiplication (multiplication followed by
division, keeping the remainder only). If we cannot know whether the intermediate result is
greater or less than the modulus, how can we know whether to subtract the modulus or not?
Montgomery multiplication, which depends on the rightmost digit of the result, is one solution;
though, rather like carry-save addition itself, it carries a fixed overhead, so that a sequence of
Montgomery multiplications saves time but a single one does not. Fortunately, exponentiation,
which is effectively a sequence of multiplications, is the most common operation in public-key
cryptography. The entire sum can then be computed by:
1. Shifting the carry sequence sc left by one place.
2. Appending a 0 to the front (most significant bit) of the partial sum sequence ps.
3. Using a ripple carry adder to add these two together and produce the resulting (n + 1)-bit value.
When adding together three or more numbers, using a carry-save adder followed by a ripple carry
adder is faster than using two ripple carry adders. This is because a ripple carry adder cannot
compute a sum bit without waiting for the previous carry bit to be produced, and thus has a delay
equal to that of n full adders. A carry-save adder, however, produces all of its output values in
parallel, and thus has the same delay as a single full-adder. Thus the total computation time (in
units of full-adder delay time) for a carry-save adder plus a ripple carry adder is n + 1, whereas
for two ripple carry adders it would be 2n.
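The reduction and final conversion described above can be sketched in Python (a bitwise model for illustration, not hardware; the left shift performs the "shift the carry sequence sc left by one place" step).

```python
def carry_save_add(a, b, c):
    """Reduce three numbers to a (partial sum, shifted carry) pair.
    Every bit position is an independent full adder, so the whole
    reduction takes a single full-adder delay regardless of width."""
    ps = a ^ b ^ c                             # per-bit sum (no carries)
    sc = ((a & b) | (a & c) | (b & c)) << 1    # per-bit carry, pre-shifted left
    return ps, sc

# One conventional (e.g. ripple carry) addition converts back to binary:
ps, sc = carry_save_add(87644322, 12355678, 5)
assert ps + sc == 87644322 + 12355678 + 5
```

The identity a + b + c = ps + sc holds for any operands, which is why only the single final conversion needs a carry-propagating adder.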
CHAPTER-4
VLSI TECHNOLOGY
4.1 INTRODUCTION
By the mid-eighties, the transistor count on a single chip had already exceeded 1000, and
hence came the age of Very Large Scale Integration, or VLSI. Though many improvements have
been made and the transistor count is still rising, further generation names such as ULSI are
generally avoided. It was during this time that TTL lost the battle to the MOS family, owing to
the same problems that had pushed vacuum tubes into obsolescence: power dissipation and the
limit it imposed on the number of gates that could be placed on a single die. The second age of
the integrated circuit revolution started with the introduction of the first microprocessors, the
4004 by Intel in 1971 and the 8080 in 1974. Today many companies like Texas Instruments,
Infineon, Alliance Semiconductors, Cadence, Synopsys, Celox Networks, Cisco, Micron Tech,
National Semiconductors, ST Microelectronics, Qualcomm, Lucent, Mentor Graphics, Analog
Devices, Intel, Philips, Motorola and many other firms have been established and are dedicated
to various fields of VLSI, such as programmable logic devices, hardware description languages,
design tools, embedded systems, etc.
The following taxonomy of integration levels is a hold-over from the 1980s and is now
outdated. It was obviously influenced by the frequency bands (HF, VHF, UHF), and sources
disagree on what is measured (gates or transistors):
SSI – Small-Scale Integration (up to 10^2)
MSI – Medium-Scale Integration (10^2 - 10^3)
LSI – Large-Scale Integration (10^3 - 10^5)
VLSI – Very Large-Scale Integration (10^5 - 10^7)
ULSI – Ultra Large-Scale Integration (>= 10^7)
VLSI Technology, Inc was a company which designed and manufactured custom and
semi-custom ICs. The company was based in Silicon Valley, with headquarters at 1109 McKay
Drive in San Jose, California. Along with LSI Logic, VLSI Technology defined the leading edge
of the application-specific integrated circuit (ASIC) business, which accelerated the push of
powerful embedded systems into affordable products. The company was founded in 1979 by a
trio from Fairchild Semiconductor by way of Synertek - Jack Balletto, Dan Floyd, Gunnar
Wetlesen - and by Doug Fairbairn of Xerox PARC and Lambda (later VLSI Design) magazine.
The first semiconductor chips held two transistors each. Subsequent advances added
more and more transistors, and, as a consequence, more individual functions or systems were
integrated over time. The first integrated circuits held only a few devices, perhaps as many as ten
diodes, transistors, resistors and capacitors, making it possible to fabricate one or more logic
gates on a single device. Now known retrospectively as small-scale integration (SSI),
improvements in technique led to devices with hundreds of logic gates, known as medium-scale
integration (MSI). Further improvements led to large-scale integration (LSI), i.e. systems with at
least a thousand logic gates. Current technology has moved far past this mark and today's
microprocessors have many millions of gates and billions of individual transistors.
At one time, there was an effort to name and calibrate various levels of large-scale
integration above VLSI. Terms like ultra-large-scale integration (ULSI) were used. But the huge
number of gates and transistors available on common devices has rendered such fine distinctions
moot. Terms suggesting greater than VLSI levels of integration are no longer in widespread use.
As of early 2008, billion-transistor processors are commercially available. This is
expected to become more commonplace as semiconductor fabrication moves from the current
generation of 65 nm processes to the next 45 nm generations (while experiencing new challenges
such as increased variation across process corners). A notable example is Nvidia's 280 series
GPU. This GPU is unique in the fact that almost all of its 1.4 billion transistors are used for logic,
in contrast to the Itanium, whose large transistor count is largely due to its 24 MB L3 cache.
Certain high-performance logic blocks like the SRAM (Static Random Access Memory) cell,
however, are still designed by hand to ensure the highest efficiency (sometimes by bending or
breaking established design rules to obtain the last bit of performance by trading
stability). VLSI technology is moving towards radical levels of miniaturization with the
introduction of NEMS technology. A lot of problems need to be sorted out before the transition is
actually made.
Digital VLSI design in CMOS technology encompasses CMOS devices and circuits,
fabrication processes, static and dynamic logic structures, chip layout, simulation and testing,
low-power techniques, design tools and methodologies, and VLSI architecture. Full-custom
techniques are used to design basic cells and regular structures such as data-paths and memory,
with an emphasis on modern design issues in interconnect and clocking; real-world VLSI
designs (e.g. the Pentium, Alpha, PowerPC, and StrongARM) illustrate these issues. Very-large-
scale integration (VLSI) is the process of creating integrated circuits by combining thousands of
transistor-based circuits into a single chip. VLSI began in the 1970s when complex
semiconductor and communication technologies were being developed. The microprocessor is a
VLSI device. The term is no longer as common as it once was, as chips have increased in
complexity into the hundreds of millions of transistors.
Even VLSI is now somewhat quaint, given the common assumption that all microprocessors
are VLSI or better. An example of a commercially available billion-transistor processor is
Intel's Montecito Itanium chip; another notable example is NVIDIA's 280 series GPU.
This GPU is unique in that its 1.4 billion transistors, capable of a teraflop of performance, are
almost entirely dedicated to logic (the Itanium's transistor count is largely due to its 24 MB L3
cache). Current designs, as opposed to the earliest devices, use extensive
design automation and automated logic synthesis to lay out the transistors, enabling higher levels
of complexity in the resulting logic functionality. Certain high-performance logic blocks like the
SRAM cell, however, are still designed by hand to ensure the highest efficiency (sometimes by
bending or breaking established design rules to obtain the last bit of performance by trading
stability). Thanks to its Caltech and UC Berkeley alumni, VLSI Technology, Inc. was an important pioneer in the
electronic design automation industry. It offered a sophisticated package of tools, originally based
on the 'lambda-based' design style advocated by Carver Mead and Lynn Conway. VLSI became
an early vendor of standard cell (cell-based technology) to the merchant market in the early 80s
where the other ASIC-focused company, LSI Logic, was a leader in gate arrays. Prior to VLSI's
cell-based offering, the technology had been primarily available only within large vertically
integrated companies with semiconductor units such as AT&T and IBM.
VLSI's design tools eventually included not only design entry and simulation but also
cell-based routing (a chip compiler), a data path compiler, SRAM and ROM compilers,
and a state machine compiler. The tools were an integrated design solution for IC design and not
just point tools, or more general purpose system tools. Characterization tools were integrated to
generate FrameMaker data sheets for libraries. VLSI eventually spun off the CAD and library
operation into Compass Design Automation but it never reached IPO before it was purchased by
Avanti Corp.
VLSI's physical design tools were critical not only to its ASIC business, but also in
setting the bar for the commercial EDA industry. When VLSI and its main ASIC competitor, LSI
Logic, were establishing the ASIC industry, commercially-available tools could not deliver the
productivity necessary to support the physical design of hundreds of ASIC designs each year
without the deployment of a substantial number of layout engineers. The EDA industry finally
caught up in the late 1980s when Tangent Systems released its TanCell and TanGate products.
VLSI had not been timely in developing a 1.0 µm manufacturing process as the rest of
the industry moved to that geometry in the late 80s. VLSI entered a long-term technology
partnership with Hitachi and finally released a 1.0 µm process and cell library (actually more of a
1.2 µm library with a 1.0 µm gate). As VLSI struggled to gain parity with the rest of the industry
in semiconductor technology, the design flow was moving rapidly to a Verilog HDL and
synthesis flow. Cadence acquired Gateway, the leader in the Verilog hardware description
language (HDL), and Synopsys was dominating the exploding field of design synthesis. As VLSI's tools
were being eclipsed, VLSI waited too long to open the tools up to other fabs and Compass Design
Automation was never a viable competitor to industry leaders.
Scientists and innovations from the 'design technology' part of VLSI found their way to
Cadence Design Systems (by way of Redwood Design Automation). Compass Design
Automation (VLSI's CAD and Library spin-off) was sold to Avant! Corporation, which itself was
acquired by Synopsys.
Structured VLSI design is a modular methodology originated by Carver Mead and Lynn
Conway for saving microchip area by minimizing the area of the interconnect fabric. This is
obtained by a repetitive arrangement of rectangular macro blocks which can be interconnected by
wiring by abutment. An example is partitioning the layout of an adder into a row of equal bit-slice cells.
In complex designs this structuring may be achieved by hierarchical nesting. Structured VLSI
design was popular in the early 1980s but lost popularity later, because the advent of placement
and routing tools wasted a great deal of area on routing, which was tolerated thanks to the
progress of Moore's Law. When introducing the hardware description language KARL in the
mid-1970s, Reiner Hartenstein coined the term "structured VLSI design" (originally "structured
LSI design"), echoing Edsger Dijkstra's structured programming approach, which uses procedure
nesting to avoid chaotic, spaghetti-structured programs.
4.3 WHAT IS VLSI?
VLSI stands for "Very Large Scale Integration". This is the field which involves packing
more and more logic devices into smaller and smaller areas.
PMOS, NMOS, CMOS, BiCMOS, and GaAs are widely used technologies for IC
fabrication. Key aspects of basic MOS transistors include the minimum line width, the transistor
cross-section, the charge-inversion channel, the source-to-substrate connection, and enhancement
versus depletion mode devices. PMOS devices are about 2.5 times slower than NMOS devices
because hole mobility is lower than electron mobility.
Fabrication Technology
Silicon of extremely high purity is chemically refined and grown into large single
crystals. These crystals are sliced into wafers; wafer diameters are currently 150 mm, 200 mm, or
300 mm, wafer thickness is under 1 mm, and the surface is polished to optical smoothness. The
wafer is then ready for processing. Each wafer yields many chips, with die sizes varying from
about 5 mm x 5 mm to 15 mm x 15 mm. A whole wafer is processed at a time; different parts of
each die are made p-type or n-type by intentionally introducing small amounts of other atoms
(doping, by implantation). Interconnections are made with metal; the insulation used is typically
SiO2, and SiN is also used. New materials, such as low-k dielectrics, are being investigated.
In CMOS fabrication, the p-well process, the n-well process, and the twin-tub process are
used. All the devices on the wafer are made at the same time. After the circuitry has been placed
on the chip, the chip is over-glassed with a passivation layer to protect it; only those areas which
connect to the outside world (the pads) are left uncovered. The wafer then passes to a test station,
where test probes send test signal patterns to the chip and monitor its output. The yield of a
process is the percentage of dies that pass this testing. The wafer is then scribed and separated
into individual chips, which are packaged and 'binned' according to their performance.
4.4 DESIGN OF VLSI
The complexity of the VLSI circuits being designed and used today makes the manual
approach to design impractical. Design automation is the order of the day. With the rapid
technological developments of the last two decades, the status of VLSI technology is
characterized by the following:
1. A steady increase in the size and hence the functionality of the ICs.
2. A steady reduction in feature size and hence increase in the speed of operation as well
as gate or transistor density.
3. A steady improvement in the predictability of circuit behavior.
4. A steady increase in the variety and size of software tools for VLSI design.
The circuit to be designed is first described in terms of truth tables and state tables.
With these as inputs, the designer has to express them as Boolean logic equations and realize
them in terms of gates and flip-flops. Compartmentalization of the approach to design in the
manner described here is the essence of abstraction; it is the basis for development and use of
CAD tools in VLSI design at various levels. The design methods at different levels use the
respective aids such as Boolean equations, truth tables, state transition table, etc. But the aids play
only a small role in the process. To complete a design, one may have to switch from one tool to
another, raising the issues of tool compatibility and learning new environments.
4.4.1 CHALLENGES
•Power usage/Heat dissipation – As threshold voltages have ceased to scale with advancing
process technology, dynamic power dissipation has not scaled proportionally. Maintaining logic
complexity when scaling the design down only means that the power dissipation per area will go
up. This has given rise to techniques such as dynamic voltage and frequency scaling (DVFS) to
minimize overall power.
Replacing a handful of standard parts with a single chip reduces total power consumption.
Reducing power consumption has a ripple effect on the rest of the system: a smaller, cheaper
power supply can be used; since less power consumption means less heat, a fan may no longer be
necessary; and a simpler cabinet with less electromagnetic shielding may be feasible, too.
c) Reduced cost
Reducing the number of components, the power supply requirements, cabinet costs, and
so on, will inevitably reduce system cost. The ripple effect of integration is such that the cost of a
system built from custom ICs can be less, even though the individual ICs cost more than the
standard parts they replace.
Electronic systems now perform a wide variety of tasks in daily life. Electronic systems
in some cases have replaced mechanisms that operated mechanically, hydraulically, or by other
means; electronics are usually smaller, more flexible, and easier to service. In other cases
electronic systems have created totally new applications. Electronic systems perform a variety of
tasks, some of them visible, some more hidden:
Electronic systems in cars operate stereo systems and displays; they also control fuel
injection systems, adjust suspensions to varying terrain, and perform the control functions
required for anti-lock braking systems. Digital electronics compress and decompress video, even
at high-definition data rates, on-the-fly in consumer electronics. Low-cost terminals for Web
browsing still require sophisticated electronics, despite their dedicated function. Personal
computers and workstations provide word-processing, financial analysis, and games. Computers
include both central processing units and special-purpose hardware for disk access, faster screen
display, etc.
The growing sophistication of applications continually pushes the design and
manufacturing of integrated circuits and electronic systems to new levels of complexity. Perhaps
the most amazing characteristic of this collection of systems is its variety: as systems become
more complex, we build not a few general-purpose computers but an ever wider range of
special-purpose systems.
4.7 ASIC
As feature sizes have shrunk and design tools improved over the years, the maximum
complexity (and hence functionality) possible in an ASIC has grown from 5,000 gates to over 100
million. Modern ASICs often include entire 32-bit processors, memory blocks including ROM,
RAM, EEPROM, Flash and other large building blocks. Such an ASIC is often termed a SOC
(system-on-a-chip). Designers of digital ASICs use a hardware description language (HDL), such
as Verilog or VHDL, to describe the functionality of ASICs.
Field-programmable gate arrays (FPGA) are the modern-day technology for building a
breadboard or prototype from standard parts; programmable logic blocks and programmable
interconnects allow the same FPGA to be used in many different applications. For smaller
designs and/or lower production volumes, FPGAs may be more cost effective than an ASIC
design even in production.
4.8 ASIC DESIGN FLOW
As with any other technical activity, development of an ASIC starts with an idea and
takes tangible shape through the stages of development. The first step in the process is to expand
the idea in terms of behavior of the target circuit. Through stages of programming, the same is
fully developed into a design description in terms of well defined standard constructs and
conventions.
The design is tested through a simulation process; the aim is to check, verify, and ensure
that what is described is what is wanted. Simulation is carried out through dedicated tools. With
every simulation run, the simulation results are studied to identify errors in the design
description. The errors are corrected and another simulation run is carried out. Simulation and
changes to the design description together form a cyclic, iterative process that is repeated until
an error-free design evolves.
4.9 CMOS TECHNOLOGY
The chips being designed today are predominantly made in CMOS (Complementary
Metal Oxide Semiconductor) technology, which uses both NMOS and PMOS transistors. To
understand CMOS better, we first need to know about the MOS transistor.
4.9.1 MOS Transistor
MOS stands for Metal Oxide Semiconductor; the MOS field-effect transistor is the basic
element in the design of a large-scale integrated circuit. It is a voltage-controlled device. These
transistors are formed as a "sandwich" consisting of a semiconductor layer (usually a slice, or
wafer, from a single crystal of silicon), a layer of silicon dioxide (the oxide), and a layer of
metal. These layers are patterned in a manner which permits transistors to be formed in the
semiconductor material (the "substrate"). The MOS transistor has three regions: source, drain,
and gate. The source and drain regions are quite similar and are labeled depending on what they
are connected to. The source is the terminal, or node, which acts as the source of charge carriers;
charge carriers leave the source and travel to the drain. In an N-channel MOSFET (NMOS), the
source is the more negative of the two terminals; in a P-channel device (PMOS), it is the more
positive. The area under the gate oxide is called the "channel". A figure of a MOS transistor is
shown below.
In CMOS, each net has only one driver, but that gate can drive many other gate inputs; in
CMOS technology, the output always drives another CMOS gate input. The charge carriers for
PMOS transistors are holes, and the charge carriers for NMOS transistors are electrons. The
mobility of electrons is about twice that of holes, so the output rise and fall times differ. To make
them equal, the W/L ratio of the PMOS transistor is made about twice that of the NMOS
transistor; this way, the PMOS and NMOS transistors have the same 'drive strength'. The
resistance is proportional to L/W, so increasing the width decreases the resistance.
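A worked form of this sizing rule (the numeric example is illustrative, not from this design):

```latex
R \propto \frac{L}{W\,\mu}, \qquad \mu_n \approx 2\,\mu_p
\;\;\Rightarrow\;\;
\left(\frac{W}{L}\right)_{PMOS} \approx 2\left(\frac{W}{L}\right)_{NMOS}
```

For example, if the NMOS device is sized 1 µm / 0.1 µm, the PMOS device is sized roughly 2 µm / 0.1 µm to obtain matched rise and fall times.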
4.9.2 Power Dissipation in CMOS ICs
The largest share of power dissipation in CMOS ICs is due to the charging and
discharging of capacitors, and the main issue in low-power CMOS IC design is reducing this
power dissipation. The main sources of power dissipation are:
1. Dynamic switching power - due to the charging and discharging of circuit capacitances. A low-
to-high output transition draws energy from the power supply, while a high-to-low transition
dissipates the energy stored on the output capacitance.
2. Short-circuit current - occurs when the rise/fall time at the input of a gate is larger than the
rise/fall time at its output, so that both transistors conduct briefly.
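The dynamic component is commonly estimated with the switching-power formula; the numbers in the example below are illustrative, not taken from this design:

```latex
P_{dyn} = \alpha\, C_L\, V_{DD}^{2}\, f
\qquad\text{e.g.}\quad
P_{dyn} = 0.1 \times 1\,\mathrm{nF} \times (1.2\,\mathrm{V})^{2} \times 500\,\mathrm{MHz} = 72\,\mathrm{mW}
```

Here \(\alpha\) is the switching activity factor, \(C_L\) the total switched capacitance, \(V_{DD}\) the supply voltage, and \(f\) the clock frequency; the quadratic dependence on \(V_{DD}\) is why voltage scaling is the most effective power-reduction lever.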
For any design to work at a specific speed, timing analysis has to be performed. We need
to check whether the design is meeting the speed requirement mentioned in the specification. This
is done by a Static Timing Analysis tool, for example PrimeTime.
4.10.1 Register Transfer Logic
RTL is expressed in Verilog or VHDL. This document covers the basics of Verilog.
Verilog is a Hardware Description Language (HDL): a language used to describe a digital
system, for example latches, flip-flops, and combinational and sequential elements. Basically,
you can use Verilog to describe any kind of digital system, and one can design a digital system
in Verilog using any level of abstraction. The most important levels are:
Behavior level - the system is described in terms of its algorithm and behaviour, without
reference to specific gates.
Gate level - the system is described in terms of gates (AND, OR, NOT, NAND, etc.). Signals
can have only four logic states ('0', '1', 'X', 'Z'). Gate-level design is normally not done by hand,
because the output of logic synthesis is a gate-level netlist.
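As a sketch of this gate-level style (a minimal, illustrative module, not part of the thesis design), a half adder can be described structurally with Verilog's built-in primitives:

```verilog
// Gate-level (structural) half adder, the netlist style that
// logic synthesis normally produces automatically.
module half_adder_gate (
    input  a, b,
    output sum, carry
);
    // Verilog built-in gate primitives; every net carries one of the
    // four logic states 0, 1, X (unknown) or Z (high impedance).
    xor g1 (sum,   a, b);  // sum   = a XOR b
    and g2 (carry, a, b);  // carry = a AND b
endmodule
```
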
4.10.2 Optimization
The circuit at the gate level, in terms of gates and flip-flops, can be redundant in nature.
It can be minimized with the help of minimization tools. The minimized logical design is
converted to a circuit in terms of switch-level cells from standard libraries provided by the
foundries. The cell-based design generated by the tool is the last step in the logical design
process.
4.11 FPGA DESIGN FLOW
The designer facing a design problem must go through a series of steps between initial
ideas and final hardware. This series of steps is commonly referred to as the 'design flow'. First,
after all the requirements have been spelled out, a proper digital design phase must be carried out.
It should be stressed that the tools supplied by the different FPGA vendors to target their chips do
not help the designer in this phase. They only enter the scene once the designer is ready to
translate a given design into working hardware.
The most common flow nowadays used in the design of FPGAs involves the following
subsequent phases:
Design entry - This step consists of transforming the design ideas into some form of
computerized representation. This is most commonly accomplished using Hardware Description
Languages (HDLs). The two most popular HDLs are Verilog and the Very High Speed Integrated
Circuit HDL (VHDL) [2]. It should be noted that an HDL, as its name implies, is only a tool to
describe a design that pre-existed in the mind, notes, and sketches of a designer. It is not a tool to
design electronic circuits. Another point to note is that HDLs differ from conventional software
programming languages in the sense that they don’t support the concept of sequential execution
of statements in the code. This is easy to understand if one considers the alternative schematic
representation of an HDL file: what one sees in the upper part of the schematic cannot be said to
happen before or after what one sees in the lower part.
Synthesis - The synthesis tool receives HDL and a choice of FPGA vendor and model. From
these two pieces of information, it generates a netlist which uses the primitives proposed by the
vendor in order to satisfy the logic behaviour specified in the HDL files. Most synthesis tools go
through additional steps such as logic optimization, register load balancing, and other techniques
to enhance timing performance, so the resulting netlist can be regarded as a very efficient
implementation of the HDL design.
Place and route - The placer takes the synthesized netlist and chooses a place for each of the
primitives inside the chip. The router’s task is then to interconnect all these primitives together
satisfying the timing constraints. The most obvious constraint for a design is the frequency of the
system clock, but there are more involved constraints one can impose on a design using the
software packages supported by the vendors.
Bit stream generation - FPGAs are typically configured at power-up time from some sort of
external permanent storage device, typically a flash memory. Once the place and route process is
finished, the resulting choices for the configuration of each programmable element in the FPGA
chip, be it logic or interconnect, must be stored in a file to program the flash. Of these four
phases, only the first one is human-labour intensive. Somebody has to type in the HDL code,
which can be tedious and error-prone for complicated designs involving, for example, lots of
digital signal processing. This is the reason for the appearance, in recent years, of alternative
flows which include a preliminary phase in which the user can draw blocks at a higher level of
abstraction and rely on the software tool for the generation of the HDL.
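As a sketch of what the design-entry phase produces, here is a minimal RTL description in Verilog (the module and its ports are hypothetical, chosen only for illustration):

```verilog
// Design entry: a small RTL description typed in by the designer.
// A 4-bit counter with synchronous, active-high reset.
module counter4 (
    input            clk,
    input            rst,
    output reg [3:0] count
);
    always @(posedge clk) begin
        if (rst)
            count <= 4'd0;          // reset state
        else
            count <= count + 4'd1;  // wraps from 15 back to 0
    end
endmodule
```

The synthesis, place-and-route, and bitstream phases described above then turn a description like this into FPGA configuration data without further hand work.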
The standard FPGA design flow starts with design entry using schematics or a hardware
description language (HDL), such as Verilog HDL or VHDL. In this step, you create the digital
circuit that is implemented inside the FPGA. The flow then proceeds through compilation,
simulation, programming, and verification in the FPGA hardware.
4.11.1 Introduction to High-Capacity FPDs
We first define the relevant terminology in the field and then describe the recent evolution of
FPDs. The three main categories of FPDs are delineated: Simple PLDs (SPLDs), Complex PLDs
(CPLDs), and Field-Programmable Gate Arrays (FPGAs).
The market for FPDs has grown dramatically over the past decade to the point where
there is now a wide assortment of devices to choose from. A designer today faces a daunting task
to research the different types of chips, understand what they can best be used for, choose a
particular manufacturer’s product, learn the intricacies of vendor-specific software and then
design the hardware. Confusion for designers is exacerbated by not only the sheer number of
field-programmable devices available, but also by the complexity of the more sophisticated
devices. The purpose of this section is to provide an overview of the architecture of the various
types of field-programmable devices. The emphasis is on devices with relatively high logic
capacity. Before proceeding, we provide definitions of the terminology in this field. This is
necessary because the technical jargon has become somewhat inconsistent over the past few years
as companies have attempted to compare and contrast their products in literature.
4.11.2 Definitions of Relevant Terminology
• Field-Programmable Device (FPD): A general term that refers to any type of integrated circuit
used for implementing digital hardware, where the chip can be configured by the end user to
realize different designs. Programming of such a device often involves placing the chip into a
special programming unit, but some chips can also be configured “in-system”. Another name for
FPDs is programmable logic devices (PLDs); although PLDs encompass the same types of chips
as FPDs, we prefer the term FPD.
• PLA: A Programmable Logic Array (PLA) is a relatively small FPD that contains two levels of
logic, an AND-plane and an OR-plane, where both levels are programmable.
• PAL: Programmable Array Logic (PAL) is a relatively small FPD that has a programmable
AND-plane followed by a fixed OR-plane.
• SPLD: It refers to any type of Simple PLD, usually either a PLA or PAL.
• CPLD: A more Complex PLD that consists of an arrangement of multiple SPLD-like blocks on
a single chip. Alternative names (that will not be used in this paper) sometimes adopted for this
style of chip are Enhanced PLD (EPLD), Super PAL, Mega PAL, and others.
• FPGA: A Field-Programmable Gate Array is an FPD featuring a general structure that allows
very high logic capacity. Whereas CPLDs feature logic resources with a wide number of inputs
(AND-planes), FPGAs offer narrower logic resources. FPGAs also offer a higher ratio of flip-
flops to logic resources than do CPLDs.
• Logic Block: A relatively small circuit block that is replicated in an array in an FPD. When a
circuit is implemented in an FPD, it is first decomposed into smaller sub-circuits that can each be
mapped into a logic block. The term logic block is mostly used in the context of FPGAs, but it
could also refer to a block of circuitry in a CPLD.
• Logic Capacity: The amount of digital logic that can be mapped into a single FPD. This is
usually measured in units of “equivalent number of gates in a traditional gate array”. In other
words, the capacity of an FPD is measured by the size of gate array that it is comparable to. In
simpler terms, logic capacity can be thought of as “number of 2-input NAND gates”.
4.12 BASIC FPGA ARCHITECTURE
The most common FPGA architecture consists of an array of configurable logic blocks
(CLBs), I/O pads, and routing channels. Generally, all the routing channels have the same width
(number of wires). Multiple I/O pads may fit into the height of one row or the width of one
column in the array.
An application circuit must be mapped into an FPGA with adequate resources. While the
number of CLBs and I/Os required is easily determined from the design, the number of routing
tracks needed may vary considerably even among designs with the same amount of logic. (For
example, a crossbar switch requires much more routing than a systolic array with the same gate
count.) Since unused routing tracks increase the cost (and decrease the performance) of the part
without providing any benefit, FPGA manufacturers try to provide just enough tracks so that most
designs that will fit in terms of LUTs and IOs can be routed. This is determined by estimates such
as those derived from Rent's rule or by experiments with existing designs.
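Rent's rule, mentioned above, estimates the number of external terminals T that a block of g gates needs; the constants below are typical illustrative values, not measurements:

```latex
T = t\,g^{p}
\qquad\text{e.g.}\quad
T = 4 \times (10{,}000)^{0.6} \approx 4 \times 251 \approx 1000 \text{ terminals}
```

Here \(t\) is the average number of terminals per gate and \(p\) (the Rent exponent, typically 0.5 to 0.75) captures how strongly a design's wiring demand grows with its size.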
4.12.1 FPGA Design and Programming
To define the behavior of the FPGA, the user provides a hardware description language
(HDL) description or a schematic design. The HDL form might be easier to work with when handling large
structures because it's possible to just specify them numerically rather than having to draw every
piece by hand. On the other hand, schematic entry can allow for easier visualization of a design.
Going from schematic/HDL source files to the actual configuration involves several
steps: the source files are fed to a software suite from the FPGA/CPLD vendor which, through a
number of steps, produces a configuration file. This file is then transferred to the FPGA/CPLD
via a serial interface or to an external memory device such as an EEPROM.
The most common HDLs are VHDL and Verilog, although in an attempt to reduce the
complexity of designing in HDLs, which have been compared to the equivalent of assembly
languages, there are moves to raise the abstraction level through the introduction of alternative
languages.
4.12.2 Advantages of HDLs to FPGA Devices
• Functional Simulation Early in the Design Flow: You can verify design functionality early in
the design flow by simulating the HDL description. Testing your design decisions before the
design is implemented at the Register Transfer Level (RTL) or gate level allows you to make any
necessary changes early on.
• Synthesis of HDL Code to Gates: Synthesizing your hardware description to target the FPGA
implementation decreases design time by allowing a higher-level design specification, rather
than requiring the design to be specified from FPGA base elements.
• Reduces the errors that can occur during a manual translation of a hardware description to a
schematic design.
• Allows you to apply the automation techniques used by the synthesis tool (such as machine
encoding styles and automatic I/O insertion) during optimization to the original HDL code. This
results in greater optimization and efficiency.
• Early Testing of Various Design Implementations: HDLs allow you to test different design
implementations early in the design flow. Use the synthesis tool to perform the logic synthesis
and optimization into gates. Additionally, Xilinx FPGA devices allow you to implement your
design at your computer. Since the synthesis time is short, you have more time to explore
different architectural possibilities at the Register Transfer Level (RTL). You can reprogram
Xilinx FPGA devices to test several design implementations.
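The functional-simulation step listed above can be sketched as a small self-checking Verilog testbench (the design under test here is a trivial inverter, chosen only for illustration; it is not the thesis design):

```verilog
// Design under test: a trivial combinational block.
module inverter (input a, output y);
    assign y = ~a;
endmodule

// Functional simulation: drive stimulus, observe the response.
module tb;
    reg  a;
    wire y;

    inverter dut (.a(a), .y(y));

    initial begin
        a = 1'b0; #10;
        $display("a=%b y=%b", a, y);   // expect y = 1
        a = 1'b1; #10;
        $display("a=%b y=%b", a, y);   // expect y = 0
        $finish;
    end
endmodule
```

Errors found at this stage cost far less to fix than errors found after place-and-route or in hardware.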
4.12.3 FPGA To ASIC Comparisons
There have been a small number of past attempts to quantify the gap between FPGAs and
ASICs, which we review here. One of the earliest statements quantifying the gap between
FPGAs and pre-fabricated media was by Brown. That work reported the logic density gap
between FPGAs and Mask-programmable Gate Arrays (MPGAs) to be between 8 to 12 times,
and the circuit performance gap to be approximately a factor of 3.
The basis for these numbers was a cursory comparison of the largest available gate counts
in each technology, and the anecdotal reports of the approximate operating frequencies in the two
technologies at the time. While the latter may have been reasonable, the former likely resulted
from optimistic gate counting in FPGAs. Later comparisons have sought to measure the gap
against standard-cell implementations rather than the less common MPGA.
Aside from its reliance on anecdotal evidence, that analysis is dated, since it does not
include the impact of hard dedicated circuit structures such as multipliers and block memories
that are now common. In this work, we address this issue by explicitly considering the
incremental impact of such blocks. More recently, a detailed comparison of FPGA and ASIC
implementations was performed.
They found that the delay of an FPGA lookup table (LUT) was approximately 12 to 14
times the delay of an ASIC gate. ASIC gate density was found to be approximately 45 times
greater than that possible in FPGAs when measured in terms of kilo-gates per square micron.
Finally, the dynamic power consumption of a LUT was found to be over 500 times greater than
the power of an ASIC gate. Both the density and the power consumption exhibited variability
across process generations but the cause of such variability was unclear. The main issue with this
work is that it also depends on the number of gates that can be implemented by a LUT.
A further study examined the area differences between FPGA and standard-cell designs,
implementing multiple circuits from eight different application domains, including areas such as
radar and image processing, on the Xilinx Virtex-II FPGA. Since the Xilinx Virtex-II is designed
in 0.15 µm CMOS technology, the area results were scaled up to allow direct comparison with
0.18 µm CMOS. Using this approach, the study found that the FPGA implementation is only 7.2
times larger on average than a standard-cell implementation.
The measurements of the gaps between FPGAs and ASICs described in the previous
section were generally based on simple estimates or single-point comparisons. To provide a more
systematic measurement, our approach is to implement a range of benchmark circuits in both
FPGAs and standard cells, with both designed using the same IC fabrication process geometry.
This comparison was performed using 90 nm CMOS technologies to implement a large set of
benchmarks; the FPGA device used is fabricated in TSMC's Nexsys 90 nm process.
4.13 VHDL & VERILOG
Both VHDL and Verilog are well-established hardware description languages. They have
the advantage that the user can define high-level algorithms and low-level optimizations (gate-
level and switch-level) in the same language. A basic example, the evaluation of the Fibonacci
series, is a good illustration of the points made above: such code is reasonably straightforward
for a software programmer to understand, provided that he or she understands that this is a truly
parallel language in which all lines execute "at once".
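A minimal VHDL sketch of such a Fibonacci evaluation (illustrative; it may differ from the thesis's original listing). The two register assignments inside the process take effect simultaneously at the clock edge, which is exactly the parallel behaviour described above:

```vhdl
-- Fibonacci series generator: on every rising clock edge the two
-- registers update in parallel, so cur <= nxt and nxt <= cur + nxt
-- both see the pre-edge values.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity fibonacci is
    port (clk, rst : in  std_logic;
          fib      : out unsigned(15 downto 0));
end entity;

architecture rtl of fibonacci is
    signal cur, nxt : unsigned(15 downto 0);
begin
    process (clk)
    begin
        if rising_edge(clk) then
            if rst = '1' then
                cur <= to_unsigned(0, 16);
                nxt <= to_unsigned(1, 16);
            else
                cur <= nxt;        -- both assignments take effect
                nxt <= cur + nxt;  -- simultaneously at the clock edge
            end if;
        end if;
    end process;
    fib <= cur;
end architecture;
```

A software programmer reading the process top to bottom as sequential code would compute the wrong series; the concurrent semantics are what make the two-register update work.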
5.1.1 XILINX ISE 13.1
Xilinx, Inc. is the world's largest supplier of programmable logic devices, the inventor of
the field-programmable gate array (FPGA), and the first semiconductor company with a fabless
manufacturing model.
Xilinx designs, develops and markets programmable logic products including integrated
circuits (ICs), software design tools, predefined system functions delivered as intellectual
property (IP) cores, design services, customer training, field engineering and technical support.
Xilinx sells both FPGA and CPLD programmable logic devices to electronic equipment
manufacturers in end markets such as communications, industrial, consumer, automotive, and data
processing. Xilinx's FPGAs have even been used for the ALICE (A Large Ion Collider
Experiment) at the CERN European laboratory on the French-Swiss border to map and
disentangle the trajectories of thousands of subatomic particles.
The Virtex-II Pro, Virtex-4, and Virtex-5 FPGA families are particularly focused on
system-on-chip (SoC) designers because some of their members include up to two embedded
IBM PowerPC cores.
The ISE Design Suite is the central electronic design automation (EDA) product family
sold by Xilinx. The ISE Design Suite features include design entry and synthesis supporting
Verilog or VHDL, place-and-route (PAR), completed verification and debug using ChipScope
Pro tools, and creation of the bit files that are used to configure the chip. XST (Xilinx Synthesis
Technology) performs device-specific synthesis for the CoolRunner XPLA3/-II and
XC9500/XL/XV families and generates an NGC file ready for the CPLD fitter.
Families
Five families are supported by XST for CPLD synthesis:
1. CoolRunner XPLA3
2. CoolRunner -II
3. XC9500
4. XC9500XL
5. XC9500XV
The synthesis for the CoolRunner, XC9500XL, and XC9500XV families includes clock
enable processing; you can allow or disallow use of the clock enable signal (when disallowed, it
is replaced by equivalent logic). Also, the selection of the macros which use the clock enable
(counters, for instance) depends on the family type: a counter with clock enable will be accepted
for the CoolRunner and XC9500XL/XV families, but rejected (replaced by equivalent logic) for
XC9500 devices. XST recognizes the following macro types:
1. Adders
2. Subtractors
3. Adders/subtractors
4. Multipliers
5. Comparators
6. Multiplexers
7. Counters
8. Logical shifters
9. Registers (flip-flops and latches)
10. XORs
The macro generation is controlled by the Macro Preserve option, which can take two
values: yes (macro generation is allowed) or no (macro generation is inhibited). The general
macro generation flow is the following:
1. HDL infers macros and submits them to the low-level synthesizer.
2. Low-level synthesizer accepts or rejects the macros depending on the resources required for
the macro implementations.
An accepted macro becomes a hierarchical block. For a rejected macro, two cases are possible:
1. If the hierarchy is kept (Keep Hierarchy: Yes), the macro becomes a hierarchical block.
2. If the hierarchy is not kept (Keep Hierarchy: No), the macro is merged with the surrounding logic.
Very small macros (2-bit adders, 4-bit multiplexers, shifters with a shift distance of less than 2) are always merged with the surrounding logic, independently of the Macro Preserve or Keep Hierarchy options, because the optimization process gives better results for larger components.
Improving Results
XST produces optimized netlists for the CPLD fitter, which fits them into the specified devices and creates the downloadable programming files. The CPLD low-level optimization in XST consists of logic minimization, subfunction collapsing, logic factorization, and logic decomposition. The result of the optimization process is an NGC netlist corresponding to Boolean equations, which will be reassembled by the CPLD fitter to best fit the macrocell capacities.
Constraints are essential to help you meet your design goals or obtain the best
implementation of your circuit. Constraints are available in XST to control various aspects of the
synthesis process itself, as well as placement and routing. Synthesis algorithms and heuristics
have been tuned to automatically provide optimal results in most situations. In some cases,
however, synthesis may fail to initially achieve optimal results; some of the available constraints
allow you to explore different synthesis alternatives to meet your specific needs.
Digital Clock Manager (DCM) blocks provide self-calibrating, fully digital solutions for
distributing, delaying, multiplying, dividing, and phase shifting clock signals. These elements are
organized as shown in Figure 5.1. A ring of IOBs surrounds a regular array of CLBs.
The XC3S50 has a single column of block RAM embedded in the array. Those devices
ranging from the XC3S200 to the XC3S2000 have two columns of block RAM. The XC3S4000
and XC3S5000 devices have four RAM columns. Each column is made up of several 18-Kbit
RAM blocks; each block is associated with a dedicated multiplier. The DCMs are positioned at
the ends of the outer block RAM columns. The Spartan-3 family features a rich network of traces
and switches that interconnect all five functional elements, transmitting signals among them.
Each functional element has an associated switch matrix that permits multiple connections to the
routing.
Configuration
After applying power, the configuration data is written to the FPGA using any of five
different modes: Master Parallel, Slave Parallel, Master Serial, Slave Serial, and Boundary Scan
(JTAG). The Master and Slave Parallel modes use an 8-bit-wide SelectMAP port.
The recommended memory for storing the configuration data is the low-cost Xilinx
Platform Flash PROM family, which includes the XCF00S PROMs for serial configuration and
the higher density XCF00P PROMs for parallel or serial configuration.
I/O Capabilities
The SelectIO feature of Spartan-3 devices supports 18 single-ended standards and 8
differential standards. Many standards support the DCI feature, which uses integrated
terminations to eliminate unwanted signal reflections.
Verilog HDL is one of the two most common hardware description languages (HDLs) used by integrated circuit (IC) designers; the other is VHDL. HDLs allow a design to be simulated earlier in the design cycle in order to correct errors or experiment with different architectures. Designs described in an HDL are technology-independent, easy to design and debug, and usually more readable than schematics, particularly for large circuits.
Verilog supports several levels of abstraction:
1. Algorithmic level (much like C code, with if, case, and loop statements).
2. Register transfer level (RTL, which uses registers connected by Boolean equations).
3. Gate level (interconnected AND, NOR, etc.).
4. Switch level (the switches are MOS transistors inside gates).
The language also defines constructs that can be used to control the input and output of simulation. More recently, Verilog has been used as an input for synthesis programs, which generate a gate-level description (a netlist) for the circuit. Some Verilog constructs are not synthesizable. Also, the way the code is written greatly affects the size and speed of the synthesized circuit. Most readers will want to synthesize their circuits, so non-synthesizable constructs should be used only in test benches; these are program modules used to generate the I/O needed to simulate the rest of the design. The words “not synthesizable” will be used to mark examples and constructs that do not synthesize.
The IEEE formed a standards working group to create the standard, and, in 1995, IEEE
1364-1995 became the official Verilog standard. It is important to note that for Verilog-1995, the
IEEE standards working group did not consider any enhancements to the Verilog language. The
goal was to standardize the Verilog language the way it was being used at that time. The IEEE
working group also decided not to create an entirely new document for the IEEE 1364 standard.
Instead, the OVI FrameMaker files were used to create the IEEE standard. Since the origin of the
OVI manual was Gateway’s Verilog-XL user’s manual, the IEEE 1364-1995 and IEEE 1364-
2001 Verilog language reference manuals are still organized somewhat like a user’s guide.
Goals for the Verilog Standard
Work on the IEEE 1364-2001 Verilog standard began in January 1997. Three major goals were established:
1. Enhance the Verilog language to help with today's deep-submicron and intellectual property modeling issues.
2. Ensure that all enhancements were both useful and practical, and that simulator and synthesis vendors would implement Verilog-2001 in their products.
3. Correct any errata or ambiguities in the IEEE 1364-1995 Verilog Language Reference Manual.
Many enhancements improve the ease and accuracy of writing synthesizable RTL models. Other enhancements allow models to be more scalable and re-usable. With the exception of the following paragraph, only changes which add new functionality or syntax are listed here. Verilog-2001 also contains many clarifications to Verilog-1995 which do not add new functionality. Notes are added to the sub-sections indicating Synopsys support with Presto and VCS at the time this paper was completed. Since the inception of Verilog in 1984, the term “register” has been used to describe the group of variable data types in the Verilog language. “Register” is not a keyword; it is simply a name for a class of data types, namely: reg, integer, time, real, and realtime. The use of the term “register” is often a source of confusion for new users of Verilog, who sometimes assume that the term implies a hardware register (flip-flops).
CHAPTER 6
IMPLEMENTATION OF PROPOSED METHOD
6.1 INTRODUCTION
In the first category, the inputs and outputs of the Montgomery modular multiplication
are represented in binary form, but intermediate results of modular multiplication are kept in
carry-save representation to avoid the carry propagation. However, the format conversion from
the carry-save representation of the final product into its binary representation must be performed
at the end of each modular multiplication. This conversion can be simply accomplished by adding
the carry and sum terms of the carry-save representation. However, this addition still suffers from long carry propagation, and extra circuitry and time are probably needed for these conversions. The second
category of approaches eliminates repeated interim output-to-input format conversions through
maintaining all inputs and outputs of the Montgomery modular multiplication in carry-save form
except the final step for getting the result of modular exponentiation. However, this implies that
the number of operands in modular multiplication must be increased so that additional registers to
store these operands are required. For example, one previous work proposed two variants of the Montgomery multiplication algorithm, which use a carry-save adder (CSA) to accomplish the modular exponentiation.
The first of these variants is based on a five-to-two CSA to avoid the repeated interim
output-to-input format conversion. To further decrease the number of input operands from five to
four, three input operands are selected and combined into the corresponding carry-save form at
the beginning of each modular multiplication. However, extra multiplexers and select signals are
necessary to choose the desired input operands for four-to-two CSA. Moreover, additional
registers are also required to store the combined input operands. Manochehri et al. proposed a
Montgomery multiplication algorithm using pipelined carry-save addition to shorten the critical
path delay of five-to-two CSA. Although a significant reduction in the hardware requirement and
the critical path can be achieved, the increased number of iterations probably results in lower
throughput when compared to previous approaches. Another work introduced a simple and fast algorithm for radix-2 Montgomery multiplication. Although an extra clock cycle is required, the performance and throughput of the five-to-two CSA can be appreciably improved. Shieh et al.
presented an efficient modular multiplication/exponentiation algorithm employing the CSA and
designed a new architecture of modular exponentiation with a unified modular
multiplication/square module to speed up the computation and reduce the hardware complexity.
They also proposed a new Montgomery modular multiplication algorithm for high-speed
hardware design. The corresponding Montgomery multiplier performs the partial product
accumulation and modular reduction in a pipelined fashion, so that the critical path delay is
reduced from the four-to-two to three-to-two carry-save addition at the expense of additional
pipeline registers to store the intermediate values.
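The carry-save arithmetic underlying all of these designs can be illustrated with a short Python sketch (a model of the arithmetic only, not the hardware): a three-to-two CSA compresses three operands into a sum word and a carry word whose ordinary sum equals the sum of the inputs, deferring the expensive carry-propagating addition to the final format conversion.

```python
def csa(x, y, z):
    """3:2 carry-save adder: compress three operands into a (sum, carry)
    pair such that sum + carry == x + y + z, with no carry propagation."""
    s = x ^ y ^ z                             # bitwise sum without carries
    c = ((x & y) | (y & z) | (x & z)) << 1    # majority bits become carries
    return s, c

# Intermediate results stay in carry-save form inside the add-shift loop;
# converting back to binary needs one ordinary (carry-propagating) addition.
s, c = csa(13, 7, 9)
assert s + c == 13 + 7 + 9
```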
The CSA structure can be combined with other techniques and architectures to further
improve the performance of Montgomery multipliers. However, these designs probably cause a
large increase in hardware complexity and power consumption which is undesirable for mobile
devices. In addition, the previously mentioned CSA-based Montgomery multipliers did not
consider the energy issue. Consequently, this paper focuses on reducing the energy consumption
and enhancing the performance of CSA-based Montgomery multipliers with only a slight area
overhead. Several previous works have developed techniques to reduce the power/energy consumption of Montgomery multipliers. One work designed a low-power Montgomery multiplier composed of ripple-carry adders by employing custom CMOS designs of several basic building blocks, including logic gates, a full adder, and a D flip-flop. In another, latches named glitch blockers are placed at the outputs of some circuit modules to reduce the spurious transitions and the expected switching activities of high fan-out signals in a radix-4 scalable Montgomery multiplier. In this paper, we attempt to reduce the energy consumption of CSAs and registers in CSA-based Montgomery multipliers via different techniques.
The goal is achieved by first modifying the CSA-based Montgomery algorithm to bypass
the iterations that perform superfluous carry-save addition and register write operations in the
add-shift loop. As a result, not only the addition and shift operations but also the number of clock
cycles required to complete the Montgomery multiplication can be largely decreased, leading to
significant energy saving and higher throughput. On the other hand, the well-known clock gating
technique is also employed to reduce the energy consumption of most registers in the CSA-based Montgomery multiplier, except for those registers (e.g., the registers in the BRFA) that must be right-shifted at each clock cycle. To achieve further energy reduction, we adjust the internal behaviour and structure of the BRFA so that the gated clock design technique can be applied to markedly decrease the energy consumption of the BRFA. Experimental results show that 36% energy saving and 19.7% cycle reduction can be achieved for the 1024-bit Montgomery multiplier by bypassing the superfluous operations. Additionally, applying clock gating to the registers and the proposed technique to the BRFA of the 1024-bit Montgomery multiplier leads to 24% more energy reduction.
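As a rough illustration of why bypassing helps, the Python sketch below counts the iterations of the radix-2 add-shift loop that degenerate into a pure right shift. The bypass condition used here (the current bit of A is 0 and the partial result is even) is a simplification for illustration, not the exact rule of the proposed design.

```python
def count_bypassable_iterations(A, B, N, k):
    """Rough model of superfluous-operation bypassing: when the i-th bit
    of A is 0 and the partial result is even, the iteration reduces to a
    pure right shift, so the carry-save addition and register writes can
    be skipped (simplified condition, for illustration only)."""
    S, skipped = 0, 0
    for i in range(k):
        Ai = (A >> i) & 1
        qi = (S + Ai * B) & 1
        if Ai == 0 and qi == 0:
            skipped += 1                  # bypass: S[i+1] = S[i] / 2
        S = (S + Ai * B + qi * N) >> 1
    return skipped

# With A = 0 every iteration degenerates into a shift and is bypassable
assert count_bypassable_iterations(0, 5, 239, 8) == 8
```

The more often the bypass condition occurs, the fewer carry-save additions and register writes are performed, which is the source of the energy and cycle savings.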
6.2 Algorithm for Montgomery Multiplier:
Algorithm MM: Radix-2 Montgomery Multiplication
Inputs: A, B, N (modulus)
Output: S[k]
1. S[0] = 0;
2. for i = 0 to k − 1 {
3. qi = (S[i]0 + Ai × B0) mod 2;
4. S[i+1] = (S[i] + Ai × B + qi × N) / 2;
5. }
6. if (S[k] ≥ N) S[k] = S[k] − N;
7. return S[k];
Let the modulus N be a k-bit odd number, where 2^(k−1) ≤ N < 2^k, and let an extra factor R be defined as 2^k mod N. Given two integers a and b, where a, b < N, the N-residues of a and b with respect to R can be defined as

A = a × R (mod N), B = b × R (mod N). (1)

Based on (1), the Montgomery modular product Y of A and B can be obtained as Y = A × B × R^(−1) (mod N), where R^(−1) is the inverse of R modulo N, i.e., R × R^(−1) = 1 (mod N). The radix-2 version of the Montgomery modular multiplication algorithm, denoted as Algorithm MM, which calculates the Montgomery modular product of A and B, is shown above. Note that the notation Xi in Algorithm MM denotes the i-th bit of X in binary representation. Moreover, the notation Xi:j indicates a segment of X from the i-th bit to the j-th bit.
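Algorithm MM above can be sketched directly in Python (a bit-level model for checking the arithmetic, not the hardware):

```python
def montgomery_mm(A, B, N, k):
    """Radix-2 Montgomery multiplication (Algorithm MM): returns
    A * B * 2^(-k) mod N, assuming N is a k-bit odd modulus and A, B < N."""
    S = 0
    for i in range(k):
        Ai = (A >> i) & 1                 # i-th bit of A
        qi = (S + Ai * B) & 1             # parity: add N iff the sum is odd
        S = (S + Ai * B + qi * N) >> 1    # numerator is even, shift is exact
    if S >= N:                            # one conditional final subtraction
        S -= N
    return S

# Check the defining property: with N-residue inputs aR mod N and bR mod N,
# the Montgomery product is abR mod N.
N, k = 239, 8                             # 2^(k-1) <= N < 2^k, N odd
R = 1 << k
a, b = 123, 45
A, B = a * R % N, b * R % N
assert montgomery_mm(A, B, N, k) == a * b * R % N
```

The assertion verifies that the R^(−1) factor introduced by the k halving steps exactly cancels one of the two R factors carried by the N-residue operands.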
Fig.6.2 BRFA
7.1 GENERAL
VHDL and Verilog are frequently used for two different goals: simulation of electronic designs and synthesis of such designs. Synthesis is a process whereby a VHDL or Verilog description is compiled and mapped into an implementation technology such as an FPGA or an ASIC. Many FPGA vendors offer free tools to synthesize VHDL and Verilog for use with their chips, whereas ASIC tools are often very expensive. Not all constructs in VHDL and Verilog are suitable for synthesis. For example, most constructs that explicitly deal with timing are not synthesizable despite being valid for simulation. While different synthesis tools have different capabilities, there exists a common synthesizable subset of VHDL and Verilog that defines what language constructs and idioms map into common hardware for many synthesis tools.
8.1 GENERAL
A snapshot captures each moment of the application while it is running. It gives a clear, elaborated view of the application and helps a new user understand the subsequent steps.
Fig.8.3 BRFA
Fig.8.4 MBFRA
Fig.8.5 MM42
Fig.8.6 MMM42 CSA
Fig.8.7 MMM42
For MMM42:
Delay: 17.915ns
9.1 GENERAL
The proposed multiplier is mainly used in data communication networks to reduce power consumption, and wherever security is necessary for protecting sensitive data, for example in smartphones, notebook computers with Internet access, and official websites.
FUTURE SCOPE
We synthesized these multipliers using the Synopsys Design Compiler and then performed power simulation using Synopsys PrimePower with random input patterns. The
implementation results, including the hardware area (Area), the critical path delay (Delay), the
power consumption (Power), the clock cycle number (Cycle) required to complete the operations,
the energy consumption (Energy), and the throughput rate of these modular multipliers are given
in Table III, where throughput rate is formulated as the bit length multiplied by the frequency (the
reciprocal of delay) and then divided by the clock cycle number. Furthermore, P(−) and E(−)
denote the power and energy decrements when compared with the MM42 multiplier. The results
show that the proposed approach can also effectively reduce the energy consumption and enhance
the throughput of modular exponentiation.
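For concreteness, the throughput formulation above can be written as a small Python helper; the numbers in the example are hypothetical and are not taken from Table III.

```python
def throughput_mbps(bit_length, delay_ns, cycles):
    """Throughput rate = bit length * frequency / cycle count, where the
    frequency is the reciprocal of the critical path delay."""
    freq_hz = 1.0 / (delay_ns * 1e-9)
    return bit_length * freq_hz / cycles / 1e6   # in Mb/s

# Hypothetical 1024-bit multiplier: 5 ns critical path, 1050 clock cycles
rate = throughput_mbps(1024, 5.0, 1050)
assert abs(rate - 195.05) < 0.01
```

The helper makes clear why reducing the cycle count (as the bypassing technique does) directly raises throughput even when the critical path delay is unchanged.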
CONCLUSION
High-speed Montgomery modular multipliers, which speed up the decryption/encryption process by maintaining all inputs and outputs of the modular multiplication in a redundant carry-save format, introduce more registers and higher energy consumption.
This paper presented an efficient algorithm and its corresponding architecture to reduce the
energy consumption and enhance the throughput of Montgomery modular multipliers
simultaneously. Moreover, we modified the structure of BRFA and adopted the gated clock
design technique to further reduce the energy consumption of Montgomery modular multipliers.
Experimental results showed that the proposed approaches are indeed capable of reducing the
energy consumption of the Montgomery multipliers. In the future, we will try to increase the probability of superfluous operation bypassing to further reduce the energy
consumption and enhance the throughput of modular multiplication.
REFERENCES
[1] R. L. Rivest, A. Shamir, and L. Adleman, “A method for obtaining digital signatures and public-key cryptosystems,” Commun. ACM, vol. 21, no. 2, pp. 120–126, Feb. 1978.
[2] P. L. Montgomery, “Modular multiplication without trial division,” Math. Comput., vol. 44, no. 170,
pp. 519–521, Apr. 1985.
[3] C. K. Koc, T. Acar, and B. S. Kaliski, “Analyzing and comparing Montgomery multiplication algorithms,” IEEE Micro, vol. 16, no. 3, pp. 26–33, Jun. 1996.
[4] Y. S. Kim, W. S. Kang, and J. R. Choi, “Implementation of 1024-bit modular processor for RSA
cryptosystem,” in Proc. IEEE Asia-Pacific Conf., Aug. 2000, pp. 187–190.
[5] V. Bunimov, M. Schimmler, and B. Tolg, “A complexity-effective version of Montgomery’s
algorithm,” in Proc. Workshop Complexity Effect. Designs, May 2002, pp. 1–7.
[6] A. Cilardo, A. Mazzeo, N. Mazzocca, and L. Romano, “A novel unified architecture for public-key
cryptography,” in Proc. Design, Autom. Test Eur. Conf. Exhibit., Mar. 2005, pp. 52–57.
[7] Z. B. Hu, R. M. A. Shboul, and V. P. Shirochin, “An efficient architecture of 1024-bits
Cryptoprocessor for RSA cryptosystem based on modified Montgomery’s algorithm,” in Proc. 4th
IEEE Int. Workshop Intell. Data Acquisit. Adv. Comput. Syst., Sep. 2007, pp. 643–646.
[8] C. McIvor, M. McLoone, and J. V. McCanny, “Modified Montgomery modular multiplication and
RSA exponentiation techniques,” IEE Proc.-Comput. Digit. Tech., vol. 151, no. 6, pp. 402–408, Nov.
2004.
[9] K. Manochehri and S. Pourmozafari, “Fast Montgomery modular multiplication by pipelined CSA
architecture,” in Proc. IEEE Int. Conf. Microelectron., Dec. 2004, pp. 144–147.
[10] K. Manochehri and S. Pourmozafari, “Modified radix-2 Montgomery modular multiplication to make
it faster and simpler,” in Proc. IEEE Int. Conf. Inf. Technol., vol. 1. Apr. 2005, pp. 598–602.
[11] M.-D. Shieh, J.-H. Chen, H.-H. Wu, and W.-C. Lin, “A new modular exponentiation architecture for
efficient design of RSA cryptosystem,”
IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 9, pp. 1151–1161, Sep. 2008.
[12] M.-D. Shieh, J.-H. Chen, W.-C. Lin, and H.-H. Wu, “A new algorithm for high-speed modular
multiplication design,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 56, no. 9, pp. 2009–2019, Sep.
2009.
[13] F. Gang, “Design of modular multiplier based on improved Montgomery algorithm and systolic array,”
in Proc. 1st Int. Multi-Symp. Comput. Comput. Sci., vol. 2. 2006, pp. 356–359.
[14] S. S. Ghoreishi, M. A. Pourmina, H. Bozorgi, and M. Dousti, “High speed RSA implementation based
on modified Booth’s technique and Montgomery’s multiplication for FPGA platform,” in Proc. 2nd
Int. Conf. Adv. Circuits, Electron. Micro-Electron., 2009, pp. 86–93.
[15] G. Sassaw, C. J. Jimenez, and M. Valencia, “High radix implementation of Montgomery multipliers
with CSA,” in Proc. Int. Conf. Microelectron., 2010, pp. 315–318.
[16] J. C. Neto, A. F. Tenca, and W. V. Ruggiero, “A parallel k-partition method to perform Montgomery
multiplication,” in Proc. IEEE Int. Conf. Appl.-Specif. Syst., Arch. Process., Sep. 2011, pp. 251–254.
[17] A. Cilardo, A. Mazzeo, L. Romano, and G. P. Saggese, “Exploring the design-space for FPGA-based
implementation of RSA,” Microprocess. Microsyst., vol. 28, no. 4, pp. 183–191, May 2004.
[18] D. Bayhan, S. B. Ors, and G. Saldamli, “Analyzing and comparing the Montgomery multiplication
algorithms for their power consumption,” in
Proc. Int. Conf. Comput. Eng. Syst., Nov. 2010, pp. 257–261.
[19] X. Wang, P. Noel, and T. Kwasniewski, “Low power design techniques for a Montgomery modular
multiplier,” in Proc. Int. Symp. Intell. Signal Process. Commun. Syst., 2005, pp. 449–452.
[20] H.-K. Son and S.-G. Oh, “Design and implementation of scalable low-power Montgomery multiplier,”
in Proc. IEEE Int. Conf. Comput. Design, 2004, pp. 524–531.
[21] R. Bhutada and Y. Manoli, “Complex clock gating with integrated clock gating cell,” in Proc. Int.
Conf. Design Technol. Integr. Syst. Nanoscale Era, Sep. 2007, pp. 164–169.
[22] D. R. Sulaiman, “Using clock gating technology for energy reduction in portable computers,” in Proc.
Int. Conf. Comput. Commun. Eng., May 2008, pp. 839–842.
[23] J. Chao, Y. Zhao, Z. Wang, S. Mai, and C. Zhang, “Low-power implementations of DSP through
operand isolation and clock gating,” in Proc. Int. Conf. ASIC, Oct. 2007, pp. 229–232.
[24] C. D. Walter, “Montgomery exponentiation needs no final subtractions,” Electron. Lett., vol. 35, no.
21, pp. 1831–1832, Oct. 1999.
[25] J. Ohban, V. G. Moshnyaga, and K. Inoue, “Multiplier energy reduction through bypassing of partial
products,” in Proc. Asia-Pacif. Conf. Circuits Syst., vol. 2. Oct. 2002, pp. 13–17.
[26] J. C. Neto, A. F. Tenca, and W. V. Ruggiero, “Toward an efficient implementation of sequential
Montgomery Multiplication,” in Proc. Asilomar Conf. Signals, Syst. Comput., Nov. 2010, pp. 1680–
1684.
[27] Y.-Y. Zhang, Z. Li, L. Yang, and S.-W. Zhang, “An efficient CSA architecture for Montgomery
modular multiplication,” Microprocess. Microsyst., vol. 31, no. 7, pp. 456–459, Nov. 2007.
[28] A. P. Fournaris and O. Koufopavlou, “A new RSA encryption architecture and hardware
implementation based on optimized Montgomery multiplication,” in Proc. IEEE Int. Symp. Circuits
Syst., May 2005, pp. 4645–4648.
[29] TSMC 0.13-μm (CL013G) Process 1.2-Volt SAGE-XTM Standard Cell Library Databook, Artisan
Components, Sunnyvale, CA, Jan. 2004.
[30] CIC Referenced Flow for Cell-Based IC Design, National Chip Implementation Center, Hsinchu,
Taiwan, 2008.
[31] G. Ramanjaneya Reddy and P. Harinatha Reddy, “Low power & efficient multiplier for RSA cryptosystems,” International Journal of Scientific Engineering & Technology Research, vol. 3, no. 20, pp. 4333–4339, Sep. 2014, ISSN 2319-8885.