
Backend.

Abstract A physical design flow consists of producing a production-worthy layout from a gate-level netlist subject to a set of constraints. This chapter focuses on the problems imposed by shrinking process technologies, and their solutions in the context of design flows, with an emphasis on the complex interactions between logic optimization, placement, and routing. The chapter exposes the problems of timing and design closure, signal integrity, design variable dependencies, clock and power/ground routing, and design signoff. It also surveys logical and physical design flows, and describes a refinement-based flow.

7.1. INTRODUCTION

Creating a chip layout worthy of fabrication is achieved through a sequence of steps known as a design flow. Since the advent of logic synthesis in the mid-80s, design flows have been characterized by a clear separation between the logical and the physical domains. Logic synthesis produces a gate-level netlist from an abstract specification (behavioral, RTL). Place and route produces the layout from the gate-level netlist and technology files. Specific tools have been used to create the clock tree and the power/ground (P/G) routing, and to achieve final signoff. Reconsidering the logical aspects during the physical design phase was unnecessary, because timing signoff could be done at the gate level, and signal integrity was rarely an issue. A flow consisting of logic synthesis followed by place-and-route cannot work with deep submicron (DSM). At 0.18 μm and beyond, interconnect becomes a dominant factor and breaks the dichotomy between the logical and physical domains. First, we can no longer neglect the impact of the interconnect on timing, because interconnect dominates gate delays. Second, increasing design complexity makes managing congestion more challenging. Third, finer geometry, higher clock frequency, and lower voltage produce signal integrity problems. Logical and physical aspects are tied together and can no longer be looked at separately: there is a need for design tools and design flows that enable the simultaneous optimization of the logical and physical aspects of a design. This chapter discusses logical and physical design from a flow perspective. Section 7.2 explains why the logical and physical variables are interdependent, and why they must be considered together during the physical implementation phase. Section 7.3 surveys current design flows. Section 7.4 first discusses the uncertainty/refinement principle, then presents the details of a refinement-based design flow, a promising approach to solve the problems brought by DSM. The chapter concludes with some thoughts on a full-chip hierarchical design flow.

7.2. LOGICAL AND PHYSICAL DESIGN CHALLENGES

VLSI physical design is the process of producing a GDSII (a mask description of the complete layout that will be fabricated) from a gate-level description while satisfying a set of constraints. The gate-level netlist is a logical description of the circuit in terms of cells (e.g., I/O pads, RAMs, IP blocks, logic gates, flip-flops) and their interconnections. The constraints include timing (setup, hold, slope, min/max path delays, multicycle paths, false paths), area, place keepouts (regions of the die where cells cannot be placed), route keepouts (regions of the die through which nets cannot be routed), power consumption, and technology constraints (e.g., maximum output load, electromigration rules). This section presents the challenges that must be addressed by a physical design flow intended for DSM designs.
7.2.1. TIMING CLOSURE AND DESIGN SIGNOFF

Timing closure is the problem of achieving the timing constraints of a design. Until the mid-90s, most of the delay was in the gates, and the net capacitance was a relatively small contributor to the total output load of the gates. Thus the impact of nets on timing could be disregarded. Consequently, timing signoff could be done with static timing analysis on a gate-level netlist, right after synthesis (Fig. 7.1, left). From there, the netlist could be handed off to physical implementation. As process geometries scale down, interconnect becomes the predominant factor in determining the delay. First, the gate delay depends

Logical and Physical Design: A Flow Perspective

[Figure 7.1: two flow diagrams spanning the logical and physical domains. Pre-DSM flow: system design, then synthesis (timing library, custom WLM), then netlist signoff, followed by placement, clock tree, routing, extraction, and GDSII. Today's flow: system design, then synthesis (timing library, statistical WLM), followed by placement, clock tree, routing, extraction, and GDSII.]

Figure 7.1. Flows until mid-1990s.

mostly on the output capacitance it drives, of which the net capacitance becomes the largest contributor. Second, the delay of long nets, which depends on their capacitance and resistance, becomes larger than gate delays. Timing signoff is the process of guaranteeing that the timing constraints will be met. Post-synthesis signoff was possible when interconnect contributed less than 20% of the total output capacitance. Now that net capacitance is becoming more dominant with every new process generation, accurate timing estimation without knowing net topologies and placements is not possible. Since timing cannot be realistically estimated from a netlist alone, it is not practical to think in terms of a post-synthesis netlist signoff. One could attempt to account for interconnect delays by using a statistical wire load model (WLM) based on pin count. The problem is that timing is determined by a maximum path delay, and although such a wire load model is actually a good predictor of the average wire load, the deviation is so large that timing becomes grossly mispredicted. Moreover, coupling capacitance becomes more dominant over inter-layer capacitance with every new process technology, because the wires are getting much closer to each other and the interconnect aspect ratio of height to width is increasing (Fig. 7.2). In 1999, the coupling capacitance Cc (the lateral coupling capacitance between interconnects on the same layer) was about three times larger than the inter-layer capacitance Cs (the capacitance due to overlapping of interconnect between different layers). The ratio is projected to increase to five in 2006 (Fig. 7.2). This means that the capacitance of a net cannot be determined without knowing both its route and that of its neighbors.

[Figure 7.2: the Cc/Cs ratio (vertical axis, 0 to 8) plotted against process generations 1997, 2001, 2006, 2009, and 2012.]

Figure 7.2. Coupling capacitance dominates inter-layer capacitance.

More generally, what is needed is a design signo solution that will validate all the design variables, i.e.,
not only timing, but also area, congestion, power, and signal integrity.

7.2.2. SIGNAL INTEGRITY

Deep submicron results in a number of physical effects that must be controlled to ensure signal integrity and correct functioning of the chip. For instance, since the inter-wire coupling capacitance dominates the inter-layer capacitance, crosstalk can no longer be neglected, because it has a primary effect on timing and induces signal noise. Other physical aspects, such as the antenna effect, electromigration, self-heating, and IR-drop, need to be analyzed and controlled. The next technology generation will need even more detailed analysis, including inductance of chip and packaging, on-chip decoupling, resonance frequency analysis, and soft error correction. Signal integrity problems must be identified as early as possible, because it is very costly and time consuming, if not impossible, to fix them during the final implementation stages. The deep submicron era is causing a shift from timing closure to design closure, where every design variable (timing, area, power, congestion, signal integrity) needs to be monitored and optimized simultaneously when implementing the chip. This era marks the end of the logical and physical dichotomy.


7.2.3. DESIGN VARIABLE INTERDEPENDENCIES

Some interdependencies are well known: overall area increases when timing is tighter, and power increases as area increases. In the DSM era, more complex interdependencies are evolving. The timing/placement interdependency is the most obvious. Without placement, the interconnect information is insufficient to evaluate the timing. Also, every placement change will impact timing. Similarly, a delay decrease or increase will enable or disable placement changes. With smaller geometry features, the complexity and the physical resources of the interconnect make congestion (i.e., routing resource requirements) a key factor in the design closure process. Controlling congestion is a challenging problem because of the complex timing/placement/congestion interaction. For example, shortening nets by moving gates and/or straightening routes can help in reducing the delay, but this pulls gates and/or nets closer, creating congestion problems. Conversely, high congestion spots can be relieved by downsizing and/or spreading gates around, or by rerouting nets, which can create new timing problems. Also, long wires that go across the chip cannot be designed without a placement, and the placement itself should know about the global routing of the long wires to account for their contribution to congestion. In addition, more local dependencies are created with area, power, and signal integrity (e.g., crosstalk affects timing). A physical design system with an open cost function is required to actually address all these complex interdependencies.

7.2.4. CLOCK AND P/G ROUTING

In a traditional design flow, clock tree synthesis and power routing are separate tasks. Typically, the clock tree is built once the locations of the sequential elements and gated clocks are known, i.e., after placement. But designing clock trees and P/G networks after placement creates routing problems that occur too late to be fixed. Thus placement must know about clock and P/G routing so that it can account for their contribution to congestion. Also, the clock tree is becoming denser, spanning a larger chip with a tighter clock period, thus requiring good control of the skew. Fine control of the skew should be used to optimize timing. The same principle applies to scan chains. The impact in terms of congestion is much smaller, but it is important to re-order the scan chains so that the placement is not overconstrained.

7.2.5. CAPACITY

Designs keep getting bigger and more complex, containing tens of millions of gates. Two capacity problems are emerging. The first is a pure scalability problem: the raw number of objects that need to be managed and processed is stressing memory limits and computational resources (only linear or n log n algorithms can reasonably be applied to a complete netlist). The second problem is the overall complexity of a chip design, which is often divided into several blocks, with several independent design teams working in parallel, each at a different pace. This problem is inherent to project management, meaning that few very large chips are actually designed flat: the trend is towards more hierarchical designs. Hierarchical flows pose new problems for physical design, e.g., handling timing constraints between the chip level and the block level, verifying the feasibility of the constraints at the chip level, capturing the chip context when implementing a block, hierarchical chip verification, etc. There are no tools to help the designer develop a top-down timing budget for the blocks, and no tools to model the interaction (e.g., timing and congestion) between the blocks as they are designed bottom-up. Floorplanning is still a difficult problem, especially when it must be done with timing and congestion considerations. There are several research efforts to enable complete hierarchical physical flows, but, for lack of space, we do not address the hierarchy problem and instead focus entirely on flat physical design.

7.3. SURVEY OF CURRENT DESIGN FLOWS

Several solutions have been proposed to solve DSM challenges and to achieve design closure. The first two flows we describe in this section are stand-alone flows, in the sense that they do not require new technology on top of a classical sequential flow made of synthesis followed by place-and-route. The last two flows are attempts at integrating synthesis and place-and-route more tightly.

Iterative flow with custom wire load model. This flow iterates between gate-level synthesis and place-and-route. After place-and-route, if the constraints are not met, the netlist is back-annotated with the actual wire loads and re-synthesized (Fig. 7.1, right). Signal integrity problems are usually handled at the detailed routing level. Place and route data are necessary to determine the timing. Trying to compensate for the lack of these data by driving synthesis with pessimistic assumptions results in overdesign. Extracting net capacitances
after place and route and feeding them back to synthesis, in an attempt to seed synthesis with more realistic capacitance estimations, is not practical. It often results in a different netlist, with a distinct placement and routing, thus producing different wire loads and timing. There is no guarantee that this iterative process will eventually converge. Expecting the backend tools to fix problems that are recognized late in the flow is unrealistic, because the latitude of what can be done at this level is limited. This solution is ineffective for technologies at or below 0.18 μm.

Block-assembly flow. This flow consists of designing small blocks independently, then assembling them. The idea is that a statistical wire load model is sufficiently accurate for small blocks of logic (a limit of 50k cells has been proposed [32]). The netlist is divided into blocks such that the intra-block interconnect delay can be neglected or roughly estimated, which enables synthesis to predict the overall delay. Then the blocks are assembled. There are several problems here. First, dividing a chip into blocks requires time budgeting, and there is no scientific method to come up with an accurate budget that can be met both at the block level and at the chip level. This results in a sub-optimal or infeasible netlist. Second, assembly must respect the physical boundaries of the blocks to ensure that the intra-block delays are preserved. This significantly constrains the placement, such that timing and/or congestion problems cannot be addressed easily. Third, it is virtually impossible to estimate the inter-block delays, since long interconnects depend on the relative placement of the blocks. Fourth, statistical wire load models may still fail to predict timing accurately, even for small blocks, due to routing congestion. If routes in the block are forced to meander in congested areas, the net capacitances increase substantially. Because congestion impacts wire load model predictability, the overall connectivity of the netlist must be considered, which makes it impossible to look at a block alone without understanding the chip-level context in which it is placed. Overall, this is essentially a trial-and-error approach, forcing many iterations. Furthermore, this approach does not even address the problem of signal integrity.

Constant delay synthesis flow. The delay through a logic stage (i.e., a gate and the net it drives) is expressed as a linear function of the gain, which is the ratio of the capacitance driven by the gate to its input pin capacitance. The first step in the constant delay synthesis flow consists of assigning a fixed delay (i.e., a fixed gain) to every logical stage so that timing constraints are met. The second step consists of placing and routing the netlist while preserving these gains. Constant delay synthesis is attractive because of its elegance and simplicity, thus enabling fast synthesis. RTL-to-gate synthesis has proven to be a good application of constant delay synthesis. However, this elegance is obtained at the cost of ignoring the physical aspects. Delay models must be input-slope dependent, and must distinguish between rising and falling signals. The propagation of a waveform in an interconnect cannot be captured by such a simple model. The gains can be kept constant by adapting the net capacitances and the gate sizes within limited ranges. In practice, this means that the netlist must be changed to accommodate the placement/route/timing constraints (e.g., re-buffering), which means that delays must be re-assigned. Constant delay synthesis assumes continuous sizing. Mapping the resulting solution onto a real discrete library can result in a sub-optimal netlist. Also, constant delay synthesis relies on convexity assumptions (see Section 7.4.3.1) that are often not true for realistic libraries.
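The gain-based model underlying this flow can be sketched as follows. This is a minimal illustration in the spirit of logical effort (stage delay = parasitic + effort × gain, with gain = output load over input capacitance); all function names and numbers are invented for the example, not taken from any real library or tool.

```python
# Sketch of the gain-based delay model behind constant delay synthesis.
# The delay of a logic stage is linear in its gain h = C_load / C_in;
# fixing the gain therefore fixes the stage delay, and the gate size
# (its input capacitance) follows from the load it must drive.

def stage_delay(parasitic: float, logical_effort: float, gain: float) -> float:
    """Delay of one stage, in units of the technology time constant tau."""
    return parasitic + logical_effort * gain

def size_for_gain(c_load: float, gain: float) -> float:
    """With the gain fixed, the required input capacitance of the stage."""
    return c_load / gain

# Fix a gain of 4 for every stage of a 3-stage inverter chain driving 64 fF.
gain = 4.0
c_load = 64.0
sizes = []
for _ in range(3):
    c_in = size_for_gain(c_load, gain)   # required input cap of this stage
    sizes.append(c_in)
    c_load = c_in                        # the previous stage drives this input
sizes.reverse()                          # order from first stage to last

delay = sum(stage_delay(1.0, 1.0, gain) for _ in range(3))
print(sizes)   # [1.0, 4.0, 16.0]
print(delay)   # 15.0
```

Note how re-buffering or a placement-induced load change would alter c_load, and hence the sizes, which is exactly why the gains cannot stay constant without re-assigning delays.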

Placement-aware synthesis flow. This kind of flow is a general trend, because synthesis needs placement information to estimate the timing [26, 31]. But the meaning of "aware" is often fuzzy. Gluing synthesis and placement together does not help if routing information is not sufficient, or if the interaction between synthesis and placement is not under control. For example, synthesis may locally increase the area to fix a timing problem, thus creating an overfilled area, to which placement will react by spreading gates around, which will create other timing problems. Finding a strategy that leads to convergence is a major problem. Moreover, synthesis and placement working together is clearly not sufficient if one cannot account for congestion and signal integrity, which requires an understanding of routing and physical aspects. In other words, this approach does not go far enough in the integration of the logical and physical domains.

7.4. REFINEMENT-BASED FLOW

As stated in Section 7.2, design closure requires that all design variables be considered together (timing, area, power, congestion, clock tree synthesis, P/G routing, scan chain reordering, signal integrity). This section introduces a novel physical design flow based on the concept of refinement. It first discusses the uncertainty/refinement principle. It then describes how to apply this principle during placement, logic optimization, clock tree synthesis, P/G routing, etc. Finally, this section
explains how placement, logic optimization, and global routing interact, and how to achieve design signoff.

7.4.1. THE UNCERTAINTY/REFINEMENT PRINCIPLE

For any variable x, let x̂ be its estimate. Assume that we want to minimize a function x(p) by finding an optimum value of p. Consider the following remarks:

A parameter x cannot be optimized beyond the range within which it can be estimated, i.e., an optimization procedure which computes the parameter p producing the best value x(p) is worthless if these values are within the uncertainty of x̂.

The optimization of x(p) should be done with a resolution of p that cannot exceed the uncertainty of p, e.g., if we know p only to within a radius r (i.e., |p − p̂| ≤ r), then we should look at x(p) by 2r increments of p.
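This principle can be illustrated with a toy coarse-to-fine search in which the sampling step of p tracks the current uncertainty radius r, and both are halved at every refinement round. The cost function, bounds, and constants below are invented purely for illustration.

```python
# Illustrative sketch of the uncertainty/refinement principle: the
# resolution at which p is optimized is tied to the uncertainty radius r,
# and both are refined together, round after round.

def cost(p: float) -> float:
    """A stand-in for x(p); the true (unknown) optimum is at p = 3.1."""
    return (p - 3.1) ** 2

def refine(lo: float, hi: float, rounds: int) -> float:
    best = (lo + hi) / 2.0
    r = (hi - lo) / 2.0                 # current uncertainty radius on p
    for _ in range(rounds):
        # Per the principle, sample p in increments of 2r around 'best':
        candidates = [best - 2 * r, best, best + 2 * r]
        best = min(candidates, key=cost)
        r /= 2.0                        # refinement: the uncertainty shrinks
    return best

p = refine(0.0, 8.0, rounds=10)
print(abs(p - 3.1) < 0.02)   # True: the optimum is located to fine resolution
```

Early rounds only decide in which broad region p lies; only once r is small does the search resolve fine distinctions, mirroring a flow that optimizes wirelength first and crosstalk last.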

A variable of a design in progress is always estimated with some uncertainty. For example, the timing is fully known only if the routes are fully known. If we have only partial placement and/or route information, then the uncertainty on the net capacitances results in an even larger uncertainty on the timing. Another simple example is IR-drop (voltage drop in the power nets), which can be known only as well as the placement, toggle rates, and P/G routing are known. IR-drop itself leads to an additional uncertainty in timing (Section 7.4.6.2). At the beginning, the only available data are the gate-level netlist, the library, the technology files, and the constraints (timing, die area, pre-placed cells and pre-routes, power). Thus, unless we have enough information about the placement and routes, it is simply useless to do any kind of meaningful timing optimization beyond simple wirelength minimization. Indeed, if we know the resolution with which placement and routing are determined, we can determine the parameters of the design that can be estimated with enough accuracy to allow some valuable optimization. If we increase the resolution of the placement and routing, new design parameters become visible and can be optimized in turn. A flow respecting these observations starts with the initial netlist and physically builds the design, step by step, by increasing the resolution of every parameter simultaneously. Early in this process, there is only an approximate placement and global routing, and the clock tree and P/G network are roughly estimated (remember that we must account for all contributions to the congestion). At the end, there is a fully detailed, placed and routed netlist, including the completed clock tree and P/G routing. Between these two points, there is a continuous refinement process from rough to detailed, and all design variables are monitored and optimized simultaneously (when their resolution allows it) to meet the design constraints. As the design implementation progresses, the models become more accurate (the estimations have less uncertainty), and the actions taken to solve the problems become more and more detailed and localized. This continuous process of model refinement and design variable monitoring and optimization is the key to prediction and convergence. A refinement technique that knows the accuracy of each design measurement can react to a problem at the earliest appropriate time, i.e., as early as possible and only when it becomes meaningful. For instance, as the placement and global route are refined, wirelength and congestion can be estimated and optimized at the very beginning, then timing, then IR-drop, then crosstalk, etc. In the sequel we discuss various aspects of implementing such a refinement-based flow.

7.4.2. PLACEMENT TECHNIQUES

Placement techniques can be classified roughly into two types: analytical (constructive) placement, and move-based (iterative improvement) placement. Analytical placement uses linear or convex programming to minimize a closed-form, differentiable function. Move-based placement evaluates placement attempts and decides which moves to actually implement. We describe two analytical placement methods, quadratic and force-directed placement, and two move-based methods, simulated annealing and quadrisection-based placement. We then explain why the latter is a good choice for a refinement-based physical design flow.

7.4.2.1 Quadratic placement. Quadratic placement is based on minimizing the sum of the squares of the wirelengths. Every cell has an (x, y) coordinate in the plane, and the cost of a wire between cells i and j is captured by (xi − xj)² + (yi − yj)². Thus the problem translates into solving a linear system Ax + b = 0, where x is the vector of the movable cell coordinates, A is the connectivity matrix of the cells, and b is the vector of the fixed cell coordinates (e.g., I/O pads). Without fixed cells, the solution is trivial, with all the cells on top of each other. Even with fixed pads, the solution usually shows most of the cells overlapping. Thus quadratic placement iterates squared wirelength minimization and bi-partitioning [21]. Bi-partitioning divides the set of
cells in two and assigns each part to one area of the chip. Then squared wirelength minimization is
applied to each partition, and the process is iterated until a legal placement can be inferred. Quadratic
placement is very fast and can handle large designs. However, it aims at minimizing the squares of the
wirelengths, not the actual wirelengths. It is also not suitable for timing-driven placement. Net weighting
can be used to emphasize critical nets, but the criticality of a net may change as the placement changes,
and estimating the criticality of a net at the top level, with very fuzzy placement information, is
unrealistic. Also there is no good solution to handle congestion during quadratic placement. However,
because of its speed, quadratic placement is often used to seed a more sophisticated placement
method.
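As a minimal illustration of the linear-system view, the following sketch solves a one-dimensional quadratic placement for a chain of three movable cells between two fixed pads. The solver and the tiny netlist are invented for the example, not taken from any placement tool; setting the gradient of the squared wirelength to zero gives the linear system discussed above.

```python
# Toy 1D quadratic placement: minimize the sum of squared wirelengths for
# a chain of movable cells between two fixed pads.  Setting the gradient
# to zero yields a linear system (the "Ax + b = 0" form); it is solved
# here with plain Gaussian elimination to keep the sketch dependency-free.

def solve(a, rhs):
    """Gaussian elimination with partial pivoting on a dense system."""
    n = len(a)
    m = [row[:] + [rhs[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / m[r][r]
    return x

# Pads fixed at x=0 and x=1; cells 0-1-2 form the chain pad0-c0-c1-c2-pad1.
# Each cell's row has its degree on the diagonal and -1 per movable
# neighbour; fixed-pad positions move to the right-hand side.
A   = [[2.0, -1.0, 0.0],
       [-1.0, 2.0, -1.0],
       [0.0, -1.0, 2.0]]
rhs = [0.0, 0.0, 1.0]      # pad contributions

print(solve(A, rhs))       # ~ [0.25, 0.5, 0.75]: the cells spread evenly
```

With only two fixed pads, the minimum spreads the chain uniformly; with few or no fixed cells, as the text notes, the solution degenerates towards all cells on top of each other.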

7.4.2.2 Force-directed placement. Force-directed placement is another analytical method. The principle is to include, in the equations capturing the wirelength, a term that penalizes overlapping cells, so that a balance can be established between minimizing the wirelength and yielding a legal placement [10]. Various formulations of this principle can be used, and it is usually solved by conjugate gradient iterations or other convex programming techniques. Force-directed placement is slower than quadratic placement, but gives significantly better results. A key aspect is that some force-directed placement algorithms can place objects with a large range of sizes, enabling mixed block, megacell, and standard cell placement. However, there is a limit to what can be expressed with an analytical function. Some costs are thus hard to capture, and the tuning of the repulsive terms can be difficult.
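A minimal one-dimensional sketch of the idea follows: attractive spring forces along nets pull connected cells together, while a repulsive term pushes apart cells closer than their width, balancing wirelength against overlap. The constants, toy netlist, and simple fixed-step iteration are all invented for illustration (real formulations use conjugate gradients, as noted above).

```python
# Minimal 1D force-directed placement sketch.  Attraction is the gradient
# of squared wirelength; repulsion acts only between overlapping cells.

WIDTH, LAMBDA, STEP, ITERS = 1.0, 4.0, 0.05, 500

def forces(x, nets, fixed):
    f = [0.0] * len(x)
    for i, j in nets:                       # attraction: d/dx of (xi - xj)^2
        f[i] += -2.0 * (x[i] - x[j])
        f[j] += -2.0 * (x[j] - x[i])
    for i in range(len(x)):                 # repulsion for overlapping cells
        for j in range(i + 1, len(x)):
            d = x[i] - x[j]
            if abs(d) < WIDTH:
                push = LAMBDA * (WIDTH - abs(d))
                s = 1.0 if d >= 0 else -1.0
                f[i] += s * push
                f[j] -= s * push
    for i in fixed:                         # pads do not move
        f[i] = 0.0
    return f

# Cells 0 and 3 are pads fixed at 0 and 6; cells 1 and 2 are movable,
# start on top of each other, and are connected in a chain 0-1-2-3.
x = [0.0, 3.0, 3.0, 6.0]
nets = [(0, 1), (1, 2), (2, 3)]
for _ in range(ITERS):
    f = forces(x, nets, fixed=[0, 3])
    x = [xi + STEP * fi for xi, fi in zip(x, f)]

print(abs(x[2] - x[1]) >= 0.9)   # True: the overlap has been resolved
print(x[1] < x[2] < 6.0)         # True: order and bounds are preserved
```

Even on this toy case, the repulsion constant LAMBDA must be tuned against the spring forces, which is precisely the tuning difficulty mentioned above.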

7.4.2.3 Simulated annealing. Simulated annealing [30] is based on move-and-cost. A move consists of moving an object from one location to another (it can also be a swap, where two object locations are exchanged). A move is evaluated and accepted if it improves the cost. A move that degrades the cost is accepted with a probability usually expressed as e^(−|Δcost|/T), where Δcost is the fitness of the move (the delta in cost resulting from applying it), and T is a global parameter called temperature. At the beginning, the temperature is large, thus the probability of accepting a move that degrades the cost is high. As placement progresses, the temperature is lowered, and fewer bad moves are accepted. Accepting cost-degrading moves allows exploring larger solution spaces and avoiding local minima. Simulated annealing has a completely open cost function, and it can produce globally optimum solutions. However, it is an extremely slow process, limiting its application to small sets of objects. Due to its optimality potential and speed limitations, it is only used for end-case placement, e.g., detailed placement.
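The scheme above can be sketched in a few lines for a toy one-dimensional placement, where cells occupy slots, a move swaps two cells, and a worsening swap is accepted with probability exp(−|Δcost|/T). The netlist, cooling schedule, and constants are all invented for illustration.

```python
# Minimal simulated annealing sketch for a toy 1D placement problem.
import math
import random

def wirelength(perm, nets):
    """1D wirelength: span of each net over the slots its cells occupy."""
    slot = {cell: s for s, cell in enumerate(perm)}
    return sum(max(slot[c] for c in net) - min(slot[c] for c in net)
               for net in nets)

def anneal(perm, nets, t0=5.0, cooling=0.995, iters=4000, seed=7):
    rng = random.Random(seed)                # seeded for reproducibility
    cost = wirelength(perm, nets)
    best, best_cost = perm[:], cost
    t = t0
    for _ in range(iters):
        i, j = rng.randrange(len(perm)), rng.randrange(len(perm))
        perm[i], perm[j] = perm[j], perm[i]  # propose a swap
        delta = wirelength(perm, nets) - cost
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            cost += delta                    # accept (possibly uphill) move
            if cost < best_cost:
                best, best_cost = perm[:], cost
        else:
            perm[i], perm[j] = perm[j], perm[i]  # reject: undo the swap
        t *= cooling                         # lower the temperature
    return best, best_cost

cells = list(range(8))
nets = [(k, k + 1) for k in range(7)]        # a simple chain netlist
start = [3, 6, 0, 5, 2, 7, 1, 4]
initial = wirelength(start, nets)
perm, cost = anneal(start[:], nets)
print(cost <= initial)         # True: annealing never loses the best seen
print(sorted(perm) == cells)   # True: the result is still a permutation
```

Note that even this tiny instance needs thousands of cost evaluations, which is the slowness that confines annealing to end-case placement.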

7.4.2.4 Quadrisection and cluster-move based placement. Bisection placement iterates min-cut balanced partitioning and move-based placement. The netlist is partitioned into two sets of similar area, with as few nets spanning the two sets as possible [4]. Then cells are moved or swapped between partitions to reduce the cost. When a local minimum is reached, each set is partitioned, and moves are applied between the resulting partitions. Quadrisection is the same process with a partitioning into four partitions instead of two. Experience shows that it is better suited for 2D placement. Moving one cell at a time is far too slow to process a real design. A dramatic speedup is obtained by allowing groups of cells to be moved together. This method is also known as multilevel hypergraph partitioning [20]. A hierarchical tree of clusters of cells is first built from the netlist. The top of the tree is made of large clusters of cells, and the leaves of the tree are the individual cells. The cluster tree is built such that connectivity is reduced, area is balanced, and functional/physical hierarchical constraints, if any, are satisfied. It can be derived from a mix of balanced multi-way min-cut partitioning (top-down) and topology-driven netlist clustering (bottom-up). Then the clusters are placed and moved in a four-way min-cut quadrisection (Fig. 7.3). When a local minimum is reached, the netlist is re-clustered, each partition is quadrisected, and the whole process is iterated (Fig. 7.3). This placement technique has a fully open cost function, and it produces good quality solutions. Also, since it is move-based, it is better suited to timing-driven placement, provided that the partition resolution is sufficient to estimate the timing. This move-based method with an open cost function allows simultaneous optimization of all design aspects (including timing, congestion, wirelength, and crosstalk).
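The move-based building block can be sketched as a single greedy min-cut pass: compute each cell's gain (the cut reduction if it changed sides) and move positive-gain cells while the area balance allows. This is a simplified, single-cell cousin of Fiduccia-Mattheyses style passes, without hill-climbing or cell locking, and the hypergraph below is invented for the example.

```python
# Sketch of one greedy move-based min-cut bisection pass, the building
# block behind quadrisection/cluster placement.

def cut_size(nets, side):
    """Number of nets spanning both partitions."""
    return sum(1 for net in nets if len({side[c] for c in net}) == 2)

def gain(cell, nets, side):
    """Cut reduction obtained if 'cell' moves to the other partition."""
    g = 0
    for net in nets:
        if cell not in net:
            continue
        others = [side[c] for c in net if c != cell]
        if all(s != side[cell] for s in others):
            g += 1                      # this net would become uncut
        elif all(s == side[cell] for s in others):
            g -= 1                      # this net would become cut
    return g

def greedy_pass(cells, nets, side, max_skew=2):
    improved = True
    while improved:
        improved = False
        for cell in cells:
            bal = sum(1 if side[c] else -1 for c in cells)
            to_right = 1 if side[cell] == 0 else -1
            if abs(bal + 2 * to_right) > max_skew:
                continue                # move would unbalance the areas
            if gain(cell, nets, side) > 0:
                side[cell] ^= 1
                improved = True
    return side

cells = ["a", "b", "c", "d", "e", "f"]
nets = [("a", "b"), ("b", "c"), ("d", "e"), ("e", "f"), ("c", "d")]
side = {"a": 0, "b": 1, "c": 0, "d": 1, "e": 0, "f": 1}
before = cut_size(nets, side)
side = greedy_pass(cells, nets, side)
print(cut_size(nets, side) <= before)   # True: the pass never worsens the cut
print(cut_size(nets, side))             # 1: only net (b, c) remains cut
```

In the flow described above, the same move/gain machinery is applied to clusters rather than single cells, and the cost function is opened up to include timing and congestion, not just the cut.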

7.4.2.5 Placement refinement for physical design. A physical design flow must be aware of all the design variables. This requires an open cost function, which can capture timing, area, power, congestion, etc. This immediately rules out quadratic placement, which is only good at reducing squared wirelength and cannot have any understanding of critical paths or high congestion areas. Force-directed placement methods rely on analytical expressions that can be minimized using convex programming or by iterating conjugate gradients. They are slower and more difficult to tune than quadratic placement but capture


[Figure 7.3: the netlist is clustered, the clusters are placed into quadrisection bins, and the process is re-clustered and quadrisected again at finer resolution.]

Figure 7.3. Placement refinement.

far more elaborate costs than just squared wirelength. However, their cost function is semi-open, because a highly non-monotonic, non-smooth cost function cannot be easily included. Simulated annealing has an open cost function and can produce excellent solutions, but is extremely slow and just cannot be used on a real netlist. A placement technique based on moving clusters across partitions allows an efficient two-dimensional placement with the benefit of an open cost function. Moving a cluster of cells, instead of one individual cell at a time, makes the placer fast. Using a fully open cost function makes the placer aware of the impact that moving a cluster of cells will have on timing and congestion. Also, the flexibility of the cost function allows the placer to account for the impact of the clock tree, P/G routing, and scan chains on the quality of the design (e.g., timing, congestion, IR-drop). Another advantage of a placement based on refining the placement of objects in smaller and smaller bins is that it naturally allows several views of the chip. At the top level, there is a very rough placement in a few bins. At the bottom level, there is an accurate placement with a few dozen cells per bin, but it is still flexible (i.e., no precise (x, y) coordinates yet). In between, there are placement views that can be used to accommodate more or less disruptive netlist changes. One of the most immediate applications is to accommodate netlist changes performed by logic synthesis and optimization [16, 5]. Another crucial application in the context of complex chip design is to accommodate ECO (Engineering Change Order) changes. Most ECOs are still manual and consequently very local (e.g., changing/removing/adding a gate, connecting/disconnecting a pin to/from a neighbor net). However, as the design complexity increases, so does the level of abstraction used by the designer. An ECO change at the RTL level can affect hundreds of cells, and it is even more disruptive at the behavioral level. Having several levels of placement resolution dramatically eases the integration of disruptive ECOs.


In the rest of this discussion, we will assume the use of a move-based quadrisection cluster placer. The placement is refined into smaller and smaller bins. As the number of quadrisections increases, the place and route information becomes more detailed, and more sophisticated estimation models using the finer resolution can be used. Placement is first driven by wirelength and congestion until it reaches a resolution at which there is enough information for meaningful timing estimation and optimization. At this point, and from this point only, logic optimization to fix timing and congestion problems can be done. Placement and logic optimization then proceed together, first addressing timing and congestion, then considering crosstalk and signal integrity as the refinement allows.

7.4.3. LOGIC OPTIMIZATION

The gate-level netlist is initially synthesized with no placement information, and therefore with a crude estimation of delays. Timing becomes meaningful when enough placement information is available. Only at this moment can the logic be revisited with a good understanding of timing, as well as congestion and power. Physical logic synthesis and optimization consists of changing the netlist to optimize physical aspects of a design (timing, area, power, congestion, etc.) in the context of place and route information. It is significantly different from gate synthesis, since it has many more parameters to consider, and has to properly interact with the placer and router. Physical logic synthesis and optimization has the following requirements. It needs enough place and route information. Placement and routing must be able to accommodate the logic changes. It is by nature local, because it goes together with placement refinement, and thus it should not disrupt the logic on a scale larger than the placement resolution. It should take advantage of accurate physical data, in particular for timing (input slope dependence, output load dependence, interconnect delay, rising/falling signal delay dependence, crosstalk awareness). It also needs an efficient incremental static timing analyzer, since many local logic transformations may be needed and must be reflected at the global level. The static timing analyzer must support multi-cycle paths, false paths, and transparent latches, and be crosstalk aware (see Section 7.4.6.1). The delay model accuracy must match the resolution of the placement. The simplest delay metric is in terms of total net capacitance. In this model, all fanouts of a net have the same delay, since the net is
Logical and Physical Design: A Flow Perspective

approximated as an equipotential surface. It is valid as long as the metal resistance is negligible with
respect to the gate resistance. With shrinking features, the metal resistance per unit of length increases,
while the gate output resistance decreases. Metal resistance cannot be ignored for an increasing
number of nets. A net must be treated as a distributed RC network with different delays at its fanouts.
Elmore delay can be used as a first approximation, but more detailed models are required as metal
resistance increases. Logic optimization can significantly move cells and locally modify the timing and
area distribution. As said above, logic optimization should not disrupt the netlist on a scale larger than
the placement resolution (i.e., the size of a bin) to ensure that placement can accommodate the
disruption. Thus as placement resolution increases, the logic optimization becomes less disruptive. At
higher levels of quadrisection, the placement is very flexible, and aggressive re-synthesis and technology
mapping can be used. At lower levels, only local optimizations, e.g., sizing and buffering, or very focused
re-synthesis, should be authorized. After detailed placement, only restricted sizing can be applied.
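The Elmore delay mentioned above can be illustrated with a short stand-alone computation: along the path to each fanout, every wire segment contributes its resistance times the total capacitance it drives downstream. The tuple-based tree representation and the unit R/C values below are our own, chosen only to show that fanouts of one net see different delays.

```python
def downstream_cap(node):
    """Total capacitance at and below this node."""
    _r, c, children = node
    return c + sum(downstream_cap(ch) for ch in children)

def elmore_delays(node, name="n0", acc=0.0, out=None):
    """Elmore delay from the source to every node: each segment adds its
    resistance times the capacitance it drives downstream."""
    out = {} if out is None else out
    r, _c, children = node
    d = acc + r * downstream_cap(node)
    out[name] = d
    for i, ch in enumerate(children):
        elmore_delays(ch, name + "." + str(i), d, out)
    return out

# driver segment (R=1, C=1) branching to two sinks with unequal wire R
net = (1.0, 1.0, [(1.0, 1.0, []), (2.0, 1.0, [])])
delays = elmore_delays(net)
# delays["n0.0"] != delays["n0.1"]: the equipotential model breaks down
```

With negligible metal resistance (all segment R near zero) the two sink delays collapse to the same value, which is exactly the equipotential approximation discussed above.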
Among the physical logic synthesis and optimization techniques, one can distinguish:

Load and driver strength optimization. This includes gate sizing, buffering, pin swapping, and gate cloning.

Timing boundary shifting. This includes transparent latches (i.e., cycle stealing), useful skew, and
retiming.

Resynthesis and technology remapping.

Redundancy-based optimization.

Area and/or power recovery.

These transformations have to decide the placement of the cells. Some new logic transformations
specific to physical design, such as synthesis for congestion relief [8], or logic optimization for signal
integrity, are emerging. In the sequel we discuss some logic optimization methods in the context of
physical design: sizing, buffering, resynthesis, and technology mapping.

7.4.3.1 Sizing. The goal of gate sizing is to determine optimum sizes for the gates so that the circuit
meets the delay constraints (slope, setup, and hold) with the least area/power cost. A larger gate will
have a higher drive strength (lower resistance) and hence it can charge/discharge output capacitances
faster. However, it usually also


has a higher input capacitance. This results in the preceding gate seeing a larger capacitive load and
thus suffering an increased delay. Downsizing a gate off the critical path, but driven by a cell on the
critical path, may decrease the capacitance seen by the driver, but it can also slow down the
off-critical path to the point where it becomes critical. Sizing thus requires a careful balancing of these
conflicting effects. An optimal solution requires coordinating the correct sizes of all the gates
along and off the critical paths. We can consider a global solution [6], where the sizes for all the gates in
some critical section are determined simultaneously. Analytical techniques have been proposed for sizing
in the continuous domain (e.g., linear programming [3], posynomial [11], convex programming [17, 29]),
and various attempts have been made to use these results with discrete sizes in a cell-library based
design style [15, 2]. These methods can be practical at the gate level, where full accuracy is not an issue,
or at the transistor level, when the models are well characterized and have nice convex properties. The
theory of constant effort [27] and its more usable approximation, constant delay [13], provides a way to
select cell sizes directly for each stage. For example, the optimal delay on a fanout-free path is obtained
by distributing the effort evenly among the logical stages. This means that if the delay of the whole path
is fixed, then the delay of every stage must be kept constant. Thus, cell sizes can be selected to match
this constraint by visiting all the gates in topological order, starting from the outputs. See Section 7.3
for some limitations of constant delay sizing. Sizing at the physical level must use the most accurate
delay model, including interconnect delay, and must consider the input slope effect, as well as the
difference between falling and rising delays. This cannot always be captured by analytical techniques.
Also, these techniques assume that the input pin capacitances of a gate can be expressed as a smooth
function (usually linear) of the gate's size (i.e., drive strength). Unfortunately this assumption is often
broken: input pin capacitance does not always nicely correlate with gate size, and sometimes it is not
even a convex function. It is possible to design libraries that specifically meet these requirements, but
library design and validation is a huge burden and investment. Alternatively, discrete sizing can be done
using global search techniques [7]. These techniques attempt to reach a global optimum through a
sequence of local moves, i.e., single-gate changes. While these are computationally expensive, they can
take advantage of very accurate delay models, they are not limited by assumptions needed by analytical
techniques (e.g., a well-built library with convex input pin capacitances), they can enforce complex
constraints by rejecting moves that violate them (e.g., validity of the placement), and they can
simultaneously combine sizing with other local logic optimizations (e.g., gate placement, buffering, pin
swapping). They have been shown to provide good results for discrete libraries. They are particularly
effective in the context of physical design, since they can focus on small critical sections, use the most
accurate delay models including interconnect delay, and find a small sequence of moves that meets
timing constraints while maintaining the validity of the placement. Analytical sizing methods based on
simple convex delay models are global and fast. Discrete sizing methods are slower, but very accurate
and focused. Both methods should be used by physical synthesis, driven by the size of the bin during the
placement refinement, because the bin size determines the extent of the authorized netlist change, and
the accuracy of the physical information.
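The constant-effort selection described above can be illustrated on a small fanout-free path. In this sketch (the logical-effort values and capacitance units are illustrative, not from the chapter), every stage is assigned the same effort f = (G·H)^(1/N), and input capacitances are back-solved from the load by visiting gates from the output, exactly the topological-order traversal mentioned in the text.

```python
def size_chain(g, c_load, f):
    """Back-solve stage input caps so every stage carries effort f
    (f = g_i * c_out_i / c_in_i), walking from output to input."""
    caps = []
    c_out = c_load
    for gi in reversed(g):           # topological order from the outputs
        c_in = gi * c_out / f
        caps.append(c_in)
        c_out = c_in
    return list(reversed(caps))      # input caps, path input to output

g = [1.0, 4.0 / 3.0, 5.0 / 3.0]      # e.g., inverter, NAND2, NOR2 efforts
G = g[0] * g[1] * g[2]               # path logical effort
H = 64.0                             # load cap / input cap of the path
f_opt = (G * H) ** (1.0 / len(g))    # equal effort per stage
caps = size_chain(g, 64.0, f_opt)    # caps[0] recovers c_load / H = 1.0
```

Because the per-stage effort is the N-th root of the total path effort, the back-solved first input capacitance lands exactly on c_load/H, confirming the sizes are consistent with the fixed path delay.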

7.4.3.2 Buffering. Buffering serves multiple functions:

Long wires result in signal attenuation (slope degradation). Buffers (a.k.a. repeaters in this context) are
used to restore signal levels.

A chain of one or more buffers can be used to increase the drive strength for a gate that is driving a large
load.

Buffers can be used to shield a critical path from a high-load off-critical path. The buffer is used to drive
the off-critical path load so that the driver on the critical path sees only the buffer's input pin
capacitance in place of the high load.

It is possible to come up with ad-hoc solutions for each of the above cases. For example, repeaters can
be added at fixed wirelength intervals determined by the technology. However, critical nets often need
to be buffered to satisfy more than one of the above listed requirements. Thus, we prefer algorithmic
solutions that balance the attenuation, drive strength, and shielding requirements. This problem is
hopelessly NP-hard. An interesting solution exists for the case when the topology of the buffer tree is
fixed and the potential buffer sites are fixed. This is the case when the global route for the net is already
determined and the buffer tree must follow this route. A dynamic programming algorithm, using an
Elmore delay model for the interconnect, finds an optimal solution in polynomial time [12]. Various
attempts have been made to overcome the two limitations of this algorithm (the fact that the topology
is already fixed, and that the Elmore delay model does not accurately capture the resistive effects of
DSM technologies). The former is considered to be more of a problem,
because fixing the topology can severely limit the shielding possibilities and lead to overall sub-optimal
solutions. What is needed is the ability to determine the net route as part of the buffering solution.
Some attempts have been made to develop heuristic solutions [23, 28], including adding additional
degrees of freedom like wire and driver sizing. These techniques give better solutions than the fixed
topology approach. Various constraints must be handled when routing and buffering the net, like
routing blockages and restricted buffer locations due to placement keepouts [33, 18]. Determining the
optimal buffer tree for a given net is only one part of the complete buffering problem. Given that
buffering a given net can change the constraints (required time/slack, load) on the pins of another net,
the final solution is sensitive to the order in which the nets are visited. In addition, once a net is
buffered, the gates may no longer be optimally sized. Resizing gates before the next net is buffered can
modify the buffering problem. Researchers have considered combining sizing and buffering into a single
step [19], but again this problem is very complex and far from being solved.
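The dynamic programming algorithm cited above [12] can be sketched for the simplest case: a single driver-to-sink path with fixed candidate buffer sites and Elmore wire delay. This is a deliberately simplified variant (no branching, one buffer type, all R/C values illustrative): walking from sink to driver, it keeps a set of (downstream capacitance, required time) candidates and prunes dominated ones.

```python
def prune(cands):
    """Keep only non-dominated (downstream cap, required time) pairs:
    a candidate survives unless another has lower cap AND higher q."""
    cands.sort(key=lambda cq: (cq[0], -cq[1]))
    kept, best_q = [], float("-inf")
    for c, q in cands:
        if q > best_q:
            kept.append((c, q))
            best_q = q
    return kept

def insert_buffers(segments, q_sink, c_sink, buf, r_drv):
    """segments: (R, C) wire pieces from driver to sink; buf: (input cap,
    intrinsic delay, output resistance). Returns best slack at the driver."""
    cands = [(c_sink, q_sink)]
    for R, C in reversed(segments):              # walk sink to driver
        # propagate through the wire segment (Elmore delay R*(C/2 + c))
        cands = [(c + C, q - R * (C / 2.0 + c)) for c, q in cands]
        # optionally place a buffer at the upstream end of this segment
        c_b, d_b, r_b = buf
        buffered = [(c_b, q - d_b - r_b * c) for c, q in cands]
        cands = prune(cands + buffered)
    return max(q - r_drv * c for c, q in cands)  # driver output resistance

wire = [(1.0, 1.0)] * 4                          # four identical segments
no_buf = insert_buffers(wire, 10.0, 1.0, buf=(0.2, 1e9, 0.5), r_drv=1.0)
best = insert_buffers(wire, 10.0, 1.0, buf=(0.2, 0.5, 0.5), r_drv=1.0)
# best > no_buf: buffering recovers slack on the long resistive wire
```

The pruning step is what keeps the candidate set polynomial in the full algorithm; here, making the buffer prohibitively slow (huge intrinsic delay) degenerates the run to the unbuffered slack, which gives a baseline to compare against.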

7.4.3.3 Resynthesis and technology remapping. Technology mapping attempts to find the best selection
of cells from a given cell library to meet a given delay constraint with the least area/power. Technology
remapping during physical synthesis attempts to find a better mapping by using existing physical
information to determine interconnect capacitance and delay. Technology mapping has been well
studied [14]. Knowing where to apply it is the key. The challenge is to work on sections small enough to
not significantly disturb the placement, yet significant enough to improve the design. Another challenge
is to determine where to place the new cells created during the remapping phase [22]. Some simple
solutions based on fixed boundary locations can be used during the mapping itself, with a clean-up step
to make the placement legal [24]. Logic restructuring is the strongest technique in the suite of logic
optimizations because it can significantly change the structure of the netlist. This makes it the most difficult
to apply in a physical design flow where the changes are expected to be small to maintain the validity of
the placement. However, the basic ideas in restructuring can still be used to improve the timing and
congestion properties of the netlist as long as the changes are focused on key sections and the
modifications are within the space of acceptable changes, e.g., the changes do not violate
capacity/congestion constraints for various physical regions of the design. The restructuring techniques
themselves range from gate collapsing/decomposition (based on commutative, associative, and
distributive properties of logic functions [14]) to sophisticated Boolean resynthesis (e.g., BDD-based).
Again the challenge is knowing where to apply them, and maintaining the constraints imposed by the
existing physical design. Several researchers have worked on this [25]; however, no dominant technique
has emerged. Timing-driven logic resynthesis based on arrival times of critical signals can be used; the
basic idea is to use late signals as close to the outputs as possible. However, one also wants to keep
together signals that are physically close to each other. This means that in addition to its arrival time,
the location of a signal must be taken into account as the Boolean function is synthesized. Resynthesis
techniques can help relieve congestion. An innovative technique consists of designing the interconnect
before the logic. The logic is then synthesized by following the Boolean constraints established by the
interconnect [8]. This problem does not always have a solution; thus either the interconnect must have
enough flexibility to enable the synthesis of the logic, or the logic itself must be restricted up front to
guarantee a feasible solution.

7.4.4. CLOCK TREE SYNTHESIS Clock tree synthesis must be part of the refinement process, because it
significantly affects congestion, power dissipation, and timing via clock skew. Targeting zero-skew clock
trees has no justification other than historical. Instead, skew can be used to optimize the timing. Furthermore, a
non-zero-skew clock tree ensures that the sequential elements will not all toggle at the same time, thus
reducing undesirable peak power. After some level of quadrisection, the distribution of the sequential
elements and gated clocks in the bins will not change substantially. At this level one can determine the
amount of routing and buffering needed to carry the clock signal, and the trunk of the clock tree can be
built (Fig. 7.4, left). From this point these resources are accounted for by placement for congestion. As
the partition size decreases and the detail increases, the clock tree is refined and its resources are
updated. This leads to no congestion surprises at the end, and it gives tight control over clock skew
requirements, since the placement is flexible enough that clock pins with common skew can be
grouped together. It also allows a fine control of the skew for timing optimization, because the critical
paths are continuously monitored.


Figure 7.4. Clock tree and P/G network refinement.

Note that the scan chains can be similarly refined, un-stitched and re-stitched to accommodate the
placement of sequential elements and to minimize congestion while still meeting the scan ordering
constraints.

7.4.5. POWER/GROUND ROUTING The power/ground network can have a huge impact on congestion.
Initially, power routing is performed according to a user-defined routing topology (e.g., chip ring, power
grid). At some level of the quadrisection, IR-drop analysis can be done to check the reliability of the P/G
network (Fig. 7.4, right). This assesses the quality and integrity of the power routing, because the power
rail currents will not change much as the placement is refined. The power consumption depends on net
capacitances and toggle rates (number of transitions on a net per unit of time). The switching activity
can be obtained by an external simulation (e.g., a VCD file), by an on-the-fly simulation (slow, but
accurate), or by probabilistic analysis (fast, but easily misleading). With the distribution of the current
sources in the bins, one can extract a power network that is simulated to produce an IR-drop map. This
helps in adjusting the power grid, since at this level the placement is flexible enough to accommodate
adding and/or widening power stripes.
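Lumping the cells of a bin into a single current source might look like the following sketch. The formulas are the standard dynamic-power approximations (switching energy C·VDD²/2 per transition, average current P/VDD); the supply voltage and the cell values are illustrative, not from the chapter.

```python
VDD = 1.5  # volts (illustrative supply)

def cell_current(c_net, toggle_rate):
    """Average current [A] of a cell driving net capacitance c_net [F]
    toggling toggle_rate times per second: I = (C*VDD^2/2 * T) / VDD."""
    power = 0.5 * c_net * VDD ** 2 * toggle_rate
    return power / VDD

def bin_current(cells):
    """Lump all cell currents in a bin into one source for the P/G grid."""
    return sum(cell_current(c, t) for (c, t) in cells)

# two cells in a bin: 100 fF toggling at 10 MHz, 50 fF toggling at 100 MHz
i_bin = bin_current([(100e-15, 1e7), (50e-15, 1e8)])   # 4.5 uA total
```

This per-bin current is what feeds the fast IR-drop simulation described above; the probabilistic-versus-simulated distinction in the text only changes how the toggle rates are obtained.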

7.4.6. SIGNAL INTEGRITY Signal integrity refers to the many effects that cause the design to malfunction
due to the distortion of the signal waveform. These effects were not important in the previous
generation of designs. They have become key issues for the correct functioning and performance of
DSM chips. In this section we identify the following signal integrity effects and outline possible
solutions to address them: crosstalk for delay and noise; IR drop; electromigration; and inductance.

Figure 7.5. Noise and delay crosstalk effects. (An aggressor net couples into a victim net through
cross-coupling capacitance Cc; the victim drives a load CL. The two panels illustrate noise propagation
and increased delay.)
and inductance.

7.4.6.1 Crosstalk. When a signal switches, the voltage waveform of a neighboring net may be affected
due to cross-coupling capacitance between interconnects. This effect is called crosstalk. Crosstalk can
hurt both timing and functionality. It was not an issue with designs of 0.5 µm or larger line-widths, but as
the technology decreases to 0.25 µm and below, the coupling capacitance becomes the dominant factor
(Section 7.2.1) and induces crosstalk effects that cannot be ignored. Crosstalk analysis determines the
noise induced on a net (the victim) by its switching neighboring nets (the aggressors). Analysis can be
static or dynamic. When the victim net is quiescent, or if there is a separation of the switching windows
of the victim net from the aggressor net, crosstalk induces a noise that can be analyzed statically. If the
aggressor net causes enough voltage variation in the victim net to affect its digital state (i.e., from logical
1 to 0, or vice-versa), the noise propagates and can eventually be latched by a flip-flop to produce a
functional fault (Fig. 7.5, top). When the switching windows of the aggressor and victim nets overlap,
crosstalk delays (respectively speeds up) the victim net if the aggressor net switches in the opposite
(respectively same) direction, which may cause setup (respectively hold) problems (Fig. 7.5, bottom). It
has been proposed to estimate the crosstalk effect by replacing the cross-coupling capacitance Cc with a
grounded capacitance of 2Cc for the worst case (respectively 0 for the best case). This method is
oversimplistic. First, taking 2Cc for the worst case is not an upper bound on the delay induced by
opposite-direction switching. Second, it does not capture the dynamic effect of crosstalk. Such a method
simply cannot be used for an accurate timing signoff.

Figure 7.6. RC extraction for crosstalk analysis. (Top: extracted aggressor/victim cross-coupled RC
network. Bottom: victim output waveforms for same- and opposite-direction aggressor switching,
compared with the static 0x and 2x grounded-capacitance models; the annotated delay spreads are
0.06 ns and 0.12 ns.)

Fig. 7.6, bottom, shows the actual waveforms of the victim net for an aggressor switching 0.1 ns after the
victim signal, both for the same and opposite directions. It clearly shows that the waveforms obtained by
considering static 2Cc and 0 grounded capacitances are not good approximations. The actual waveforms
show that the victim net's delay is affected over a range of 0.06 ns, which is substantial. If the aggressor
net's switching window is closer to the victim net's, this range is larger, and it decreases as the windows
move further apart. A solution consists of a static timing analyzer that generates the switching-time
windows for each net [1], together with an accurate cross-coupling RC extractor (Fig. 7.6, top). The
dynamic effect of
crosstalk can thus be accurately analyzed.

Figure 7.7. Fixing cross-talk at detailed routing (extra spacing, grounded shields, and rerouting around
the victim net).

The same coupling-effect analysis is used for noise analysis. Instead of
evaluating the crosstalk contribution to path delay, the waveform calculator determines the amplitude
of the induced noise on the victim net. If the noise is larger than some threshold associated with the cell
driven by the net, a violation is generated. Typically, crosstalk analysis is done at the post-layout stage
(e.g., using a transistor-level simulator like SPICE). Then the router tries to fix the problems, e.g., by
spacing and/or shielding the afflicted nets, or by switching them to different layers (Fig. 7.7). This can
require a major rip-up and re-route effort that is not guaranteed to be automatically feasible (in reality,
manual fixes are required). Iterating such a tedious post-layout analysis and fixing process can
result in a long design time. Post-layout fixing is a costly process and may not converge, so crosstalk
needs to be addressed as early as possible in the flow. The router can be constrained to avoid crosstalk
effects in the first place, e.g., by forbidding more than a maximum length of parallel routing between any
pair of nets. However, this method is based on empirical data and does not reflect the physics of crosstalk,
which must include signal-switching window dependency. The router ends up over-constrained, leading to
unresolvable congestion problems. There are attempts at implementing crosstalk avoidance during
placement and global routing. The idea is that even if detailed routing is not available, the signal
switching windows can be used statistically to identify the potential problem nets.

Figure 7.8. IR drop causes delay. (Buffer delay versus VDD drop: 0 V, 0.114 ns; 0.15 V, 0.126 ns (+10%);
0.3 V, 0.143 ns (+25%); 0.5 V, 0.184 ns (+61%).)

The problem nets are given more white space during global routing so that the detailed routing has
enough resources to perform spacing, shielding, or re-routing and automatically fix crosstalk problems.
This also requires the detailed router to have gridless capabilities with variable width and variable
spacing so that crosstalk issues can be addressed effectively.
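A switching-window based filter of the kind produced by such a timing analyzer [1] can be sketched as follows. The interval representation and the classification labels are our own; real analyzers track per-net rise/fall windows and noise thresholds.

```python
def windows_overlap(w1, w2):
    """True if two (start, end) switching windows can overlap in time."""
    return w1[0] <= w2[1] and w2[0] <= w1[1]

def analysis_needed(victim_win, aggressor_win, victim_quiet=False):
    """Pick the analysis for a coupled victim/aggressor pair."""
    if victim_quiet or not windows_overlap(victim_win, aggressor_win):
        return "static-noise"      # coupling can only inject a glitch
    return "dynamic-delay"         # coupling changes the victim's delay

kind1 = analysis_needed((0.0, 1.0), (2.0, 3.0))   # disjoint windows
kind2 = analysis_needed((0.0, 2.0), (1.0, 3.0))   # overlapping windows
```

Pairs filtered into the static category only need a noise-amplitude check against the receiving cell's threshold; the dynamic category is where the waveform-level delay analysis described above is required.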

7.4.6.2 IR-Drop. IR drop is the problem of voltage change on the power and ground supply lines due to
high current flowing through the resistive P/G network. When the voltage drop (respectively rise) in the
power (respectively ground) network becomes excessive, it causes delays in gates that can produce
timing violations and functionality problems. It also causes unreliable operation because of smaller noise
margins. As an example, 1 A of current flowing through 1 Ω of interconnect causes a 1 V drop, which is 2/3
of a 1.5 V power supply. Fig. 7.8 shows the effect of a voltage drop at VDD on the performance of a
buffer. IR drop becomes more critical for DSM designs because (1) there are more devices, and thus a higher
current demand; (2) the wire and contact/via resistance increases because of narrower wires and fewer
contacts/vias; and (3) the supply voltage is decreased to 1.5 volts and below. For many designs, IR drop
is being addressed by over-designing the power network with wide power buses and multiple power
meshes. However, this severely reduces the available routing resources and is likely to cause routing
congestion problems.


Figure 7.9. Net failure due to electromigration.

An accurate IR drop analysis is done with a transistor-level simulation that computes the dynamic current
flows. This is a costly process, and at too low a level to allow fixing problems. The simulation can be
done at the cell level using cell-level power models, an RC model for the interconnect, and the
switching directions and frequencies. This produces an average current per cell, from which the average
voltage drop can be approximated with the resistance model of the power and ground network.
Although less accurate, this method can identify regions with high voltage drops. When the level of
quadrisection is fine enough, the interconnect loading can be obtained accurately while the placement
has enough flexibility to accommodate power routing changes. Optimizing the power routing for both
IR drop and congestion can be done at this level. The currents of the cells in a bin are accumulated and
represented as a single current source to the power grid. A fast simulation is then used to evaluate the
voltage drop so that the power network can be adjusted accordingly. Power stripes can be narrowed or
suppressed to free some routing resources, or widened and augmented to meet the IR drop
specification.
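For a single power rail fed from one pad, the fast evaluation described above reduces to accumulating downstream currents: the current through each rail segment is the sum of all tap currents beyond it. The segment resistances and bin currents below are illustrative.

```python
VDD = 1.5  # volts (illustrative supply)

def rail_voltages(seg_r, tap_i):
    """Voltage at each tap of a rail fed from one end.
    seg_r[k]: resistance between tap k-1 and tap k (tap -1 is the pad);
    tap_i[k]: current drawn at tap k (e.g., one lumped bin source)."""
    v, drop = [], 0.0
    for k, r in enumerate(seg_r):
        downstream = sum(tap_i[k:])   # all current still to be delivered
        drop += r * downstream
        v.append(VDD - drop)
    return v

# four bins drawing 10 mA each through 0.05-ohm rail segments
volts = rail_voltages([0.05] * 4, [0.01] * 4)   # worst drop at the far end
```

The voltage is monotonically decreasing along the rail, so checking the far-end tap against the IR-drop specification bounds the whole rail; widening a stripe simply scales the corresponding segment resistances down.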

7.4.6.3 Electromigration. Increased current density through an interconnect for an extended period of time
causes its resistance to increase. This results in self-heating and metal disintegration, which creates an
open or short in the circuit (Fig. 7.9). This effect is called electromigration (EM), and it was not a problem
until the advent of DSM. As circuits get faster and bigger, more current flows through the interconnects, which at the same time are
getting narrower with every new generation. The current density (the amount of current flowing per
unit area) increases super-linearly with every new DSM generation. It is no longer feasible for
manufacturing to provide enough cross-section area on the interconnects to guarantee net integrity at
all line widths. EM has traditionally been addressed by over-designing with wide power buses, which is
where most of the EM issues are expected. However, over-designing power routing may cause
congestion problems and is no longer acceptable. Also, electromigration effects have become important
for clock trees, and will affect signal nets in the future. For clock and signal nets, there may not be
enough contacts/vias to sustain the current through the interconnect. A solution that provides
sufficient interconnect widths without excessive over-design is necessary. Tools that compute the
current densities from the layout are used to analyze EM problems, which are then fixed manually by
widening the problem nets. However, this requires considerable expertise, and the extra space
necessary for widening the nets may not be available. It is possible to address EM much earlier in the
design flow, by calculating the required width of the interconnect as placement and routing are refined.
Once the placement is accurate enough for a good estimation of the net capacitances, one can calculate
the current flow in these nets, which, together with the switching activity of the nets, enables
electromigration analysis. The interconnect width and via count needed to support the current are
identified at each level of quadrisection. These routing resources are then allocated by the global router.
Consequently, the detailed router will be able to satisfy the electromigration routing requirements
together with the other requirements.
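The width and via-count calculation described above amounts to enforcing a current-density limit. A minimal sketch, assuming a simple per-layer density limit J_max and a per-via current rating (all numbers illustrative, not from any foundry rule deck):

```python
import math

def min_width(i_avg, j_max, thickness):
    """Smallest wire width [m] keeping current density below
    j_max [A/m^2] for metal of the given thickness [m]."""
    return i_avg / (j_max * thickness)

def min_via_count(i_avg, i_per_via):
    """Number of parallel vias needed to carry i_avg."""
    return math.ceil(i_avg / i_per_via)

w = min_width(2e-3, 2e9, 0.5e-6)     # 2 mA in 0.5 um thick metal -> 2 um
n = min_via_count(2e-3, 0.5e-3)      # 0.5 mA rating per via -> 4 vias
```

These per-net widths and via counts are exactly the routing resources the text says should be handed to the global router at each quadrisection level, instead of being fixed manually after layout.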

7.4.6.4 Inductance. As the clock frequency of the design increases above 500 MHz with process
technologies of 0.13 µm and below, inductance becomes an important factor for long nets with
high-frequency signals. On-chip self inductance affects long nets, and mutual inductance affects nets running
in parallel for long distances. Clock nets should be analyzed for self inductance, and bus signals for
mutual inductance. A common approach to address inductance consists of sandwiching the signal layers
between power planes (Fig. 7.10, top left). However, this is no longer used due to high manufacturing
costs and high power consumption. An easier way to limit inductance effects is to provide shorter
current return paths for the high-frequency signals. For clock networks, the current return path should
be a ground shield running parallel to the clock net in the same layer (Fig. 7.10, top right).

Figure 7.10. Inductance avoidance. (Top left: power-plane sandwich. Top right: GND/VDD shields around
the clock signal. Bottom left: shields around bus signals. Bottom right: staggered inverters.)

For buses, the common practice is to insert a ground wire every 4 to 8 bus signals (Fig. 7.10, bottom left).
For pairs of long wires running in parallel, staggered inverters cancel the mutual inductive effects (Fig.
7.10, bottom right). On-chip inductance is a difficult problem without a clear emerging solution, and it is
an area of active research. Even though inductance today affects only the clock tree, signal nets will be
affected in the future as the clock frequency rises. New global and detailed routing strategies will be
needed to handle this effect.

7.4.7. ROUTING One of the most important requirements to achieve good routing is the correlation
between the global and detailed routers. The detailed router must finalize what the global router
predicted, and the routing resources allocated by the global router must actually be available to the
detailed router. It is unrealistic to use two uncoupled global and detailed routers, since congestion is a
dominant factor in DSM design. The global router must be timing, congestion, and crosstalk driven, and it
must support multiple widths and spacings. Since at the beginning it is impossible to know the route of
long wires, one can use a probabilistic route to smear the net over the region where it is likely to lie. As the
placement is refined and timing models become more accurate, this probabilistic region shrinks, and
eventually the net can be fully defined. The detailed router should support both gridded (for speed) and
gridless (for detailed optimization) modes. It must support variable wire width, e.g., for delay
optimization. It must enforce numerous DSM routing rules, e.g., metal/via antenna, end-of-line,
minimum area, via array generation for fat-wire connections, and variable spacing based on width. It
must also maintain timing, congestion, crosstalk, and electromigration awareness at all times (e.g.,
during layer assignment and wire sizing).

Figure 7.11. Placement/synthesis/routing interaction.

7.4.8. PLACE/SYNTHESIS/ROUTE INTERACTION Fig. 7.11 illustrates one aspect of the placement, routing,
congestion, timing, and synthesis interaction. At the beginning, a probabilistic route is generated that
spans the area in which the net will likely lie. As placement is refined, it takes into account the congestion
produced by the smeared route. When synthesis decides to modify the netlist, e.g., by adding buffers to
fix a timing problem, the contribution of these buffers to the congestion is also taken into account. The
global route is consequently refined, seeded here by the placement and the buffers introduced by logic
optimization.
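The probabilistic "smearing" of a route can be approximated by spreading each net's routing demand uniformly over the grid cells of its bounding box. This is a common early-flow estimate, not the chapter's specific model; the grid representation is illustrative.

```python
def congestion_map(nets, nx, ny):
    """Spread one unit of routing demand per net uniformly over its
    bounding box (x0, y0, x1, y1), given in inclusive bin coordinates."""
    grid = [[0.0] * nx for _ in range(ny)]
    for (x0, y0, x1, y1) in nets:
        cells = (x1 - x0 + 1) * (y1 - y0 + 1)
        for y in range(y0, y1 + 1):
            for x in range(x0, x1 + 1):
                grid[y][x] += 1.0 / cells
    return grid

# a horizontal net across row 0 and a small 2x2 net overlapping it
demand = congestion_map([(0, 0, 3, 0), (1, 0, 2, 1)], nx=4, ny=2)
```

As placement refines and a net's bounding box shrinks, its unit of demand concentrates into fewer bins, which is the shrinking probabilistic region described above; total demand is conserved across refinement levels.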

7.4.9. DESIGN SIGNOFF As the level of quadrisection increases, the layout is known in more detail,
and the design variables (area, congestion, timing, power, signal integrity) are estimated more
accurately until their optimization becomes meaningful. From this point on, they start to be optimized
along with the other variables. Eventually one reaches a point where the layout is known in sufficient
detail to accurately predict the outcome of the implementation process (final GDSII). The
state of this design is a physical prototype. Although it occurs well before the design is fully placed and
routed, a physical prototype corresponds to the moment where one can truly deliver an actual design of
signoff quality to the backend tool.


7.5. CONCLUSION & PERSPECTIVE Every aspect of design complexity keeps increasing: more
transistors, higher clock frequencies, and more pronounced physical effects. We discussed the
challenges that need to be addressed during the physical implementation from gate-level netlists to a
production-worthy GDSII. We explained how a flow based on placement and routing refinement (including
clock tree, power routing, and scan chains) with an open cost function, together with physical logic
synthesis and optimization, meets some of these challenges. It enables early estimation and
optimization of congestion, timing, and physical effects while the design has enough flexibility to
accommodate the perturbations produced by the optimization procedures: the physical prototype restores
the timing signoff lost in the mid-1990s. The interconnect-centric DSM era raises new problems, where
placement, routing, and logic optimization are tightly interdependent. Hierarchical design is a necessary
direction to handle the capacity and project management problems of multi-million gate designs. This
creates new problems, like chip- and block-level physical contexts, and timing abstraction. In essence,
the design process above the gate level needs to be more placement aware. At the same time, RTL-to-gate
synthesis should not over-optimize a netlist with no place and route data. This leads to a very different
full-chip design flow that consists of the following steps:

Behavioral synthesis together with floorplanning. Several scenarios can be explored, with a tradeoff
between the block sizes and shapes, the corresponding floorplan, and the resulting chip performance.
As part of floorplanning, the chip is broken up into hierarchical blocks, which include physical and timing
constraints sufficient to allow independent physical implementation. Top-level routing and pin
assignment on the blocks are also done to exchange signals between blocks.
Fast RTL-to-gate synthesis of the blocks. The focus is on finding the best logic structure, not on having a
fully sized and buffered netlist. It could be a grossly mapped netlist, since at this point it does not make
much sense to push timing optimization beyond a simple model, e.g., constant delay.

Logical and physical design of the blocks. This is the process described in this chapter. After some
placement refinement, enough interconnect information is known to allow resynthesis, remapping,
resizing, rebuffering, etc. After signoff on the physical prototype,


the physical implementation of a block within its abstracted chip context is then finalized using the same
refinement process.

Chip assembly. Block abstractions and glue logic are assembled using the same refinement process. Note
that several levels of block abstraction can be used (from black box to GDSII), which enables early chip-
level validation while the blocks are being designed. Final extraction and verification would follow.

To realize such a flow, some of the following questions must be answered:

What should drive a hierarchical design flow: floorplanning or synthesis?

How do we perform timing-driven floorplanning?

How do we redefine gate-level synthesis to facilitate physical optimization?

How do we derive block-level timing budgets?

How do we accurately describe the chip context during block implementation (parasitics, timing
exceptions, clock tree constraints)?

Must we define new clock tree design methodologies in such a flow?

What is the verification and test methodology, functional and physical, in such a flow?


Sebastian Offermann1 Robert Wille1 Gerhard W. Dueck2 Rolf Drechsler1 1Institute of Computer
Science, University of Bremen, 28359 Bremen, Germany 2Faculty of Computer Science, University of
New Brunswick, Fredericton, Canada {offerman,rwille,drechsle}@informatik.uni-bremen.de
gdueck@unb.ca

Abstract—In the past years, reversible logic has become an intensely studied research topic. This is mainly
motivated by its applications in the domain of low-power design and quantum computation. Since
reversible logic is subject to certain restrictions (e.g. fanout and feedback are not allowed), traditional
synthesis methods are not applicable and specific methods have been developed. In this paper, we focus
on the synthesis of multiplier circuits in reversible logic. Three methods are presented that address the
drawbacks of previous approaches, in particular the large number of circuit lines in the resulting
realizations as well as the poor scalability. Finally, we compare the results to circuits obtained by general
purpose synthesis approaches.

I. INTRODUCTION

The number of elements integrated in digital circuits grows exponentially, leading to enormous challenges in Computer Aided Design (CAD). Due to this
exponential growth, physical boundaries will be reached in the near future. Furthermore, the power
consumption of circuits becomes a major issue. As a
consequence, researchers expect that traditional technologies
like CMOS will reach their limits in the near future [1]. To face this, alternative computation technologies are
needed. This motivates research in the domain of reversible logic. Reversible logic realizes bijections, i.e.
one-to-one mappings of Boolean functions. The resulting reversibility allows promising applications, e.g.
in the domain of low-power design and quantum computation. In fact, it has been proven that zero
power dissipation is only possible if computations are performed in an invertible manner [2], [3].
Reversible circuits that are driven by their input signals only (and accordingly need no additional power
supplies) have already been physically implemented [4]. Besides that, the application in quantum
computation is triggered by the fact that every quantum operation inherently is reversible [5]. Quantum
computers make it possible to solve practically relevant problems (e.g. factorization) much faster than traditional
circuits [6], [7]. As a result, reversible logic has become an intensely studied research area. In particular,
synthesis aspects are of interest, since reversible circuits are subject to certain restrictions, e.g. fanout
and feedback are not allowed [5]. In the past, several general purpose synthesis approaches have been
introduced (e.g. [8], [9], [10], [11], [12]). The realization of reversible circuits for arithmetic functions is
of particular interest, since these functions occur naturally in circuit design. In the following, we focus on
the synthesis of multiplier circuits in reversible logic. First reversible realizations of multipliers have already
been introduced (see e.g. [13], [14], [15]). However, they often rely on very special types of reversible
gates (e.g. the TSG gate, the HNG gate, or the PFAG gate) and additionally have been proposed for very
small bit-widths only (in fact, only 4×4 multipliers have been introduced). Besides

that, multipliers can also be realized with the help of the (general purpose) synthesis approaches
mentioned above. But since multiplication is an irreversible function, circuits with a
significant number of additional circuit lines often result. Furthermore, these approaches do not scale very
well, in particular for the multiplication function. In this paper, we present three methods that (partially)
address these drawbacks. More precisely, an adjusted multiplication specification is introduced that
enables the synthesis of a multiplier with a significantly lower number of circuit lines. Even if this specification
still is only applicable for very small bit-widths, it provides an interesting insight into how to exploit
properties of a multiplier while synthesizing it as a reversible circuit. Additionally, two constructive
approaches for the synthesis of multipliers with very large bit-widths are proposed. While the first one is a
hierarchical method based on partial products, the second one is motivated by the divide-and-conquer
method of Karatsuba's algorithm [16]. All proposed methods apply the well-established Toffoli gate
library. The resulting circuits are evaluated with respect to the number of circuit lines, the number of gates,
quantum cost, and transistor cost, respectively. Furthermore, we compare the results with circuits
obtained by (general purpose) synthesis approaches. Overall, methods to generate reversible
realizations of practically relevant multipliers with large bit-widths are presented and evaluated. The
remainder of the paper is structured as follows. Section II introduces the basics of reversible logic. Afterwards,
the three methods to generate reversible multipliers are introduced in Section III. This also includes a
brief discussion of the advantages and disadvantages of the respective approaches. Finally,
experimental results are presented in Section IV and the paper is concluded in Section V.

II. PRELIMINARIES

To keep the paper self-contained, this section introduces the basics of reversible
functions and reversible circuits. For more details we refer to the respective publications.

Definition 1: A multiple-output function f : B^n → B^m is a reversible function iff (1) its number of inputs is equal to the
number of outputs (i.e. n = m) and (2) it maps each input pattern to a unique output pattern. In other
words, each reversible function is a bijection that permutes the set of input patterns. A function that is
not reversible is termed irreversible.

Quite often, (irreversible) multi-output Boolean functions should be represented by reversible circuits. This necessitates the irreversible function to be embedded into a
reversible one, which requires the addition of circuit lines leading to constant inputs (i.e. inputs that are
assigned a fixed value) and garbage outputs (i.e. outputs that are don't cares for all possible input
conditions). The minimal number of circuit lines to be added is determined by the number of
occurrences of the most frequent output pattern [17].

TABLE I
QUANTUM COST FOR TOFFOLI GATES

no. of control lines | quantum cost of a Toffoli gate
0 | 1
1 | 1
2 | 5
3 | 13
4 | 26, if at least 2 lines are unconnected; 29, otherwise
5 | 38, if at least 3 lines are unconnected; 52, if 1 or 2 lines are unconnected; 61, otherwise
6 | 50, if at least 4 lines are unconnected; 80, if 1, 2 or 3 lines are unconnected; 125, otherwise

To realize reversible functions, some restrictions must be considered, e.g. fanouts and feedback are not allowed [5]. This is reflected in the definition of
reversible circuits.

Definition 2: A reversible circuit G over inputs X = {x_i | 1 ≤ i ≤ n} is a cascade of
reversible gates g_i, i.e. G = T_{i=1}^{d} g_i, where d is the number of gates¹.

In the literature, different reversible gates have been studied. The universal multiple control Toffoli gate [18] is the most
commonly used gate.

Definition 3: Let X := {x_i | 1 ≤ i ≤ n} be the set of domain variables. A multiple
control Toffoli gate has the form g(C, t), where C = {x_{j_1}, …, x_{j_k}} ⊂ X is the set of control lines and t = x_l
with t ∉ C is the target line. The gate inverts the target line iff all control lines are assigned to 1, i.e. it
maps (T_{i=1}^{l-1} x_i, x_l, T_{i=l+1}^{n} x_i) to (T_{i=1}^{l-1} x_i, x_l ⊕ (x_{j_1} ∧ … ∧ x_{j_k}), T_{i=l+1}^{n} x_i). If no control lines are given (C is
empty), then the target line is always inverted, i.e. the input vector of the gate is mapped to (T_{i=1}^{l-1} x_i, x_l ⊕ 1, T_{i=l+1}^{n} x_i).

In the following, we refer to multiple control Toffoli gates for brevity as Toffoli gates.
Furthermore, a Toffoli gate with no control line (with k control lines) is also called a NOT gate (C^kNOT
gate). To determine the effort to realize a reversible circuit, the following cost models are applied
depending on the target technology:

Line count denotes the number of lines the circuit uses. In
particular for the application in the domain of quantum computing this is an important cost criterion,
since the number of circuit lines corresponds to the number of qubits, so far a very restricted resource.

Gate count denotes the number of gates the circuit consists of (i.e. d).

Quantum cost denotes the effort needed to transform a reversible circuit to a quantum circuit. Table I shows the quantum cost for
a selection of Toffoli gate configurations, as introduced in [19] and further optimized e.g. in [17].

¹For tuples, we are using the T-symbol, which is defined analogously to the sum symbol Σ: T_{i=0}^{0} x_i = x_0; T_{i=0}^{n+1} x_i = (T_{i=0}^{n} x_i, x_{n+1}).
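The gate semantics of Definition 3 and the bijection property of Definition 1 can be made concrete by simulating a Toffoli cascade on explicit bit patterns. The following sketch is illustrative only (the gate list is hypothetical and not taken from the paper):

```python
# A minimal sketch of multiple control Toffoli gate semantics (Definition 3):
# the target line is inverted iff all control lines are assigned 1.
# States are tuples of bits; a circuit is a cascade of gates (Definition 2).

from itertools import product

def toffoli(state, controls, target):
    """Apply a multiple control Toffoli gate g(C, t) to a bit tuple."""
    bits = list(state)
    if all(bits[c] == 1 for c in controls):   # empty C: the gate always fires (NOT)
        bits[target] ^= 1
    return tuple(bits)

def run_cascade(state, gates):
    """Apply a cascade g_1 ... g_d of Toffoli gates, each given as (C, t)."""
    for controls, target in gates:
        state = toffoli(state, controls, target)
    return state

# Any Toffoli cascade realizes a reversible function, i.e. a bijection on B^n
# (Definition 1): every input pattern maps to a unique output pattern.
gates = [([], 0), ([0], 1), ([0, 1], 2)]      # NOT, CNOT, C2NOT (hypothetical list)
images = {run_cascade(s, gates) for s in product([0, 1], repeat=3)}
assert len(images) == 2 ** 3                  # 8 distinct outputs: a bijection
```

The bijection check works for any gate list, since each Toffoli gate is self-inverse and cascades of bijections remain bijections.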

Fig. 1. Toffoli circuit

As can be seen, gates of larger size are considerably more expensive than gates of smaller size. The sum
of the quantum cost of all gates defines the quantum cost of the whole circuit.

Transistor cost denotes the effort needed to realize a reversible circuit in CMOS according to [20]. The transistor cost of
a Toffoli gate is 8·c, where c is the number of control lines.

Example 1: Fig. 1 shows a Toffoli circuit with 3 circuit lines, 6 gates, quantum cost of 10, and transistor cost of 56, respectively. The control lines are
thereby denoted by ●, while the target lines are denoted by ⊕. The annotated values illustrate the
computation performed by this circuit.

Finally, the following definition introduces the denotation of controlled functions.

Definition 4: Let f : B^n → B^n be a reversible function that is realized by the
reversible circuit G = T_{i=1}^{d} g_i. The controlled function f^c is described by the circuit G^c, which is
obtained by adding the circuit line c to the control set C of each gate g in the circuit G. In this way, e.g. a
controlled increaser (c +=) can be obtained from an increaser (+=) by adding the control bit c to every gate
of the circuit that represents this increaser.
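The cost metrics above can be sketched in a few lines. The gate list used below is hypothetical (Fig. 1's gates are not reproduced here), but it is the only cost profile consistent with Example 1's totals under Table I: 6 gates, quantum cost 10, and transistor cost 56 force 5 single-control gates plus one two-control gate.

```python
# Sketch of the cost metrics from Section II, applied to a hypothetical
# cascade with the same cost profile as Example 1. A gate is represented
# by its number of control lines c.

QUANTUM_COST = {0: 1, 1: 1, 2: 5, 3: 13}  # Table I, ignoring the
                                          # unconnected-line discounts

def quantum_cost(gates):
    """Quantum cost of a circuit: sum of the per-gate costs from Table I."""
    return sum(QUANTUM_COST[c] for c in gates)

def transistor_cost(gates):
    """Transistor cost of a Toffoli gate is 8*c per [20]; sum over all gates."""
    return sum(8 * c for c in gates)

# 5 CNOT gates (1 control each) and one C2NOT gate (2 controls):
gates = [1, 1, 1, 1, 1, 2]
assert len(gates) == 6               # gate count
assert quantum_cost(gates) == 10     # 5*1 + 5
assert transistor_cost(gates) == 56  # 8*(5*1 + 2)
```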

III. REALIZING MULTIPLICATION IN REVERSIBLE LOGIC

In this section, we introduce methods to efficiently realize multiplication in reversible logic. In a straightforward way, a multiplier can be synthesized by
specifying the underlying function in terms of a truth table, binary decision diagram, exclusive-sum-of-products, or similar descriptions, respectively, and passing this to an appropriate synthesis approach (e.g.
[9], [11], [12]). Since multiplication is an irreversible function, circuits with a significant
number of additional circuit lines often result (even if in case of [9] circuits with the minimal number of lines can
be achieved). But for the particular case of multiplication, it is possible to reduce the number of circuit
lines if an adjusted function specification is used. The corresponding approach is introduced in the first part
of this section. However, even then multipliers can be realized only for very small bit-widths. Thus, two
further approaches are introduced that enable the synthesis of multipliers with scalable bit-widths. Finally,
the newly proposed realizations are discussed.
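The line-count reduction mentioned above rests on a simple counting argument that can be checked numerically; a short sketch in plain Python (not reversible-circuit code), using nothing beyond 3-bit integer multiplication:

```python
# Numerical check of the line-count argument: the minimal number of garbage
# lines is ceil(log2(mu)), where mu is the number of occurrences of the most
# frequent output pattern of the (irreversible) function.

from collections import Counter
from math import ceil, log2

n = 3
products = Counter(a * b for a in range(2 ** n) for b in range(2 ** n))

# Zero is the most frequent product: mu = 2^(n+1) - 1 = 15 for n = 3,
# so conventionally ceil(log2(15)) = 4 garbage lines are needed.
mu = max(products.values())
assert products[0] == mu == 2 ** (n + 1) - 1 == 15
assert ceil(log2(mu)) == 4

# With a separate zero-indicator output, only the non-zero products matter;
# the most frequent remaining ones occur just 4 times, so only
# ceil(log2(4)) = 2 garbage lines remain.
mu_ind = max(c for p, c in products.items() if p != 0)
assert mu_ind == 4 and ceil(log2(mu_ind)) == 2
```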

A. Multiplier with Sub-minimal Circuit Lines

As mentioned above, usually additional circuit lines are needed to embed an irreversible function (like multiplication) into a reversible one. More precisely, lines
must be added so that at least ⌈log2(µ)⌉ garbage outputs are available, where µ is the maximum
number of times an output pattern is repeated in the truth table (see [17])². Thus, keeping µ as small
as possible helps reducing the number of circuit lines. In the following, an adjusted specification of the
multiplication function is presented that exploits this observation.

TABLE II
RESULTS OF A 3-BIT MULTIPLICATION

product (result) | µ | ⌈log2 µ⌉ | factors
0 (zero) | 15 | 4 | 0·k, k·0
6 | 4 | 2 | 1·6, 2·3, 3·2, 6·1
12 | 4 | 2 | 2·6, 3·4, 4·3, 6·2
4 | 3 | 2 | 1·4, 2·2, 4·1
2 | 2 | 1 | 1·2, 2·1
3, 5, 7 | similar to 2
8 | 2 | 1 | 2·4, 4·2
10, 14, 15, 18, 20, 21, 24, 28, 30, 35, 42 | similar to 8
1 | 1 | 0 | 1·1
9 | 1 | 0 | 3·3
16, 25, 36, 49 | similar to 9

The most frequent output pattern in a multiplier is the binary encoding of zero. The product is zero iff (at least) one factor is zero. Thus, an n-bit multiplication produces µ = 2·2^n − 1 = 2^(n+1) − 1 zeros. Hence, ⌈log2(2^(n+1) − 1)⌉ = n + 1 lines with garbage outputs are required. All other
possible products (i.e. output patterns) are less frequent. The main idea of this approach is to reduce the
occurrences of the zero output by means of an additional indicator output. As a result, the value of µ is
significantly decreased and the multiplier can be realized with fewer circuit lines. More precisely, an
indicator output is added, which becomes assigned to 1 iff the product of the multiplication is zero. In
contrast, all primary outputs can be arbitrarily assigned in this case. In doing so, the result zero is
obtained by measuring the indicator output, while all other results still can be obtained
from the remaining primary outputs. Thus, the binary encoding of the zero is no longer applied to obtain the
minimal number of garbage lines. Instead, the next most frequent output pattern is used for that.
Therewith, the number of garbage lines can be asymptotically reduced by one half.

Example 2: Consider a 3-bit multiplication. The possible products and their occurrences (i.e. µ) are depicted in Table II. The
respective output patterns (ordered in terms of a truth table) are shown in Table III. Since zero is the
most frequent product (in total occurring 15 times), at least ⌈log2(15)⌉ = 4 garbage lines are
conventionally needed. In contrast, encoding the zero output by a separate indicator output, 6 and 12
become the most frequent products, each with only 4 occurrences. Hence, only ⌈log2(4)⌉ = 2 additional
garbage outputs are needed to realize this function. As a result, this encoding only requires 3 additional
outputs (the indicator output and the two garbage outputs) in comparison to the 4 additional outputs
required by the conventional method. Column "sub-minimal" in Table III shows a possible embedding
exploiting this encoding. Having this adjusted specification, any truth table-based synthesis approach
(e.g. [9]) can be applied to realize the multiplier. However, since this approach still relies on a

²Note that this holds not only for truth tables but for any other function description as well.

TABLE III
ENCODING OF A 3-BIT MULTIPLICATION

a b | result (decimal) | conventional | sub-minimal
000 000 | 0 | 000000 ---- | 000000 1-
000 001 | 0 | 000000 ---- | 000001 1-
000 010 | 0 | 000000 ---- | 000010 1-
... | ... | ... | ...
001 000 | 0 | 000000 ---- | 001000 1-
001 001 | 1 | 000001 ---- | 000001 0-
001 010 | 2 | 000010 ---- | 000010 0-
... | ... | ... | ...
011 111 | 21 | 010101 ---- | 010101 0-
100 000 | 0 | 000000 ---- | 100000 1-
... | ... | ... | ...
111 111 | 49 | 110001 ---- | 110001 0-

truth table description, it is only applicable for multipliers of very small bit-widths. In the following, two
scalable synthesis methods are proposed.

B. Hierarchical Method

The realization of multipliers by a hierarchical method using controlled increasers is described in this section. A common way to multiply
hierarchical method using controlled increasers is described in this section. A common way to multiply
two factors a =Pn1 i=0 ai 2iand b =Pn1 i=0 bi2i is to compute the partial products and add them
together, i.e. ab =Pn1 i=0ai Pn1 j=0 bj 2j2i. Thatis, the respective bit of bj multiplied by the
respective power of 2 is added to the product, iff the respective bit ai is assigned to 1. This can easily be
realized by controlled functions (or more precisely controlled increasers as sketched at the end of
Section II). Thus, using the hierarchical method an n-bit multiplier is realized by n single n-bit controlled
increasers. The respective algorithm for an n-bit multiplication with the factors a = Tn1 i=0 ai and b =
Tn1 i=0 bi as well as the product c =T2n1 i=0 ci is depicted in Fig. 2. Here, the ith controlled increaser
is controlled by ai. It conditionally adds the value of b to Tn1+i j=i ci, i.e. to the n bits of the product c
beginning from the i-th bit. The lower bits do not have to be considered, since the j-th bit of the product
is only modied until the j-th controlled increaser. Beyond that, the value remains unchanged.
Therefore, the controlled increasers can sequentially be realized by an implicit bit-shift after every
increase and without concern for the lower bits. Also, the i-th increaser writes the carry-over in the n + i-
th position of the product, which has not been used so far and, thus, holds the value 0. Example 3:
Consider a 3-bit hierarchical multiplication with the factors a = a2a1a0 and b = b2b1b0 as well as the
product c = c5c4c3c2c1c0. In total, three controlled increasers

1 for i = 0 to n1 2 { 3 Tn+i j=i ci

ai += Tn1 i=0 bi

4}

Fig. 2. Hierarchical method

are needed to realize this multiplication. The first controlled increaser is controlled by a_0. It conditionally
adds the second factor b to the three least significant bits of the product c, i.e. to c_2 c_1 c_0. It also writes
the carry-over into c_3. The second controlled increaser is controlled by a_1. It conditionally adds b to
c_3 c_2 c_1. Again, the carry-over is written into the next product bit c_4. Finally, the third controlled increaser is
controlled by a_2. It conditionally adds the second factor b to c_4 c_3 c_2. The carry-over is written into the
most significant bit c_5 of the product.

C. Karatsuba Method

This section describes the realization of
multipliers in reversible logic based on the divide-and-conquer method of Karatsuba's algorithm [16].
The idea behind the Karatsuba algorithm is to realize the multiplication by multiplying factors of
smaller bit-width and additionally performing some less expensive operations. Consider an n-bit
multiplication with n = 2k. Both factors (e.g. a = Σ_{i=0}^{2k-1} a_i·2^i) are partitioned into an upper half (a_h
:= Σ_{i=k}^{2k-1} a_i·2^{i-k}) and a lower half (a_l := Σ_{i=0}^{k-1} a_i·2^i) such that a = a_h·2^k + a_l. With this representation,
the following equations are deducible:

a·b = (a_h·2^k + a_l)·(b_h·2^k + b_l)
    = a_h·b_h·2^{2k} + (a_h·b_l + a_l·b_h)·2^k + a_l·b_l
    = a_h·b_h·2^{2k} + (a_h·b_l + a_l·b_h + a_h·b_h + a_l·b_l − a_h·b_h − a_l·b_l)·2^k + a_l·b_l
    = a_h·b_h·2^{2k} + (a_h·(b_h + b_l) + a_l·(b_h + b_l) − a_h·b_h − a_l·b_l)·2^k + a_l·b_l
    = a_h·b_h·2^{2k} + ((a_h + a_l)·(b_h + b_l) − a_h·b_h − a_l·b_l)·2^k + a_l·b_l

These equations show that a (2k)-bit multiplication can be realized by
three k-bit multiplications, some additions, subtractions, and bit-shifts, respectively. In reversible logic,
additionally a number of circuit lines is required (as already
shown in [21]). However, the latter can be reduced by choosing appropriate targets for the intermediate
results as well as good orderings of the respective operations. Fig. 3 shows an optimized approach to
generate reversible multipliers based on this observation (which requires fewer additional circuit lines than
the method of [21]). The general idea is illustrated by the following example. Since the Karatsuba
method is not applicable for very small bit-widths, the variable turningPoint denotes the bit-width below
which the hierarchical multiplication is used. Beyond the turningPoint, the Karatsuba method is used for
multiplication.

1   if (n < turningPoint)
2       c = MULT_H(a, b)
3
4   if (n % 2 = 1)
5       init a_n, b_n, c_{2n}, c_{2n+1} with 0
6
7   k := ⌊n/2⌋
8   init d, e (k+1 bits), h (2k+2 bits) with 0
9   c_l = MULT_K(a_l, b_l)
10  c_h = MULT_K(a_h, b_h)
11  d = a_h + a_l
12  e = b_h + b_l
13  h = MULT_K(d, e)
14  h -= c_h
15  h -= c_l
16  (c_{3k+3}, …, c_k) += h

Fig. 3. Karatsuba method (a_h/a_l denote the upper/lower halves of a, analogously for b and c)

Example 4: Consider an 8-bit Karatsuba multiplication and turning point t = 6. Since 8 is
greater than t and even, the two conditionals (lines 1 and 4) do not hold and k = 4 is computed (line 7).
Then, the new variables d, e, h are initialized as shown in line 8. This leads to 4·4 + 4 = 20 garbage lines. Then,
the two smaller multiplications c_l = c_7 c_6 c_5 c_4 c_3 c_2 c_1 c_0 = a_3 a_2 a_1 a_0 · b_3 b_2 b_1 b_0 = a_l·b_l (line 9) and c_h = a_h·b_h
(line 10) are performed, respectively. Since these are 4-bit multiplications and the turning point in this
example is t = 6, these multiplications are realized using the hierarchical approach described in the
previous section. Afterwards, the result of these multiplications is directly assigned to the bits of the
product. Furthermore, this result will be used later to modify the product of the sums that are
computed next. These sums are d = d_4 d_3 d_2 d_1 d_0 = a_7 a_6 a_5 a_4 + a_3 a_2 a_1 a_0 = a_h + a_l (line 11)

and e = b_h + b_l (line 12). They can be performed by copying the first summand to the (still uninitialized)
target and then increasing it by the second summand. The results of these two sums are multiplied to
get the third sub-product h = d·e (line 13). Again, since the turning point is greater than the size of the
factors, this multiplication is performed by the hierarchical approach. After that, this third sub-product
must be modified by subtracting the two earlier computed products from lines 9 and 10, as can be seen
in lines 14 and 15. Finally, the result is obtained by adding this value to the product
using an implicit bit-shift of k bits (line 16).
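The arithmetic underlying both constructive methods can be checked at the integer level. The sketch below is plain Python (not reversible-circuit code) and uses hypothetical function names; it mirrors the control structure of Figs. 2 and 3, including the turning point and the odd-width padding:

```python
# Integer-level sketch of the two constructive methods of Sections III-B/III-C.

def mult_hierarchical(a, b, n):
    """Schoolbook method (Fig. 2): n controlled additions of b, shifted by i."""
    c = 0
    for i in range(n):
        if (a >> i) & 1:          # controlled increaser, control bit a_i
            c += b << i           # carry lands in bit n+i, which is still 0
    return c

def mult_karatsuba(a, b, n, turning_point=8):
    """Karatsuba recursion (Fig. 3): three half-width multiplications."""
    if n < turning_point:
        return mult_hierarchical(a, b, n)
    if n % 2 == 1:                # pad odd widths with a leading 0 bit
        n += 1
    k = n // 2
    a_h, a_l = a >> k, a & ((1 << k) - 1)
    b_h, b_l = b >> k, b & ((1 << k) - 1)
    c_l = mult_karatsuba(a_l, b_l, k, turning_point)
    c_h = mult_karatsuba(a_h, b_h, k, turning_point)
    # (a_h + a_l)(b_h + b_l) needs k+1 bits per operand (lines 11-13 of Fig. 3)
    h = mult_karatsuba(a_h + a_l, b_h + b_l, k + 1, turning_point) - c_h - c_l
    return (c_h << (2 * k)) + (h << k) + c_l

assert mult_hierarchical(5, 7, 3) == 35
assert all(mult_karatsuba(a, b, 16) == a * b
           for a, b in [(255, 255), (40000, 12345), (65535, 65535)])
```

In a reversible realization, the `if` and `+=` become controlled increasers and the subtractions become in-place decrements on the dedicated h lines; the integer version only validates that the decomposition reproduces a·b.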

D. Discussion

As reviewed in Section II, there are different cost metrics to judge the quality of reversible
circuits, namely the number of circuit lines, the number of gates, the quantum cost, and the transistor
cost, respectively. In this section, the costs of the circuits for an n-bit multiplier obtained by the
proposed methods are briefly discussed. Experimental results are given afterwards in Section IV.

The sub-minimal approach significantly reduces the minimal number of garbage lines, using approximately
n/2 garbage lines (instead of the n+1 garbage lines that the conventional approach minimally requires). The
number of gates, the quantum cost, and the transistor costs depend on the applied synthesis method.

For the hierarchical approach, an n-bit adder without garbage lines is used, which needs (5n−5) CNOT
and (2n−1) C2NOT gates [22]. Accordingly, a controlled adder consists of (5n−5) C2NOT gates and
(2n−1) C3NOT gates. The hierarchical method realizes a multiplier with n controlled adders. The first
controlled addition can thereby be replaced with a controlled duplication (since the product initially is
assigned to 0 and, thus, nothing has to be added). Consequently, this implementation of the hierarchical
method leads to circuits with (n−1) controlled adders and one controlled duplication (which is done
with n C2NOT gates). In total, circuits result including (5n^2 − 9n + 5) C2NOT gates as well as (2n^2 − 3n + 1)
C3NOT gates. Therefore, these circuits have quantum costs of 51n^2 − 84n + 38 and transistor costs of
128n^2 − 216n + 104, respectively. Furthermore, this approach uses separate lines for the primary inputs (n lines for
each factor) and primary outputs (2n lines for the product), i.e. in total 4n lines.

The Karatsuba multiplication needs (4⌈n/2⌉ + 5) garbage outputs per recursion step. Thus, if t is chosen as turning point
and T_t represents the number of circuit lines needed for a t-bit hierarchical multiplier, circuits with
approximately 3^{log2(n)−log2(t)}·T_t + Σ_{i=0}^{log2(n)−log2(t)−1} 3^i·(4⌈n/2^{i+1}⌉ + 5) ≈ (n/t)^{log2(3)}·(T_t + 4t) + 5·(n/t)^{log2(3)} − 4n − 5
garbage outputs are required (in addition to the 2n lines for the primary inputs). The number of
gates, the quantum cost, and the transistor costs are in O(n^{log2(3)}).

In summary, the Karatsuba approach asymptotically has lower quantum cost and transistor costs than the hierarchical approach. But
the hierarchical method requires fewer circuit lines. The sub-minimal approach leads to circuits with the
lowest line count, but is limited by the truth table-based specification.

IV. EXPERIMENTS

The proposed
synthesis methods for multipliers have been implemented in C++. In this section, we provide
experimental results generated by these methods and compare them to realizations obtained by general
purpose approaches (namely the transformation-based [9], the ESOP-based [11], and the BDD-based
[12] approach, respectively). For the sub-minimal specification, we used the transformation-based
approach [9] to synthesize the circuits. The turning point for the Karatsuba approach was set to t = 8.
The timeout (denoted by TO) was set to 1000 CPU seconds. The results are presented in Table IV.
Besides the respective bit-width and the resulting number of primary inputs (PI), the line count (LC), the
gate count (GC), the quantum cost (QC), and the transistor cost (TC) are listed for each method,
respectively. Additionally, the time (in CPU seconds) required to generate the results is listed in column
Time. First of all, the results confirm that, using the adjusted specification from Section III-A, multipliers
with a lower number of circuit lines can be generated. But due to the truth table-based description, the
method works for very small bit-widths only. A similar behavior can be observed if general purpose
synthesis approaches based on ESOPs and BDDs are applied. Indeed, somewhat larger multipliers can be
synthesized, but practically relevant bit-widths (e.g. a 32-bit or 64-bit multiplier) cannot be generated.
This can be explained by the fact that, in particular for multiplication, no efficient representation as
ESOP or as BDD, respectively, exists. In contrast, the hierarchical method and the Karatsuba approach
enable the synthesis of reversible multipliers of nearly arbitrary sizes. As discussed in the last section,
the Karatsuba method requires more circuit lines, but leads to smaller realizations with respect to
quantum cost and transistor cost (in particular for large bit-widths). For a better overview, the results of
both approaches are additionally illustrated in Fig. 4.

V. CONCLUSIONS

In this paper, we introduced
three methods for multiplier synthesis that particularly address the drawbacks of previous approaches (e.g., the large number of circuit lines in the resulting realizations as well as the poor scalability). We showed that multipliers with a lower number of circuit lines can be obtained by using an adjusted specification of the underlying function. Besides that, two constructive approaches for the synthesis of multipliers with very large bit-widths were proposed. Experiments confirmed that, using these methods, multipliers with large bit-widths can efficiently be synthesized, while previous approaches as well as general-purpose synthesis methods do not scale well on multiplication.

ACKNOWLEDGMENT

This work was supported by the German Research Foundation (DFG) (DR 287/20-1) and the German Academic Research Foundation (DAAD).

REFERENCES

[1] V. V.
Zhirnov, R. K. Cavin, J. A. Hutchby, and G. I. Bourianoff, "Limits to binary logic switch scaling: a gedanken model," Proc. of the IEEE, vol. 91, no. 11, pp. 1934-1939, 2003.
[2] R. Landauer, "Irreversibility and heat generation in the computing process," IBM J. Res. Dev., vol. 5, p. 183, 1961.
[3] C. H. Bennett, "Logical reversibility of computation," IBM J. Res. Dev., vol. 17, no. 6, pp. 525-532, 1973.
[4] B. Desoete and A. D. Vos, "A reversible carry-look-ahead adder using control gates," INTEGRATION, the VLSI Jour., vol. 33, no. 1-2, pp. 89-104, 2002.
[5] M. Nielsen and I. Chuang, Quantum Computation and Quantum Information. Cambridge Univ. Press, 2000.
[6] P. W. Shor, "Algorithms for quantum computation: discrete logarithms and factoring," in Foundations of Computer Science, 1994, pp. 124-134.
[7] L. M. K. Vandersypen, M. Steffen, G. Breyta, C. S. Yannoni, M. H. Sherwood, and I. L. Chuang, "Experimental realization of Shor's quantum factoring algorithm using nuclear magnetic resonance," Nature, vol. 414, p. 883, 2001.
[8] V. V. Shende, A. K. Prasad, I. L. Markov, and J. P. Hayes, "Synthesis of reversible logic circuits," IEEE Trans. on CAD, vol. 22, no. 6, pp. 710-722, 2003.
[9] D. M. Miller, D. Maslov, and G. W. Dueck, "A transformation based algorithm for reversible logic synthesis," in Design Automation Conf., 2003, pp. 318-323.
[10] P. Gupta, A. Agrawal, and N. K. Jha, "An algorithm for synthesis of reversible logic circuits," IEEE Trans. on CAD, vol. 25, no. 11, pp. 2317-2330, 2006.
[11] K. Fazel, M. A. Thornton, and J. E. Rice, "ESOP-based Toffoli gate cascade generation," in PACRIM, 2007, pp. 206-209.
[12] R. Wille and R. Drechsler, "BDD-based synthesis of reversible logic for large functions," in Design Automation Conf., 2009, pp. 270-275.
[13] H. Thapliyal and M. B. Srinivas, "Novel reversible multiplier architecture using reversible TSG gate," in International Conference on Computer Systems and Applications, 2006, pp. 100-103.
[14] M. Haghparast, S. Jassbi, K. Navi, and O. Hashemipour, "Design of a novel reversible multiplier circuit using HNG gate in nanotechnology," World Applied Sciences Journal, vol. 3, no. 6, pp. 974-978, 2008.
[15] M. Islam, M. Rahman, Z. Begum, and M. Hafiz, "Low cost quantum realization of reversible multiplier circuit," Information Technology Journal, vol. 8, no. 2, pp. 208-213, 2009.
[16] A. Karatsuba and Y. Ofman, "Multiplication of many-digital numbers by automatic computers," Doklady Akad. Nauk SSSR, vol. 145, 1963.
[17] D. Maslov and G. W. Dueck, "Reversible cascades with minimal garbage," IEEE Trans. on CAD, vol. 23, no. 11, pp. 1497-1509, 2004.
[18] T. Toffoli, "Reversible computing," in Automata, Languages and Programming, J. W. de Bakker and J. van Leeuwen, Eds. Springer, 1980, p. 632; also Technical Memo MIT/LCS/TM-151, MIT Lab. for Comput. Sci.
[19] A. Barenco, C. H. Bennett, R. Cleve, D. DiVincenzo, N. Margolus, P. Shor, T. Sleator, J. Smolin, and H. Weinfurter, "Elementary gates for quantum computation," Physical Review A, vol. 52, pp. 3457-3467, 1995.
[20] M. K. Thomsen and R. Glück, "Optimized reversible binary-coded decimal adders," J. of Systems Architecture, vol. 54, pp. 697-706, 2008.
[21] L. A. B. Kowada, R. Portugal, and C. M. H. Figueiredo, "Reversible Karatsuba's algorithm," Journal of Universal Computer Science, vol. 12, no. 5, pp. 499-511, 2006.
[22] Y. Takahashi, S. Tani, and N. Kunihiro, "Quantum addition circuits and unbounded fan-out," in Asian Conference on Quantum Information Science, 2009.

TABLE IV
EXPERIMENTAL RESULTS

                  Sub-minimal (+ transformation-based [9])    Minimal (+ transformation-based [9])
Bit-width  PI     LC    GC      QC      TC      Time          LC   GC    QC     TC     Time
1          2      3     3       11      32      <0.01         4    2     6      24     0.01
2          4      6     158     1126    2184    0.01          7    344   3294   5032   0.02
3          6      9     2323    24904   32712   0.36          TO
4          8      TO                                          TO

                  ESOP-based [11]                             BDD-based [12]
Bit-width  PI     LC    GC      QC      TC      Time          LC       GC       QC       TC        Time
1          2      4     1       5       16      0.01          8        5        9        32        0.01
2          4      8     6       72      128     0.01          17       20       44       160       0.01
3          6      12    36      591     720     0.02          29       49       117      448       0.01
4          8      16    169     3693    3744    0.06          47       103      279      1064      0.01
8          16     TO                                          609      2798     8842     33840     0.06
16         32     TO                                          531001   2806841  9225345  33932664  957.26
32         64     TO                                          TO

                  Hierarchical                                          Karatsuba
Bit-width  PI     LC     GC       QC        TC         Time     LC      GC       QC        TC        Time
1          2      4      1        5         16         <0.01    4       1        5         16        <0.01
2          4      8      10       74        184        <0.01    8       10       74        184       <0.01
3          6      12     33       245       608        <0.01    12      33       245       608       <0.01
4          8      16     70       518       1288       <0.01    16      70       518       1288      <0.01
8          16     32     358      2630      6568       0.01     54      517      2437      7032      <0.01
16         32     64     1606     11750     29416      0.01     176     2304     9696      29352     <0.01
32         64     128    6790     49574     124264     0.01     554     8492     34000     105296    0.01
64         128    256    27910    203558    510568     0.01     1712    28710    111966    351096    0.04
128        256    512    113158   824870    2069608    0.27     5234    92672    355972    1124432   0.12
256        512    1024   455686   3320870   8333416    0.61     15896   291174   1108206   3516312   0.38
512        1024   2048   1828870  13326374  33443944   1.34     48074   899912   3405340   10835696  0.98
1024       2048   4096   7327750  53391398  133996648  5.94     144992  2752590  10377606  33081336  1.43

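The TO entries for the BDD-based approach illustrate a classical fact: the middle output bit of multiplication admits no small BDD (Bryant proved the size is exponential for any variable order). A small, illustrative experiment (the helper below is ours, not from the paper) counts the distinct residual functions of y obtained after fixing x; for a BDD ordered with all x-bits first, this count equals the width at the x/y boundary:

```python
def middle_bit_cofactors(n):
    """Number of distinct residual functions y -> bit_{n-1}(x*y) after
    fixing the n-bit operand x.  For a BDD reading all x-bits before
    the y-bits, this equals the BDD width at the x/y cut."""
    bit = n - 1                               # a middle output bit
    return len({
        tuple(((x * y) >> bit) & 1 for y in range(1 << n))
        for x in range(1 << n)
    })
```

Already for tiny operands the count grows quickly with n, which is consistent with the BDD-based rows in Table IV timing out at 32 bits.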
Fig. 4. Comparison between the hierarchical method and the Karatsuba method: (a) line count (LC), (b) gate count (GC), (c) quantum cost (QC), (d) transistor cost (TC).
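The widening gap in the quantum-cost and transistor-cost panels of Fig. 4 is the behavior predicted by the Karatsuba cost recurrence: three half-size products plus a linear combination overhead. A rough sketch, where the quadratic base cost and the linear overhead constant are hypothetical stand-ins rather than the paper's exact cost model:

```python
def karatsuba_cost(n, t=8):
    """Karatsuba-style cost recurrence: three half-size products plus
    a linear combination overhead; quadratic cost below the turning
    point t.  Constants are illustrative, not the paper's cost model."""
    if n <= t:
        return n * n              # assumed quadratic-cost base multiplier
    return 3 * karatsuba_cost(n // 2, t) + 4 * n

# The ratio to a purely quadratic-cost multiplier shrinks as n grows,
# mirroring the crossover between the two methods in Table IV / Fig. 4.
for n in (64, 256, 1024):
    print(n, karatsuba_cost(n) / (n * n))
```

Solving the recurrence gives Theta(n^(log2 3)) ~ Theta(n^1.585), which is why the Karatsuba realizations overtake the quadratic hierarchical ones at large bit-widths despite their higher line count.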
