This tutorial introduces partitioning with applications to VLSI circuit designs. The problem formulations include two-way, multiway, and multi-level partitioning, partitioning with replication, and performance driven partitioning. We depict the models of multiple pin nets for the partitioning processes. To derive optimum solutions, we describe the branch and bound method and, for a special case of circuits, the dynamic programming method. We also explain several heuristics, including group migration algorithms, network flow approaches, programming methods, Lagrange multiplier methods, and clustering methods. We conclude the tutorial with research directions.
176 S.-J. CHEN AND C.-K. CHENG
partitioning methodology can localize the modifications and reduce the complexity.

Furthermore, a good partitioning tool can decrease the production cost and improve the system performance. With the advance of fabrication technologies, the cost of a transistor drops while the cost of input/output pads remains fairly constant. Consequently, the size of the interface between partitions, e.g., between chips, determines a significant portion of the manufacturing expenses, and the quality of the partitioning has a strong effect on production cost. Furthermore, in submicron designs, interconnection delays tend to dominate gate delays [8]; therefore system performance is greatly influenced by the partitions.

Partitioning has been applied to solve various aspects of VLSI design problems [5, 36]:

Physical packaging  Partitioning decomposes the system in order to satisfy the physical packaging constraints. The partitioning conforms to a physical hierarchy ranging from cabinets, cases, boards, and chips to modular blocks.

Divide and conquer strategy  Partitioning is used to tackle the design complexity with a divide and conquer strategy [21]. This strategy is adopted to decompose the project between team members, to construct a logic hierarchy for logic synthesis, to transform the netlist into a physical hierarchy for floorplanning, to allocate cells into regions for placement and RLC extraction, and to manipulate hierarchies between logic and layout for simulation.

System emulation and rapid prototyping  One approach for system emulation and prototyping is to construct the hardware with field programmable gate arrays. Usually, the capacity of these field programmable gate arrays is smaller than current VLSI designs. Thus, these prototyping machines are composed of a hierarchical structure of field programmable gate arrays. A partitioning tool is needed to map the netlist into the hardware [110].

Hardware and software codesign  For hardware and software codesign, partitioning is used to decompose the designs into hardware and software.

Management of design reuse  For huge designs, especially system-on-a-chip, we have to manage design reuse. Partitioning can identify clusters of the netlist and construct functional modules out of the clusters.

While partitioning is a tool required to manage huge systems in many fields, such as efficient storage of large databases on disks and data mining, in this tutorial we focus our efforts on partitioning with applications to VLSI circuit designs. In the next section, we describe the notations for the tutorial. In Section 3, the formulations of the partitioning problems are stated. Section 4 covers the models for multiple pin nets. Section 5 depicts the partitioning algorithms. The tutorial is concluded with research directions.

2. PRELIMINARIES

In this section, we establish the notations used and formulate the partitioning problems addressed in our approaches. A circuit is represented by a hypergraph, $H(V, E)$, where the vertex set $V = \{v_i \mid i = 1, 2, \ldots, n\}$ denotes the set of modules and the hyperedge set $E = \{e_j \mid j = 1, 2, \ldots, m\}$ denotes the set of nets. Each net $e_j$ is a subset of $V$ with cardinality $|e_j| \ge 2$. The modules in $e_j$ are called the pins of $e_j$.

The hypergraph representation for a circuit with 9 modules and 6 signal nets is shown in Figure 1, where nets $e_1$, $e_3$ and $e_5$ are two-pin nets, net $e_6$ is a three-pin net, and nets $e_2$ and $e_4$ are four-pin nets.

When the circuit has only two pin nets, we can simplify the representation to a graph $G(V, E)$. A net connecting modules $v_i$ and $v_j$ is represented by $e_{ij}$ with a connectivity $c_{ij}$. We set $c_{ij} = 0$ if there is no net connecting modules $v_i$ and $v_j$. We shall show later that for certain formulations we replace multiple pin nets with models of two pin nets. The replacement is performed when the partitioning algorithm is devised for graph models.
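As an illustration of the notation in Section 2, the following Python sketch stores a hypergraph as a list of pin sets and derives the connectivity $c_{ij}$ for the two-pin special case. The function names and the example netlist are hypothetical. The exact nets of Figure 1 are not reproduced in this text, so the example only mimics its pin-count profile (three two-pin nets, one three-pin net, and two four-pin nets), and summing parallel two-pin nets into one $c_{ij}$ is an assumption of the sketch.

```python
from collections import defaultdict

def make_hypergraph(num_modules, nets):
    """Return the nets as frozensets after checking |e_j| >= 2 for every net."""
    checked = []
    for net in nets:
        pins = frozenset(net)
        if len(pins) < 2:
            raise ValueError("each net needs at least two pins")
        if not all(0 <= v < num_modules for v in pins):
            raise ValueError("pin index outside the module set")
        checked.append(pins)
    return checked

def is_graph(nets):
    """True when every net has exactly two pins, so G(V, E) suffices."""
    return all(len(e) == 2 for e in nets)

def connectivity(nets, conn=None):
    """c_ij for a two-pin circuit; pairs without a net read as 0.

    Assumption: parallel nets between the same pair add up.
    """
    if not is_graph(nets):
        raise ValueError("connectivity c_ij is defined here for two-pin nets only")
    conn = conn or [1] * len(nets)
    c = defaultdict(int)
    for e, w in zip(nets, conn):
        i, j = sorted(e)
        c[(i, j)] += w
    return c

# Hypothetical netlist with the pin-count profile described for Figure 1.
example_nets = make_hypergraph(
    9, [{0, 1}, {1, 2, 3, 4}, {4, 5}, {5, 6, 7, 8}, {7, 8}, {2, 6, 8}]
)
pin_counts = [len(e) for e in example_nets]
```

Storing each net as a set of module indices keeps the pin test `v in e` cheap, which the cut definitions later in this section rely on.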
VLSI PARTITIONING 177
FIGURE 1 Hypergraph representation of a circuit with 9 modules and 6 signal nets.
(i) Module Size and Net Connectivity  Each module $v_i$ is attached with a size $s_i$ in $\mathbb{R}^+$, the positive real numbers. We define $S(V_j) = \sum_{v_i \in V_j} s_i$ to be the size of a partition $V_j$. Each net $e_i$ is attached with a connectivity $c_i$ in $\mathbb{R}^+$. By default, $c_i = 1$. For a bus of multiple signal lines, we can represent the bus with a net $e_i$ of connectivity $c_i$ equal to the number of lines. We can also assign higher weights to some important nets; this enables us to keep the modules of these nets in the same partition.

In this tutorial, we assume that circuits are represented as hypergraphs except when stated otherwise; hence, the terms circuit, netlist, and hypergraph are used interchangeably throughout the tutorial.

(ii) Partitions and Cuts  The set of hyperedges connecting any two-way partition $(V_1, V_2)$ of two disjoint vertex sets $V_1$ and $V_2$ is denoted by a cut

$$E(V_1, V_2) = \{e_i \in E \mid 0 < |e_i \cap V_1| \text{ and } 0 < |e_i \cap V_2|\},$$

i.e., $e_i \in E(V_1, V_2)$ if there exist some pins of $e_i$ in $V_1$ and some different pins of $e_i$ in $V_2$. We define $C(V_1, V_2) = \sum_{e_i \in E(V_1, V_2)} c_i$ to be the cut count of the partition $(V_1, V_2)$.

For a multiway partition $(V_1, V_2, \ldots, V_k)$ where $k > 2$, the cut is $E(V_1, V_2, \ldots, V_k) = \{e_j \in E \mid \exists\, i \text{ s.t. } 0 < |e_j \cap V_i| < |e_j|\}$. For each subset $V_i$, we denote its external cut set $E(V_i) = \{e_j \in E \mid 0 < |e_j \cap V_i| < |e_j|\}$. We denote its adjacent net set to be the nets with some pin contained in $V_i$, i.e., $I(V_i) = \{e_j \mid |e_j \cap V_i| > 0\}$.

(iii) Replication Cuts and Directed Cuts  For replication cuts and performance driven partitioning, the direction of the nets makes a difference in the process. We characterize the pins of each net into two types: source and sink. A directed net $e_i$ is denoted by $(a_i, b_i)$ where $a_i \subseteq V$ are the source pins of the net and $b_i \subseteq V$ are the sink pins of the net. We assume that $|a_i \cup b_i| \ge 2$, $|a_i| \ge 1$ and $|b_i| \ge 1$. Usually, each net has one source pin and multiple sink pins. However, some nets may have multiple sources which share the same interconnect line. Furthermore, one pin can be both a source pin and a sink pin of the same net. Therefore, $a_i$ and $b_i$ may have a nonempty intersection.

For two disjoint vertex sets $X$ and $Y$, we shall use $E(X \to Y)$ to denote the directed cut set from $X$ to $Y$. Net set $E(X \to Y)$ contains all the nets $e_i = (a_i, b_i)$ such that $X$ intersects the source pin set $a_i$ and $Y$ intersects the sink pin set $b_i$, i.e.,

$$E(X \to Y) = \{e_i \mid e_i = (a_i, b_i),\ a_i \cap X \ne \emptyset,\ b_i \cap Y \ne \emptyset\}.$$

We use the function $C(X \to Y)$ to denote the total cut count of the nets in $E(X \to Y)$, i.e., $C(X \to Y) = \sum_{e_i \in E(X \to Y)} c_i$.

(iv) Performance Driven Partitioning  In performance driven partitioning [106], modules are distinguished into two types: combinational elements and globally clocked registers. In illustrations, we shall use circles to represent the combinational elements and rectangles to represent the registers (Fig. 13). Each module $v_i$ has an associated delay $d_i$.

A path $p$ of length $k$ from a module $v_i$ to a module $v_j$ is a sequence $(v_{p_0}, v_{p_1}, \ldots, v_{p_k})$ of modules such that $v_i = v_{p_0}$, $v_j = v_{p_k}$ and for each $l \in \{1, 2, \ldots, k\}$,
modules $v_{p_{l-1}}$ and $v_{p_l}$ are a source pin and a sink pin of a net in $E$, respectively.

(v) Clustering  Given a hypergraph $H(V, E)$, highly connected modules in $V$ can be grouped together to form single supermodules called clusters. After this process, a clustering $\Gamma = \{V_1, V_2, \ldots, V_k\}$ of the original hypergraph $H$ is obtained and a contracted (i.e., coarser) hypergraph $H_\Gamma(V_\Gamma, E_\Gamma)$ is induced, where $V_\Gamma = \{v_1^\Gamma, v_2^\Gamma, \ldots, v_k^\Gamma\}$. For every $e_j \in E$, the contracted net $e_j^\Gamma \in E_\Gamma$ if $|e_j^\Gamma| \ge 2$, where $e_j^\Gamma = \{v_i^\Gamma \mid e_j \cap V_i \ne \emptyset\}$; that is, $e_j^\Gamma$ spans the set of clusters containing modules of $e_j$. A contracted hypergraph, of course, can be used to induce another, coarser contracted hypergraph based on the same clustering process. On the other hand, a contracted hypergraph $H_\Gamma(V_\Gamma, E_\Gamma)$ can be unclustered to return to a finer hypergraph $H(V, E)$.

[...] $C(V_1, V_2)$ is minimized, i.e.,

$$\min_{v_s \in V_1,\ v_t \in V_2} C(V_1, V_2) \qquad (1)$$

where $V_1$ and $V_2$ are disjoint and the union of the two sets is equal to $V$.

This partitioning is strongly related to a linear placement problem. In a linear placement, we have $|V|$ equally spaced slots on a straight line (Fig. 2). Modules $v_s$ and $v_t$ are fixed at the two extreme ends, i.e., $v_s$ on the first slot (left end) and $v_t$ on the last slot (right end). The goal is to assign all modules to distinct slots to minimize the total wire length. Let us use $x_i$ to denote the coordinate of module $v_i$ after it is assigned to the slot. The length of a net $e_i$ can be expressed as the difference of the maximum coordinate and the minimum coordinate of the modules in the net, i.e., $\max_{v_j \in e_i} x_j - \min_{v_k \in e_i} x_k$. The total wire length can be expressed as follows.
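The link between cuts and linear placement can be made concrete with a short Python sketch. It computes the net length defined above and sums it over all nets; weighting each net by its connectivity $c_i$ is an assumption of the sketch, as are the function names and the five-module example. With unit slot spacing, the total wire length equals the sum of the cut counts taken at every gap between adjacent slots, which the example checks.

```python
def net_length(net, x):
    """Span of a net: max coordinate minus min coordinate of its pins."""
    coords = [x[v] for v in net]
    return max(coords) - min(coords)

def total_wire_length(nets, conn, x):
    """Sum of net spans, each weighted by its connectivity (assumed weighting)."""
    return sum(c * net_length(e, x) for e, c in zip(nets, conn))

def cut_count(nets, conn, left):
    """C(V1, V2) with V1 = left: nets having pins on both sides."""
    return sum(c for e, c in zip(nets, conn) if 0 < len(e & left) < len(e))

# Hypothetical example: module i sits on slot i (unit spacing).
nets = [{0, 1}, {1, 2, 3}, {0, 4}, {3, 4}]
conn = [1, 1, 2, 1]
x = {v: v for v in range(5)}

wire_length = total_wire_length(nets, conn, x)

# Cut count at each of the four gaps between adjacent slots.
order = sorted(x, key=x.get)
gap_cuts = [cut_count(nets, conn, set(order[:t])) for t in range(1, len(order))]
```

Each net of span $d$ crosses exactly $d$ gaps, which is why the two totals agree; this is the intuition behind using a min-cut to split a linear placement.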
modules locating at the two extreme ends of a linear placement. Then, there exists an optimal linear placement solution such that all modules in $V_2$ are on the slots right of all modules in $V_1$ (Fig. 2).

Thus, we can use the min-cut to partition a linear placement into two smaller problems and still maintain optimality. Conceptually, we can conceive that modules in $V_1$ or $V_2$ have stronger internal connection within the set than mutual connection to the other set. Thus, if the spans of the modules in $V_1$ and in $V_2$ are mixed in a linear placement, we can slide all modules in $V_1$ to the left and all modules in $V_2$ to the right to reduce the total wire length. In fact, this is the procedure to prove the theorem.

The min-cut with no size constraints can be found in polynomial time using classical maximum flow techniques [1]. However, it may happen that the optimal solution separates only $v_s$ or $v_t$ from the rest of the modules, i.e., $V_1 = \{v_s\}$ or $V_2 = \{v_t\}$. This result is very likely to occur because most VLSI basic modules have very small degrees of connecting nets (e.g., the degree of a 3-input NAND gate is 4).

3.1.2. Minimum Cost Ratio Cut

The cost ratio cut formulation supplies a partition different from the min-cut that separates two fixed modules. Thus, if the min-cut cannot provide any nontrivial solution, we may adopt the cost ratio cut to perform another trial.

In cost ratio cut, we fix two modules $v_s$ and $v_t$ at two different sides. Our objective is to find a vertex set $A$ to minimize a cost ratio function:

$$\frac{C(A, V - A - \{v_s\}) - C(A, \{v_s\})}{S(A)}$$

where vertex set $A$ does not contain $v_s$ and $v_t$. Vertex set $A$ is non-empty, i.e., $S(A) > 0$.

Cost ratio cut is also strongly related to a linear placement. Assuming that all nets are two pin nets, we can derive the following theorem [22]:

THEOREM 3.2  Given a graph $G(V, E)$ with modules $v_s$ and $v_t$ in $V$, let $(V_1, V_2)$ be an optimal cost ratio cut partition. There exists an optimal linear placement solution such that all modules in $A$ are on the slots left of all modules in $V - A - \{v_s\}$.

Conceptually, we can conceive that $C(A, V - A - \{v_s\})$ is the force to pull $A$ to the right and $C(A, \{v_s\})$ is the force to push $A$ to the left. The denominator $S(A)$ is the inertia of the set $A$. A set $A$ with the minimum cost ratio moves with the fastest acceleration toward the left end of the slots.

Example  In Figure 3, the circuit contains six modules. The optimum cost ratio cut solution has
$A = \{v_1, v_2, v_3\}$. The cost ratio value is

$$\frac{C(A, V - A - \{v_s\}) - C(A, \{v_s\})}{S(A)} = \frac{4 - 3}{3} = \frac{1}{3}. \qquad (4)$$

The cost ratio value of any other choice of set $A$ is larger than expression (4).

The cost ratio cut solution can be found in polynomial time for a special case of serial parallel graphs [22]. We are unaware of algorithms for general cases. Note that the solution may have $V - A - \{v_s\}$ equal to set $\{v_t\}$. In such a case, the partitioning result is not useful for decomposing the circuit.

3.1.3. Min-cut with Size Constraints

For min-cut with size constraints, we have lower and upper bounds on the partition size, $S_l$ and $S_u$, where $0 < S_l \le S_u < S(V)$ and $S_l + S_u = S(V)$. The bipartitioning problem is to divide vertex set $V$ into two nonempty partitions $V_1$, $V_2$, where $V_1 \cap V_2 = \emptyset$ and $V_1 \cup V_2 = V$, with the objective of minimizing cut count $C(V_1, V_2)$ and subject to the following size constraints:

$$S_l \le S(V_h) \le S_u \quad \text{for } h = 1, 2 \qquad (5)$$

The min-cut problem with size constraints is NP complete [43]. However, because of the importance of the problem in many applications, many heuristic algorithms have been developed.

Random Partitioning  We use a random partition estimation of min-cut with size constraints to demonstrate that the quality variation of partitioning results can be significant. Let us simplify the case by assigning the modules uniform size, i.e., $s_i = 1$ for all $v_i$ in $V$, and the nets uniform connectivity, i.e., $c_i = 1$ for all $e_i$ in $E$.

Let us assume that the modules are partitioned into two sets $V_1$, $V_2$ with equal sizes: $S(V_1) = S(V_2)$. The partition is performed with an independent random process [10] so that each module has a 50% chance to go to either side. For a net $e_i$ of two pins, we can derive that net $e_i$ belongs to the cut set $E(V_1, V_2)$ with a 0.5 probability (Fig. 4). Similarly, we can derive that for a net $e_i$ of $k$ pins ($k > 2$), the probability that net $e_i$ belongs to cut set $E(V_1, V_2)$ is $(2^k - 2)/2^k$. This probability is larger than 0.5 and approaches one as $k$ increases. In other words, the expected cut count $C(V_1, V_2)$ is equal to or larger than half the number of nets. For example, a circuit of one million modules usually has a number of nets of the same order, i.e., $|E| = O(|V|) = 1{,}000{,}000$. The expected cut count would be $C(V_1, V_2) \ge 500{,}000$. This number is much worse than the results we can achieve. In practice, the cut counts on circuits of a million modules are usually no more than several thousand [34, 36]. In other words, the probability that a net belongs to a cut set is small, below one percent for a circuit of one million gates.

Suppose the two bounds of partitioned sizes are not equal, $S_l \ne S_u$. Using the proposed random graph model, the expected cut count $C(V_1, V_2)$ is proportional to the product of the two sizes, i.e., $S(V_1) \times S(V_2)$. Consequently, the expected cut count is smallest if the size of one partition approaches the upper bound $S(V_i) = S_u$ and the size of the other partition approaches the lower bound $S(V_j) = S_l$. In practice, we do observe this behavior. One partition is fully loaded to its maximum capacity, while the other partition is underutilized with a large capacity left unused. This phenomenon is not desirable for certain applications.

FIGURE 4 Four possible configurations of net $e_i = \{a, b\}$ in a random placement.

3.1.4. Ratio Cut

Ratio cut formulation integrates the cut count and a partition size balance criterion into a single objective function [87, 109]. Given a partition $(V_1, V_2)$ where $V_1$ and $V_2$ are disjoint and $V_1 \cup V_2 = V$, the objective function is defined as

$$\frac{C(V_1, V_2)}{S(V_1) \times S(V_2)} \qquad (6)$$

The numerator of the objective function minimizes the cut count while the denominator avoids uneven partition sizes. Like many other partitioning problems, finding the ratio cut in a general network belongs to the class of NP-complete problems [87].

Example  Figure 5 shows a seven module example. The modules are of unit size and the nets are of unit connectivity. Partition $(V_1, V_2)$ has a cost $C(V_1, V_2)/(S(V_1) \times S(V_2)) = 2/(4 \times 3) = 1/6$. Any other partition corresponds to a much larger cost.

The Clustering Property of the Ratio Cut  The clustering property of the ratio cut can be illustrated by a random graph model. Let us assume that the circuit is a uniformly distributed random graph, with uniform module sizes, i.e., $s_i = 1$. We construct the nets connecting each pair of modules with identical independent probability $f$. Consider a cut which partitions the circuit into two subsets $V_1$ and $V_2$ with comparable sizes $\alpha \times |V|$ and $(1 - \alpha) \times |V|$ respectively, where $\alpha < 1$.

The expected cut count equals the probability $f$ multiplied by the number of possible nets between $V_1$ and $V_2$:

$$\mathrm{Expec}(C(V_1, V_2)) = f \times |V_1| \times |V_2| = \alpha(1 - \alpha)|V|^2 \times f. \qquad (7)$$

On the other hand, if another cut separates only one module $v_s$ from the rest of the modules, the expected cut count is

$$\mathrm{Expec}(C(\{v_s\}, V - \{v_s\})) = (|V| - 1) \times f. \qquad (8)$$

As $|V|$ approaches infinity, the value of Eq. (7) becomes much larger than Eq. (8).

This derivation provides another explanation of why the min-cut separating two fixed modules tends to generate very uneven sized subsets. The very uneven sized subsets naturally give the lowest cut value. Therefore, the ratio value $C(V_1, V_2)/(S(V_1) \times S(V_2))$ is proposed to alleviate the hidden size effect. As a consequence, the expected value of this ratio is a constant with respect to different cuts:

$$\mathrm{Expec}\left(\frac{C(V_1, V_2)}{S(V_1) \times S(V_2)}\right) = \frac{f \times |V_1| \times |V_2|}{|V_1| \times |V_2|} = f \qquad (9)$$

Thus, if the nets of the graph are uniformly distributed, all cuts have the same ratio value. In other words, the choice of the cuts and the partition sizes does not make a difference in such a uniformly distributed random graph. In a general circuit, different cuts generate different ratios. Cuts that go through weakly connected groups correspond to smaller ratio values. The minimum of all cuts according to their corresponding ratios defines the sparsest cut, since this cut deviates the most from the expectation on a uniformly distributed graph.

FIGURE 5 An example of seven modules, where partition $(V_1, V_2)$ is a minimum ratio cut.
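The ratio cut objective of Eq. (6) can be tried on a small graph by exhaustive enumeration. The Python sketch below is a brute-force illustration, not one of the heuristics discussed in this tutorial, and its seven-module graph is hypothetical: the netlist of Figure 5 is not available here, so the example uses a 4-clique and a 3-clique joined by two nets, which reproduces the cost $2/(4 \times 3) = 1/6$ quoted in the example. Unit module sizes and unit connectivities are assumed.

```python
from fractions import Fraction
from itertools import combinations

def cut_count(edges, part):
    """C(V1, V2) for two-pin nets given as (u, v, c), with V1 = part."""
    return sum(c for u, v, c in edges if (u in part) != (v in part))

def min_ratio_cut(n, edges):
    """Enumerate every bipartition of n unit-size modules and keep the
    smallest value of C(V1, V2) / (S(V1) * S(V2))."""
    best_ratio, best_part = None, None
    # Subsets up to size n // 2 cover all bipartitions up to symmetry.
    for size in range(1, n // 2 + 1):
        for subset in combinations(range(n), size):
            part = set(subset)
            ratio = Fraction(cut_count(edges, part), size * (n - size))
            if best_ratio is None or ratio < best_ratio:
                best_ratio, best_part = ratio, part
    return best_ratio, best_part

# Hypothetical graph: clique on {0,1,2,3}, clique on {4,5,6}, two joining nets.
edges = [(u, v, 1) for u, v in combinations(range(4), 2)]
edges += [(u, v, 1) for u, v in combinations(range(4, 7), 2)]
edges += [(3, 4, 1), (2, 5, 1)]

ratio, part = min_ratio_cut(7, edges)
```

On this graph the cut isolating module 6 also has only two nets, but its ratio is $2/(1 \times 6) = 1/3$, so the denominator steers the optimum toward the balanced split between the two cliques.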
and the expected value of cluster ratio equals

$$\mathrm{Expec}(R_c) = \mathrm{Expec}\left(\frac{C(V_1, V_2, \ldots, V_k)}{\sum_{i=1}^{k-1} \sum_{j=i+1}^{k} |V_i| \times |V_j|}\right) = f \qquad (17)$$

Since $f$ is a constant, all cuts have the same expected cluster ratio value. Therefore, if we use cluster ratio as the metric, all cuts would be equally favored, which is consistent with the fact that $G$ has no distinct clusters. However, in a general circuit, different cuts generate different ratio values. Cuts that go through weakly connected groups correspond to smaller ratio values. The minimum of all cuts according to their cluster ratio values defines the cluster structure of the circuit, since this cut deviates the most from the cuts of a uniformly distributed graph.

3.3. Multi-level Partitioning

In multi-level partitioning [4, 23, 47, 58, 67, 68, 109, 110], the final result is represented by a tree structure. All the modules are assigned to the leaves of the tree. The tree is directed from the root toward the leaves. The level of a node is defined to be the maximum number of nodes to traverse to reach the leaves. Thus, the leaves are ranked level zero. Each node is one level above the maximum level of its children. When the level of the root is only one, the problem degenerates to two-way or multiway partitioning.

Each net $e_i$ spans a set of leaves. Given a set of leaves, there is a unique lowest common ancestor. The level of the lowest common ancestor is defined to be the level $l(e_i)$ of the net.

The cost of a net $e_i$ is defined to be the product of its connectivity $c_i$ and the weight $w(l(e_i))$ of level $l(e_i)$ for net $e_i$ to communicate, i.e., $c_i \times w(l(e_i))$. The cost of the multi-level partition is the sum of the cost of all nets, i.e., $\sum_{e_i \in E} c_i\, w(l(e_i))$.

3.3.1. J-level K-way Partitioning

When the root of the partitioning tree is at level $j$ and the number of branches of each node is no more than $k$, we call it a $j$-level $k$-way partition. We can set different communication weights for each level. Usually, the function is monotone, i.e., $w(l)$ is larger when level $l$ increases. The vertex set $V_i$ of each leaf has its size bounded by $S_l \le S(V_i) \le S_u$.

For electronic packaging, the tree is bounded by the number of external connections. We say a leaf is covered by a node if there is a directed path
from the node to the leaf in the tree representation. For each node $n_i$, we define $T_i$ to be the union of the modules in the leaves covered by node $n_i$. Let $E(T_i)$ be the external nets of $T_i$, i.e., $E(T_i) = \{e_j \mid 0 < |e_j \cap T_i| < |e_j|\}$. The cut count of each node should not exceed the capacity of the external connection of the packaging, i.e.,

$$C(T_i) = \sum_{e_j \in E(T_i)} c_j \le \mathrm{Cap}(l(n_i)) \qquad (18)$$

where $\mathrm{Cap}(l(n_i))$ is the capacity of the external connection of level $l(n_i)$.

Example  Figure 7 shows an example of a 3-level 5-way partitioning structure. The leaves are at level 0 and the root is at level 3. Each node has at most five children. Net $e_i = \{v_1, v_2, v_3\}$ is covered by node $n_a$ at level $l(n_a) = 2$.

3.3.2. Generic Binary Tree

A generic binary tree structure [110] is proposed to simplify the multi-level partitioning. There is only one constant, $S_u$, to set in the binary tree. Thus, it is much easier to make a fair comparison between different algorithms.

In a generic binary tree, each internal node has exactly two children. The weight of each level is defined to be $w(l) = 2^l$. Thus, we have the objective function

$$\min \sum_{e_i \in E} c_i\, 2^{l(e_i)}$$

subject to the constraint on the capacity of the leaves, i.e., $S(V_i) \le S_u$, where $V_i$ is the vertex set of leaf $i$. The level of the root is adjusted according to the minimization of the objective function.

Example  Figure 8 illustrates a generic binary tree for partitioning. In this figure, the root is at level three. Each node has at most two children.

3.4. Replication Cut

In the replication cut problem, a subset of the circuit may be replicated to reduce the cut count of a partition [54, 64, 82]. In this section, we use a two-way partition to illustrate the problem. We fix two modules $v_s$ and $v_t$ at two sides of the cut. We use three vertex sets to represent the partition, $V_1$, $V_2$, and $R$, where $V_1$, $V_2$, and $R$ are disjoint and $V_1 \cup V_2 \cup R = V$, $v_s \in V_1$, $v_t \in V_2$. Subsets $V_1$ and $V_2$ are separated by the cut and subset $R$ is to be replicated at both sides (Fig. 9).

Each copy of $R$ needs to collect a complete set of input signals in order to compute the function properly. Thus, the nets from $V_1$ to $R$ and from $V_2$ to $R$ are duplicated. However, the output signals of $R$ can be obtained from either copy of $R$. For example, nets from the right side $R$ to $V_1$ in Figure 9(b) are not duplicated because $V_1$ gets inputs from the left side $R$. For the same reason, we do not replicate the nets from the left side $R$ to $V_2$.

FIGURE 9 Replication cut problem: (a) the three sets of nodes $V_1$, $R$ and $V_2$; (b) the duplicated circuit with $R$ being replicated.

[...]

$$\min C_R(V_1, V_2) = \sum_{e_i \in E_R(V_1, V_2)} c_i \qquad (19)$$
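The effect of replication on the cut count can be illustrated with directed nets $(a_i, b_i)$ as defined in Section 2(iii). The Python sketch below rests on one reading of the prose above, stated here as an assumption: a net crosses the replication cut when it has a source pin in $V_1$ and a sink pin in $V_2 \cup R$, or a source pin in $V_2$ and a sink pin in $V_1 \cup R$, while nets driven from $R$ are never cut because each side reads its own copy. The function names and the five-module circuit are hypothetical.

```python
def directed_cut(nets, conn, X, Y):
    """C(X -> Y): nets with a source pin in X and a sink pin in Y."""
    return sum(c for (a, b), c in zip(nets, conn) if a & X and b & Y)

def replication_cut(nets, conn, V1, V2, R):
    """Cut count when R is replicated on both sides (assumed reading:
    inputs of R are duplicated across the cut, outputs of R are not)."""
    total = 0
    for (a, b), c in zip(nets, conn):
        from_v1 = a & V1 and b & (V2 | R)
        from_v2 = a & V2 and b & (V1 | R)
        if from_v1 or from_v2:
            total += c  # each net is counted once even if both tests hold
    return total

# Hypothetical circuit: module 0 drives module 1, which fans out to 2, 3 and 5.
# Each net is (source pins, sink pins).
nets = [({0}, {1}), ({1}, {2}), ({1}, {3}), ({1}, {5})]
conn = [1, 1, 1, 1]

# Without replication, module 1 must sit on one side of the cut.
cut_r_left = replication_cut(nets, conn, {0, 1, 2}, {3, 5}, set())
cut_r_right = replication_cut(nets, conn, {0, 2}, {1, 3, 5}, set())

# With R = {1} replicated, only its single input net crosses the cut.
cut_replicated = replication_cut(nets, conn, {0, 2}, {3, 5}, {1})
```

With an empty $R$ the function reduces to $C(V_1 \to V_2) + C(V_2 \to V_1)$ for nets that cross in one direction only, so the same routine covers the unreplicated baseline.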
(v, v:)
- -
(v v:) (v.
(v --+ ) e(v:
Let St and S, denote the size limits on the two
v)
).
VfhV--O, R--V-V-V2.
Interpretation of the Replication Cut Suppose
we rewrite the replication cut in the format:
-
(v: v) (v )
(v ’) (v --+ -
it is called coarse-grained. The interpartition delay $\delta$ on crossing nets is inherently coarse-grained and cannot be split.

[...] where $L$ is the set of all loops. Note that the iteration bound of a given circuit yields a lower bound on the achieved clock period by retiming.
(ii) Latency Bound  Let $p$ denote the IO-W-critical path with maximum path delay among all IO-W-critical paths from $v_i$ to $v_j$. Since the number of registers in path $p$ is equal to $W(i, j)$, the IO latency (i.e., $(W(i, j) - 1) \times T$) between $v_i$ and $v_j$ is not less than $d_p + \delta_p$, where $T$ denotes the clock period, and $d_p$ and $\delta_p$ are the sum of combinational block delays and the sum of interpartition delays on path $p$, respectively. Thus, we define latency bound $M$ as follows [85, 86]:

$$M(V_1, V_2) = \max\{d_p + \delta_p \mid p \in P_{IOW}\}, \qquad (22)$$

where $P_{IOW}$ is the set of all IO-W-critical paths. Latency bound also imposes a lower bound on the system latency achieved by using retiming. An all-pair shortest-path algorithm can be used to calculate the latency bound.

We have two reasons to use the iteration and latency bounds. (i) It is faster to calculate these bounds. (ii) The iteration and latency bounds stand for the lower bounds of the clock period and system latency achieved by adopting retiming, respectively. The partition with lower iteration and latency bounds can achieve better clock period and system latency by using retiming. Therefore, we want to generate a partition with small iteration and latency bounds.

Statement of the Problem  Now we state the performance-driven partitioning problem as follows: Given hypergraph $H(V, E)$, two numbers $\bar{J}$ and $\bar{M}$, bounds of sizes $S_l$ and $S_u$, and interpartition delay $\delta$, find a partition $(V_1, V_2)$ with the minimum number of cut count, subject to $S_l \le S(V_1) \le S_u$, $S_l \le S(V_2) \le S_u$, $J(V_1, V_2) \le \bar{J}$, and $M(V_1, V_2) \le \bar{M}$.

Example  Figure 13 illustrates the effect of replication on the iteration bound. Let us assume that the interpartition delay is $\delta = 4$. Before replication, the iteration bound is dominated by loop $l_1$. The bound is equal to

$$\frac{d_{l_1} + \delta_{l_1}}{r_{l_1}} = \frac{8 + 2 \times 4}{4} = 4. \qquad (23)$$

After replication [85], the bound contributed by loop $l_1$ is equal to

$$\frac{d_{l_1} + \delta_{l_1}}{r_{l_1}} = \frac{8}{4} = 2. \qquad (24)$$

The iteration bound now is dominated by the union of loops $l_1$ and $l_2$,

$$\frac{d_{l_1 + l_2} + \delta_{l_1 + l_2}}{r_{l_1 + l_2}} = \frac{18 + 2 \times 4}{8} = 3.25, \qquad (25)$$

which is smaller than the iteration bound before replication.

3.6. Clustering

Clustering [6] is similar to multiway partitioning in that the process groups modules into $k$ subsets. However, for clustering the number of subsets is usually much greater than for a typical multiway partitioning problem, e.g., $k \ge 10$.

Often, a clustering process is used as part of a divide and conquer approach. Thus, it is
important to choose an objective function that fits the target application. If the goal is to reduce problem complexity, we set the objective function to be:

$$\min \sum_{i=1}^{k} \frac{C(V_i)}{C_I(V_i)} \qquad (26)$$

where the $V_i$'s are disjoint vertex sets and their union is equal to $V$. Function $C(V_i)$ is the external cut count of cluster $V_i$ and $C_I(V_i)$ is the count of nets connecting vertex set $V_i$, i.e., $\sum_{e_j \in I(V_i)} c_j$.

For performance driven clustering, the objective function is to minimize the number of cuts between registers.

4. MULTIPLE PIN NET MODELS

The handling of multiple pin nets strongly depends on the partitioning approach [102]. A proper model is needed to reflect the correct cut count and improve the efficiency. In this section, we first introduce a shift model which is used for iterations of shifting a module or swapping a pair of modules. We then describe a clique model which is used to replace a multiple pin net. The star and loop models are variations of two pin net models, however, with less complexity than the clique model. Finally, a flow model is introduced for network flow approaches.

4.1. Shift Model

The shift model [101] for multiple pin nets is useful when we perturb the partition by shifting one [...] cost for each module if it were to be shifted, so that we can rank the modules for the next move. Such cost revision can be expensive if the circuit has large nets which contain huge numbers of pins, e.g., hundreds of thousands of pins.

The shift model reduces the complexity of the cost revision by utilizing the property that, for huge nets, most shifts of its pins do not change the cost of the other pins in the net.

Let us simplify the description by considering a two way partitioning. The model can be extended to multiple way partitioning according to the choice of objective functions. Let module $v_j$ be shifted from vertex set $V_1$ to $V_2$. The configuration of nets $e_i \in E(\{v_j\})$ connecting module $v_j$ is revised.

For each net $e_i$, we denote $k_i$ to be the number of pins of $e_i$ in $V_1$ and $|e_i| - k_i$ the number of pins of $e_i$ in $V_2$ (Fig. 14). With respect to net $e_i$, we update the pin numbers $k_i$ and $|e_i| - k_i$ after module $v_j$ is shifted. We also update the cost of modules in nets

1. If the revised $k_i \ge 2$, the potential cost of pins due to net $e_i$ is zero. For the case that $|e_i| - k_i = 1$, we increase the cut count by $c_i$ and set the potential cost of pins in $e_i$. Otherwise, the move has no effect on the cut count and potential cost.
2. If the revised pin count $k_i = 1$, the shift of the last pin of $e_i$ in $V_1$ will decrease the cut count by $c_i$. We then update the potential cost of this last pin.
3. If $k_i = 0$, the cut count reduces by $c_i$. However, the shift of any pin $v \in e_i$ from $V_2$ to $V_1$ will increase the cut count. Thus, in this case, we reflect the cost of potential shift on the pins of $e_i$, which takes $O(|e_i|)$ operations.
4.2. Clique of Two Pin Nets

Some researchers use cliques of two pin nets to model multiple pin nets. Given a multiple pin net $e_i$, we construct a clique of $(1/2)|e_i|(|e_i| - 1)$ two pin nets to connect all pairs of pins in the net. The clique model maintains the symmetric relation of the modules of the same net in the sense that the order of the pins in the net has no effect on the cost.

The weight of the two pin nets in the clique model is adjusted by some factor. One approach is to use $2/|e_i|$ to scale down the connectivity. The total weight of all the nets in the clique is $(2/|e_i|) \times (1/2)|e_i|(|e_i| - 1)c_i = (|e_i| - 1)c_i$. Note that it takes $|e_i| - 1$ two pin nets to form a spanning tree of $|e_i|$ modules.

Other factors have been proposed, such as $1/(|e_i| - 1)$, which is based on a different probability model. However, no factor can exactly reflect the cost of a multiple pin net model.

Complexity of the Clique Model  The complexity of the clique model is high. There are $O(|e_i|^2)$ two pin nets in a clique model. Suppose the processing of each two pin net takes a constant time. It takes $O(|e_i|^2)$ operations to process a multiple pin net $e_i$. Therefore, in practice, if the pin number is larger than a threshold, the net is ignored in the process.

4.4. Loop Model of Two Pin Nets

A loop model reflects the exact cut count [22]; however, it is sensitive to the order of the pins. We can derive a heuristic ordering of the pins using a linear placement. Modules are sequenced according to their $x$ coordinates in the placement. We find the partition by collecting the modules according to the sequence.

Following the order of the modules in the $x$ coordinates, we link the modules of a multiple pin net with two pin nets into a loop. We link the pins in a sequence (Fig. 15) alternating on every other module. The loop is formed by the two connections at the two ends.

A factor of $(1/2)$ is assigned to the two pin nets so that the cut count separating modules according to the sequence is one. The model remains correct even if any two consecutive modules in the sequence swap their order.

4.5. Flow Model

For the network flow approach, we consider each net $e_i$ as a pipe. A set of saturated pipes forms a bottleneck of the flow. The union of the saturated pipes becomes the cut of the circuit. In such a model, we set the capacity of the pipe equal to the corresponding connectivity $c_i$ [52].
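Returning to the clique and loop models of Sections 4.2 and 4.4, the Python sketch below expands one multiple pin net both ways and compares the resulting cut costs. The loop construction is one reading of "alternating on every other module", taken here as an assumption: pin $t$ is linked to pin $t+2$ in the sequence, and the two ends are closed by linking the first two pins and the last two pins. Function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def clique_edges(pins, c=1):
    """All pairs of pins, each weighted 2c/|e| (one proposed scaling)."""
    pins = list(pins)
    w = Fraction(2 * c, len(pins))
    return [(u, v, w) for u, v in combinations(pins, 2)]

def loop_edges(ordered_pins, c=1):
    """Loop over the pins in placement order, each two pin net weighted c/2.

    Assumed construction: skip links (t, t+2) plus one closing link per end.
    """
    p = list(ordered_pins)
    w = Fraction(c, 2)
    edges = [(p[0], p[1], w), (p[-2], p[-1], w)]
    edges += [(p[t], p[t + 2], w) for t in range(len(p) - 2)]
    return edges

def graph_cut(edges, left):
    """Total weight of two pin nets with exactly one end in `left`."""
    return sum(w for u, v, w in edges if (u in left) != (v in left))

# A five-pin net whose pins are already ordered by x coordinate.
pins = [0, 1, 2, 3, 4]
clique_total = sum(w for _, _, w in clique_edges(pins))

# Cost seen by each split of the sequence into a left and a right part.
loop_prefix_cuts = [graph_cut(loop_edges(pins), set(pins[:t]))
                    for t in range(1, len(pins))]
clique_prefix_cuts = [graph_cut(clique_edges(pins), set(pins[:t]))
                      for t in range(1, len(pins))]
```

Every split along the sequence cuts exactly two loop links, giving the exact cost $c_i$, whereas the clique cost varies with how the pins are divided; this is the sense in which no single clique factor is exact.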
Let $x_{iu}$ be the amount of flow from pin $v_u$ to net $e_i$ and $x_{ui}$ be the amount of flow from net $e_i$ to pin $v_u$ (Fig. 16). The total flow injected into the net should be smaller than or equal to its capacity, and the incoming flow is equal to the outgoing flow, i.e.,

$$\sum_{v_u \in e_i} x_{iu} \le c_i, \qquad (27)$$

$$\sum_{v_u \in e_i} x_{iu} - \sum_{v_u \in e_i} x_{ui} = 0. \qquad (28)$$

[...] functions. For example, we can apply group migration to multiway [98, 99] or multiple level partitioning problems [67, 68] with modification to the cost of the moves. Furthermore, some methods may be combined to solve a problem. For example, we can use clustering to reduce the size of an input circuit and then use group migration to find a partition of the reduced circuit with much greater efficiency [24, 59]. In fact, this strategy derives the best results in terms of CPU time and cut count in a recent benchmark [2].
$$\frac{|V|!}{\left((|V|/2)!\right)^2} \approx \sqrt{\frac{2}{\pi |V|}}\; 2^{|V|} \qquad (29)$$

FIGURE 17 Construction of serial and parallel graphs.
Although the number of combinations is huge, we have found that the application to small circuits is practical. We improve the efficiency of the pruning by ordering the modules according to their degrees, i.e., the number of nets connecting to the modules, in a descending order. With an elegant implementation, we can find optimal solutions when the number of modules is small, e.g., $|V| \le 60$.
5.2. Dynamic Programming for a Serial and Parallel Graph
For the special case where the circuit can be represented by a serial and parallel graph of unit module size, we can find a minimum two way partition $(V_1, V_2)$ with size constraints in polynomial time. In this section, we first describe the serial and parallel graph. We then depict a dynamic programming algorithm that solves the partitioning problem on this class of graphs. We assume that all modules are of unit size, i.e., $s_i = 1$.
A serial and parallel graph can be constructed from smaller serial and parallel graphs by serial or parallel process. Each serial and parallel graph has a source module $v_s$ and a sink module $v_t$. A graph $G(V, E)$ with two modules, $V = \{v_s, v_t\}$, and one edge, $E = \{e\}$, $e = \{v_s, v_t\}$, is a basic serial and parallel graph. A serial and parallel graph is constructed from the basic graph by a series of serial and parallel processes.
Serial Process Given two serial and parallel graphs, $G_1(V_1, E_1)$ and $G_2(V_2, E_2)$, we construct a serial and parallel graph $G(V, E)$ by merging the sink module $v_{t1}$ of $G_1$ and the source module $v_{s2}$ of $G_2$ (Fig. 17(a)). The source module $v_{s1}$ of graph $G_1$ becomes the source module of graph $G$, i.e., $v_s = v_{s1}$. The sink module $v_{t2}$ of graph $G_2$ becomes the sink module of graph $G$, i.e., $v_t = v_{t2}$.
Parallel Process Given two serial and parallel graphs, $G_1(V_1, E_1)$ and $G_2(V_2, E_2)$, we construct a serial and parallel graph $G(V, E)$ by merging the source module $v_{s1}$ of $G_1$ and the source module $v_{s2}$ of $G_2$ and by merging the sink module $v_{t1}$ of $G_1$ and the sink module $v_{t2}$ of $G_2$ (Fig. 17(b)). The merged source module and merged sink module become the source module $v_s$ and the sink module $v_t$ of graph $G$, respectively.
Dynamic Programming The dynamic programming algorithm performs a bottom up process according to the construction of the serial and parallel graph. It starts from the basic serial and parallel graph. For each graph $G(V, E)$, we derive two tables.
a(i,j): the minimum cut count with $i$ modules on the left hand side and $j$ modules on the right hand side under the condition that source module $v_s$ is on the left hand side and sink module $v_t$ is on the right hand side.
b(i,j): the minimum cut count with $i$ modules on the left hand side and $j$ modules on the right hand side under the condition that both source module $v_s$ and sink module $v_t$ are on the left hand side.
Let graph $G(V, E)$ be constructed with $G_1(V_1, E_1)$ and $G_2(V_2, E_2)$ by one of the serial and parallel processes. Let $a_1$, $b_1$ be the tables of graph $G_1$ and $a_2$, $b_2$ be the tables of graph $G_2$. We construct the tables $a$, $b$ of graph $G(V, E)$ as follows.
Table Formulas for Parallel Process
$a(i, j) = \min_{k+m=|V_2|} a_1(i + 1 - k, j + 1 - m) + a_2(k, m), \quad \forall i + j = |V|$, (30)
$b(i, j) = \min_{k+m=|V_2|} b_1(i + 2 - k, j - m) + b_2(k, m), \quad \forall i + j = |V|$. (31)
For table a(i,j), we try all combinations of tables $a_1$ and $a_2$ with the constraint that the number of modules on the left hand side is $i$ and the number of modules on the right hand side is $j$. Note that the extra addition of 1 in the index is used to compensate the merging of the two source modules or the sink modules. For table b(i,j), we try all combinations of tables $b_1$ and $b_2$ with the same size constraint.
Table Formula for Serial Process
$a(i, j) = \min\big(\min_{k+m=|V_2|} a_1(i - k, j + 1 - m) + b_2(m, k),\ \min_{k+m=|V_2|} b_1(i + 1 - k, j - m) + a_2(k, m)\big), \quad \forall i + j = |V|$, (32)
$b(i, j) = \min\big(\min_{k+m=|V_2|} a_1(i - k, j + 1 - m) + a_2(m, k),\ \min_{k+m=|V_2|} b_1(i + 1 - k, j - m) + b_2(k, m)\big), \quad \forall i + j = |V|$. (33)
For table a(i,j), we try all combinations of tables $a_1$ and $b_2$ and all combinations of tables $b_1$ and $a_2$. For the combinations of tables $a_1$ and $b_2$, the merged module (by merging $v_{t1}$ and $v_{s2}$) is on the right hand side. For the combinations of tables $b_1$ and $a_2$, the merged module is on the left hand side. For table b(i,j), we try all combinations of tables $a_1$ and $a_2$ and all combinations of tables $b_1$ and $b_2$. For the combinations of tables $a_1$ and $a_2$, the merged module is on the right hand side. In terms of $G_2$, its source module $v_{s2}$ is on the right hand side and its sink module $v_{t2}$ is on the left hand side. Thus, the indices of table $a_2$ are reversed, i.e., $a_2(m, k)$ instead of $a_2(k, m)$. For the combinations of tables $b_1$ and $b_2$, the merged module is on the left hand side.
5.3. Group Migration Algorithms
The group migration algorithm was first proposed by Kernighan and Lin [60] in 1970. Since then, many variations [15, 26, 27, 33, 39, 45, 49, 84, 97-99, 108, 111, 116] have been reported to improve the efficiency and effectiveness of the method. Today, it is still a popular method in practice.
The probability of finding the optimum solution in a single trial drops exponentially as the size of the circuit increases [60]. Using the original version, Kernighan and Lin showed that the probability of obtaining an optimal solution is a function of the problem size, $p(|V|) = 2^{-n/30}$, where $n = |V|$. In other words, if the circuit size is large, then the heuristic Kernighan-Lin algorithm is unlikely to jump out of local minima, and so the optimum solution will not be found. The progress made by researchers on the method has definitely pushed the envelope further.
In this section, we concentrate on two-way min-cut with size constraints. The method is flexible and can be extended to other partitioning problems with modifications of the moves and the cost function.
The algorithm performs a series of passes. At the beginning of a pass, each module is labeled unlocked. Once a module is shifted, it becomes locked in this pass. The group migration algorithm iteratively interchanges a pair of unlocked modules
5.3.3. Data Structure
The choice of data structure strongly depends on the cost functions, gains, and the characteristics of VLSI circuitry. A sorting structure such as a heap or AVL tree is a natural choice to sort for the top modules. However, when the gain takes only a limited number of distinct values, an array structure can simplify the coding and reduce the complexity.
(i) Heap or AVL Tree We can use a heap or AVL tree to sort the modules according to their shift gain. Each side of the partition keeps a heap. The top of the heap is the module of the maximum gain. The sorting of each module takes $O(|V| \log |V|)$ operations.
(ii) Array (Bucket) of Link List Figure 19 illustrates a bucket list data structure. The gain is transformed to the index of the bucket [40]. Modules of the same gain are stored in the same bucket by a linked list. A bucket is an effective data structure when the objective function is the cut count. The gain of cut count is limited by the maximum degree of the modules, i.e., $\deg_{max} = \max_{v_i \in V} |E(\{v_i\})|$. Thus, the dimension of the bucket is set to be $2 \deg_{max}$.
[Figure 19: bucket list data structure; labels: max gain, module #, module]
For VLSI applications, the degree of modules is much smaller than the number of modules. Thus, the dimension of the bucket is small. It is very efficient to search and revise the module order in the bucket structure. In fact, it is proven that using the bucket structure and cut count as the objective function, it takes linear time proportional to the total number of pins to perform each pass [40].
5.3.4. Gains
In this subsection, we use cut count as the objective function. The extension to other cost functions is possible. However, we may lose efficiency.
(i) Shift Gain We use shift model for multiple pin nets. Given a module $v_i$, we check the set $E(\{v_i\})$ of nets connecting to this module. The contribution of each net $e \in E(\{v_i\})$ by shifting module $v_i$ is the gain $g_e(v_i)$ of the net with respect to module $v_i$. The gain $g(v_i)$ of module $v_i$ is the total gains of all its adjacent nets, i.e., $g(v_i) = \sum_{e \in E(\{v_i\})} g_e(v_i)$.
(ii) Swap Gain The swap gain is the sum of the gains of two modules $v_i$ and $v_j$, deducting the effect on common nets, i.e., $g(v_i) + g(v_j) - \sum_{e \in E(\{v_i\}) \cap E(\{v_j\})} (g_e(v_i) + g_e(v_j))$.
(iii) Weights of Multipin Nets The sequence of the move depends much on the gain calculation. For a circuit of 1,000,000 modules, suppose the degree of most modules is less than 100 and each net is of unit weight. We have roughly 1,000,000 modules / 200 gain levels = 5,000 modules per gain level. To differentiate these 5,000 modules, we have to adjust the weight of multiple pin nets.
(iii) (a) Levels with Priority The first level gain is identical to the shift gain of cut count. The second level gain is equal to the number of nets that have one more pin on the same side. Thus, the kth level gain is equal to the number of nets that have k more pins on the same side [65]. The pins on the other side will increase by one after the module is shifted. Thus, the negative gain of level k is contributed by the nets with k-1 pins on the other side.
Let us assume that module $v_i$ is in vertex set $V_1$ to simplify the notation. For each net $e_j \in E(\{v_i\})$, we denote by $k_j = |e_j \cap V_1|$ the number of pins in $V_1$. Let us define $E(+, i, k)$ to be the set of nets $e_j \in E(\{v_i\})$ with $k_j = k + 1$ pins in $V_1$ (the extra one is used to count module $v_i$ itself) and nonzero pins in $V_2$, i.e., $|e_j| > k_j$, and $E(-, i, k)$ to be the set of nets $e_j \in E(\{v_i\})$ with no other pins in $V_1$ and $k - 1$ pins in $V_2$, i.e., $|e_j| = k$ and $k_j = 1$. Then, the kth level gain of module $v_i$, $g_i(k)$, is the weight difference of the two sets, $E(+, i, k)$ and $E(-, i, k)$.
$g_i(k) = \sum_{e \in E(+,i,k)} c_e - \sum_{e \in E(-,i,k)} c_e$ (34)
$E(+, i, k) = \{e_j \mid e_j \in E(\{v_i\}),\ k_j = k + 1,\ |e_j| > k_j\}$ (35)
$E(-, i, k) = \{e_j \mid e_j \in E(\{v_i\}),\ k_j = 1,\ |e_j| = k\}$ (36)
We compare the modules with a priority on the lower level gain. In other words, we compare the first level first. If the modules are equal at the first level gain, we then compare the second level and so on. In practice, we limit the number of levels by a threshold, e.g., $\le 3$.
(iii) (b) Probabilistic Gain In probabilistic gain model [37], each module $v_i$ is assigned a weight $p(v_i)$. The weight $p(v_i)$ is a function of the gain $g(v_i)$ of module $v_i$ to reflect the belief level (potential) that the shift of module $v_i$ will be executed at the end of the pass. Thus, if module $v_i$ is unlocked,
$p(v_i) = f(g(v_i))$. (37)
Otherwise, $p(v_i) = 0$. Figure 20 illustrates function $f$, which increases monotonically. The slope within $g_o$ and $g_{up}$ amplifies the difference of gains. The slope is clamped at two ends $p_{max}$ and $p_{min}$ ($0 \le p_{min} < p_{max} \le 1$) which represent the maximum potential that the module will shift or stay.
FIGURE 20 Function of probabilistic gain.
For each net $e \in E(\{v_i\})$, its contribution $g_e(v_i)$ to the gain of module $v_i$ is the tendency that the whole net will shift with module $v_i$ to the other side. To simplify the notation, let us assume that module $v_i$ is in $V_1$. Thus, we have the following expression.
$g_e(v_i) = c_e \times \big( \prod_{j \ne i,\ v_j \in e \cap V_1} p(v_j) - \prod_{v_j \in e \cap V_2} p(v_j) \big)$, (38)
where $\prod_{v_j \in s} p(v_j) = 1$ if $s$ is an empty set. The first term $\prod_{j \ne i,\ v_j \in e \cap V_1} p(v_j)$ in the parentheses is the potential that all the pins will shift with module $v_i$ to $V_2$. Hence, $c_e \times \prod_{j \ne i,\ v_j \in e \cap V_1} p(v_j)$ is the expected gain if module $v_i$ is shifted. The second term $\prod_{v_j \in e \cap V_2} p(v_j)$ is the potential that the pins in $V_2$ will shift to $V_1$. Thus, $c_e \times \prod_{v_j \in e \cap V_2} p(v_j)$ is the expected loss if module $v_i$ is shifted.
The gain of a module $v_i$ is the total gains of the adjacent nets with respect to this module, i.e.,
$g(v_i) = \sum_{e \in E(\{v_i\})} g_e(v_i)$. (39)
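Expressions (37) to (39) can be sketched as follows. This is a minimal illustration under stated assumptions: the text only says that $f$ is monotone and clamped between $p_{min}$ and $p_{max}$, so the linear ramp between the two gain thresholds is an assumption, and all function and parameter names are hypothetical.

```python
def potential(gain, g_lo, g_up, p_min, p_max, locked=False):
    """Expression (37): potential of a module; a locked module has potential 0.

    Assumption: f rises linearly between g_lo and g_up and is clamped outside.
    """
    if locked:
        return 0.0
    if gain <= g_lo:
        return p_min
    if gain >= g_up:
        return p_max
    return p_min + (p_max - p_min) * (gain - g_lo) / (g_up - g_lo)

def net_gain(net, weight, vi, side, p):
    """Expression (38): weight * (product over the other pins on vi's side
    minus product over the pins on the opposite side); empty product is 1."""
    same, other = 1.0, 1.0
    for vj in net:
        if vj == vi:
            continue
        if side[vj] == side[vi]:
            same *= p[vj]
        else:
            other *= p[vj]
    return weight * (same - other)

def module_gain(vi, nets, weights, side, p):
    """Expression (39): sum of the net gains over the nets containing vi."""
    return sum(net_gain(e, c, vi, side, p)
               for e, c in zip(nets, weights) if vi in e)

nets = [("a", "b", "c"), ("a", "d")]
weights = [1.0, 1.0]
side = {"a": 1, "b": 1, "c": 2, "d": 2}
p = {"a": 0.5, "b": 0.5, "c": 0.25, "d": 1.0}
g_a = module_gain("a", nets, weights, side, p)   # 0.25 from the first net, 0 from the second
```

In this example the second net contributes nothing because its only other pin sits on the opposite side with potential 1.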
Net gain $g_e(v_i)$ and module potential $p(v_i)$ are mutually dependent. We derive the values via iterations. Initially, we use the plain shift gain (by cut count) to derive the potential $p(v_i) = f(g(v_i))$. From these initial potentials, we derive the probabilistic net gain. The net gain is then used to derive the module gain. In practice, we stop after a limited number of cycles, e.g., two iterations [37]. Note that there is no guarantee that the iteration will converge.
After each move, the associated module potential and probabilistic net gains are updated and the plain cut count is recorded. Exact cut count is used when we select the subsequence of moves to execute.
It has been shown via benchmarks released by ACM/SIGDA that the probabilistic gain model produces excellent partitioning results; it outperforms the other gain models by wide margins.
5.3.5. Net-based Move
The net based process [32, 115] is similar to the module based approach except that all operations are based on the concept of the critical and complementary critical sets. The main differences are: (1) Instead of a single module, each move now shifts one critical or complementary critical set, depending on the type of objective function. For convenience, we say a move is initiated by a net $e_u$ if this move is composed of shifting the critical or complementary critical set associated with $e_u$. (2) The locking mechanism is operated on a net, that is, if the critical or complementary critical set of a net has been moved then all the moves initiated by this net will be prohibited thereafter.
Given a net $e_u$ and a vertex set $V_b$, let us define the critical set of net $e_u$ with respect to set $V_b$ as
$s_{ub} = e_u \cap V_b$, (40)
and the complementary critical set of $e_u$ with respect to set $V_b$ as
$\bar{s}_{ub} = e_u - V_b$. (41)
For a move associated with a net $e_u$, we can either place the critical set $s_{ub}$ into a partition other than $V_b$, or the complementary critical set $\bar{s}_{ub}$ into the partition $V_b$. The gain of each move is then computed by evaluating the change of the cost due to the move of the critical or complementary critical set.
Usage of Basic Module Moves Although the net-based move model provides a different process to improve the current partition, it is more expensive than the module-based move model because more modules are involved in each move.
We can mimic the net based move by adding weights to the connectivity of desired nets [38]. The basic move is still based on the modules. However, after module $v_i$ is moved, we add more weights on the nets connecting to $v_i$, i.e., $E(\{v_i\})$. These extra weights encourage the adjacent modules to go along with module $v_i$ and thus achieve the effect of a net based move. Empirical study finds improvement on the partitioning results.
5.3.6. Simulated Annealing Approach
For simulated annealing [14, 20, 56, 62, 81], we can adopt the basic moves such as module shifting and pairwise swapping. There is no need of a lock mechanism. To allow a larger searching space, we incorporate the size constraints into the objective function, e.g.,
$C(V_1, V_2) + \alpha (S(V_1) - S(V_2))^2$, (42)
where $\alpha$ is a coefficient. We can adjust it according to the annealing temperature. As temperature drops, we gradually increase $\alpha$ to enforce the size balance.
5.4. Flow Approaches
In this section, we assume that the circuit can be represented by a graph $G(V, E)$ with unit module size, i.e., $s_i = 1$, and all nets are two pin nets. The flow approach can be extended to multiple pin nets using a flow model.
subject to
d/j
Obj" min
C I,
Z cidi
eo.EE
i,j,p <_ gl
(55)
(56)
In a uniform multi-commodity flow problem [74, 75], the demand of flow between each pair of modules is equal to an identical value $f$. As we
be the flow for commodity $p$ on net $e_{ij}$. The objective is to maximize $f$:
Obj: $\max f$ (52)
subject to the flow demand from module $v_p$ to the other modules
$\begin{cases} -f/2 & \text{if } i \ne p, \text{ and } 1 \le i, p \le |V|, \\ (|V| - 1) f/2 & \text{if } i = p, \text{ and } 1 \le i, p \le |V|, \end{cases}$ (53)
$\sum_{p=1}^{|V|} \sum_{i=1, i \ne p}^{|V|} (\lambda_i^{(p)} - \lambda_p^{(p)}) \ge 1$ (57)
Property I: Triangular Inequality The distance metric $d$ satisfies the triangular inequality:
$d_{ij} + d_{jk} \ge d_{ik}, \quad \forall v_i, v_j, v_k \in V$ (58)
Property II: Potential Function The term $\lambda_i^{(p)} - \lambda_p^{(p)}$ in expression (56) is equal to the shortest distance between modules $v_i$ and $v_p$ based on net distances $d_{ij}$. In fact, from triangular inequality, we obtain $\lambda_i^{(p)} - \lambda_p^{(p)} = d_{ip}$.
We normalize the objective function (55) with the left hand side terms of inequality (57). The objective function can be expressed as:
specifically, nets connecting modules in different sets, $V_i$, $V_j$, $i \ne j$, have the same distance $d_{ij}$ values (we use $d_{ij}$ to denote the distance between vertex sets $V_i$ and $V_j$ when this does not cause confusion), while nets connecting only modules in the same subgraph have zero distance, $d_{ij} = 0$ (Fig. 22). We can rewrite the denominator of the objective function and state the problem as follows.
Statement of Weighted Cluster Ratio Cut [103] Find the distance $d_{ij}$ and the number of partitions $k$ with an objective function of weighted cluster ratio:
$\min_{d_{ij}, k} W_c(V_1, V_2, \ldots, V_k) = \min_{d_{ij}, k} \frac{\sum_{i=1}^{k-1} \sum_{j=i+1}^{k} d_{ij} C(V_i, V_j)}{\sum_{i=1}^{k-1} \sum_{j=i+1}^{k} d_{ij} S(V_i) S(V_j)}$ (60)
where distance $d_{ij}$ is subject to the property of triangular inequality.
According to the mechanism of the duality, the objective functions of the primal and dual formulations are equal when the solution is optimal [25].
THEOREM 5.1 For feasible solutions, we have the inequality $f \le W_c(V_1, V_2, \ldots, V_k)$. The equality holds when the solution is optimal, i.e., the maximum uniform multicommodity flow equals the minimum weighted cluster ratio of any cut,
$\max_{x_{ij}} f \le \min_{d_{ij}, k} W_c(V_1, V_2, \ldots, V_k)$.
Expression (60), weighted cluster ratio [103], is similar to cluster ratio with a weighted metric $d_{ij}$. In general, the solution for the minimum weighted cluster ratio does not directly correspond to the partition of optimum cluster ratio. However, if distance $d_{ij}$ is a constant value between all pairs of vertex sets $V_i$ and $V_j$ then the weighted cluster ratio provides the solution for cluster ratio.
When the nets with positive distance $d_{ij}$ form a two-way partition, we can show that the partition defines the ratio cut. When the nets with positive distances form a k-way partition with $k \le 4$, we also find that there exists a two-way partition that again defines the ratio cut [28].
THEOREM 5.2 Let net set $D = \{e_{ij} \mid d_{ij} > 0\}$ define a cut that separates the circuit into $k$ disconnected subsets. If $k \le 4$, then there exists a ratio cut that is a subset of $D$.
5.4.3. A Replication Cut for Two-way Partitioning
We adopt the linear programming formulation of network flow problem [1, 30], where each module is assigned a potential and a cut is represented by the difference of module potentials as shown in Figure 23. With respect to the directed cut $E(V_1 \to \bar{V}_1)$, we use $w_{ij}$ to denote the potential difference between the cut from module $v_i \in V_1$ to module $v_j \in \bar{V}_1$. The potential of each module $v_i$ is denoted by $p_i$. For module $v_i$ in $V_1$, $p_i = 1$, and for
FIGURE 22 Distance between clusters.
FIGURE 23 p potential and q potential of each module.
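Obj: $\min \sum_{e_{ij} \in E} c_{ij} w_{ij} + \sum_{e_{ji} \in E} c_{ji} u_{ij}$ (61)
subject to
$w_{ij} - p_i + p_j \ge 0$ (62)
$u_{ij} - q_i + q_j \ge 0$ (63)
$q_i - p_i \ge 0, \quad \forall v_i \in V,\ v_i \ne v_s, v_t$ (64)
$p_s = 1$ (65)
$q_s = 1$ (66)
$p_t = 0$ (67)
$q_t = 0$ (68)
$w_{ij}, u_{ij} \ge 0, \quad \forall\, 1 \le i, j \le |V|$ (69)
To minimize objective function (61), the equality of constraint (62) holds, i.e., $w_{ij} = p_i - p_j$, if $p_i \ge p_j$, otherwise, $w_{ij} = 0$. Similarly, constraint (63) requires $u_{ij} = q_i - q_j$ if $q_i \ge q_j$, otherwise $u_{ij} = 0$. Expression (64) demands potential $q_i$ be not less than potential $p_i$ for any module $v_i \in V$. Since high potential $p_i$ corresponds to set $V_1$, and high potential $q_i$ corresponds to set $\bar{V}_2$, inequality (64) enforces $V_1$ be a subset of $\bar{V}_2$. Consequently, the requirement that $V_1 \cap V_2 = \emptyset$ is satisfied.
$x_{ij} \le c_{ij}, \quad \forall\, 1 \le i, j \le |V|$ (71)
$x'_{ij} \le c_{ji}, \quad \forall\, 1 \le i, j \le |V|$ (72)
$\sum_{j=1}^{|V|} (-x_{ij} + x_{ji}) - \lambda_i = 0$ (73)
$\sum_{j=1}^{|V|} (-x'_{ij} + x'_{ji}) + \lambda_i = 0, \quad \forall v_i \in V,\ v_i \ne v_s, v_t$ (74)
$\sum_{j=1}^{|V|} (-x_{sj} + x_{js}) + a_s = 0$ (75)
$\sum_{j=1}^{|V|} (-x_{tj} + x_{jt}) + a_t = 0$ (76)
$\sum_{j=1}^{|V|} (-x'_{sj} + x'_{js}) + b_s = 0$ (77)
$\sum_{j=1}^{|V|} (-x'_{tj} + x'_{jt}) + b_t = 0$ (78)
$\lambda_i, x_{ij}, x'_{ij} \ge 0$ (79)
$a_s, a_t, b_s, b_t$ unrestricted (80)
where inequalities (71), (72) are derived with respect to each $w_{ij}$ and $u_{ij}$ respectively. Similarly,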
Eqs. (73)-(78) are derived with respect to each $p_i$, $q_i$, $p_s$, $p_t$, $q_s$ and $q_t$. The equality of Eqs. (73)-(78) holds because $p_i$, $q_i$, $p_s$, $p_t$, $q_s$ and $q_t$ are not restricted on sign in the primal formulation. Variables $\lambda_i$, $x_{ij}$, and $x'_{ij}$ are positive in Eq. (79) because their corresponding expressions (62)-(64) are inequality constraints.
We can view $G(V, E)$ as a network flow problem and interpret $c_{ij}$ as the flow capacity, $x_{ij}$ as the flow of net $e_{ij}$. Constraint (71) requires that the flow $x_{ij}$ be not larger than the flow capacity $c_{ij}$ on each net $e_{ij}$. In constraint (72), the set of nets are in a reversed direction and flow $x'_{ij}$ is not larger than the capacity $c_{ji}$ of net $e_{ji}$ in $E$. Corresponding to $G(V, E)$, we use $G'(V', E')$ to denote the reversed graph.
Constraint (73) has the total flow $x_{ij}$ injected from module $v_i$ into $G$ be equal to $-\lambda_i$. On the other hand, constraint (74) has the total flow $x'_{ij}$ injected from module $v'_i$ into $G'$ be equal to $\lambda_i$. Suppose we combine Eqs. (73) and (74), we have
$\sum_j (-x_{ij} + x_{ji}) = \lambda_i = \sum_j (x'_{ij} - x'_{ji})$. (81)
This means that the amount of flow $\lambda_i$ which emanates from module $v_i$ in $G$ enters its corresponding module $v'_i$ in $G'$.
Constraints (75)-(78) indicate that $a_s$ and $b_s$ are the flow injections to module $v_s$ in $G$ and its reversed circuit $G'$; $a_t$ and $b_t$ are the flow ejections from module $v_t$ in $G$ and its reversed circuit $G'$, respectively. Combining circuit $G$ and $G'$ together, we have the maximum total flow, $a_s + b_s$, be the optimum solution of the minimum replication cut problem.
5.4.4. The Optimum Partition
In this subsection, we describe the construction of replication graph and take an example to describe it. We then apply the maximum flow algorithm on the constructed replication graph to derive an optimum replication cut. The optimality of the derived replication cut is proved by using a network flow approach.
Construction of Replication Graph Given a circuit $G(V, E)$ and modules $v_s$ and $v_t$, we construct another circuit $G'(V', E')$ where $|V'| = |V|$ with each module $v'_i$ in $V'$ corresponding to a module $v_i$ in $V$, and $|E'| = |E|$ with each directed net $e'_{ij}$ in $E'$ in the reverse direction of net $e_{ij}$ in $E$. We create super modules $v_s^*$ and $v_t^*$ and nets $(v_s^*, v_s)$, $(v_s^*, v'_s)$, $(v_t, v_t^*)$, and $(v'_t, v_t^*)$ with infinite capacity as shown in Figure 24. From every module $v_i$ in $V$ except $v_s$
FIGURE 24 The replication graph G*.
and $v_t$, we add a directed net of infinite capacity to the corresponding module $v'_i$ in $V'$. We refer to the combined circuit as G*.
Polynomial-time Algorithm The optimum replication cut problem with respect to module pair $v_s$ and $v_t$ and without size constraints can be solved by a maximum-flow minimum-cut solution of the circuit G* with $v_s^*$ as the source and $v_t^*$ as the sink of the flow (Fig. 24). Suppose the maximum-flow minimum-cut finds partition $(X, \bar{X})$ of $V$ with $v_s \in X$ and $v_t \in \bar{X}$ and partition $(X', \bar{X}')$ of $V'$ with $v'_s \in X'$ and $v'_t \in \bar{X}'$. Then a replication cut $(V_1, V_2)$ of the original circuit with $V_1 = X$, $V_2 = \{v_i \mid v'_i \in \bar{X}'\}$ and $R = V - V_1 - V_2$ is an optimum solution. Note that $V_2$ is derived from the cut in vertex set $V'$. To simplify the notation, we shall use $(X, \bar{X}')$ to denote the derived replication cut of $G$.
Example Given a circuit in Figure 25, its replication graph G* is constructed as shown in Figure 26. The maximum-flow minimum-cut of G* derives $(X, \bar{X}) = (\{v_s, v_a\}, \{v_b, v_c, v_t\})$ and $(X', \bar{X}') = (\{v'_s, v'_a, v'_b, v'_c\}, \{v'_t\})$ with a flow amount, 5 (Fig. 26). Thus the sets $V_1 = \{v_s, v_a\}$ and $V_2 = \{v_t\}$ define an optimum replication cut $R(V_1, V_2)$ with $R = \{v_b, v_c\}$ and a cut cost equal to 5 (Fig. 27).
FIGURE 25 A five module circuit to demonstrate the replication cut.
FIGURE 26 The constructed replication graph of the circuit shown in Figure 25.
The network flow approach leads to the optimality of the solution as stated in the following theorem.
THEOREM 5.3 The replication cut $R(X, \bar{X}')$ derived from the transformed circuit G* generates the minimum replication cut count $C_R(X, \bar{X}')$ (expression (19)).
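The construction of G* and the extraction of $(V_1, V_2, R)$ can be sketched as follows. This is a minimal illustration, not the authors' implementation: the max-flow routine is a plain Edmonds-Karp search chosen for brevity, the node labels and function names are hypothetical, and the three module circuit is an invented example rather than the circuit of Figure 25.

```python
from collections import deque

INF = float("inf")

def max_flow_min_cut(cap, s, t):
    """Edmonds-Karp on cap[u][v]; returns (flow value, source side of a min cut)."""
    res = {}
    for u, nbrs in cap.items():
        for v, c in nbrs.items():
            res.setdefault(u, {})
            res.setdefault(v, {})
            res[u][v] = res[u].get(v, 0) + c
            res[v].setdefault(u, 0)              # reverse residual edge
    flow = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:         # breadth-first search
            u = queue.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow, set(parent)             # nodes still reachable from s
        path, v = [], t
        while parent[v] is not None:             # walk back to the source
            path.append((parent[v], v))
            v = parent[v]
        push = min(res[u][v] for u, v in path)
        for u, v in path:
            res[u][v] -= push
            res[v][u] += push
        flow += push

def replication_cut(edges, s, t):
    """edges maps each directed net (u, v) of G to its capacity."""
    nodes = {u for e in edges for u in e}
    cap = {}
    def add(u, v, c):
        cap.setdefault(u, {})[v] = c
    for (u, v), c in edges.items():
        add(("G", u), ("G", v), c)               # net of G
        add(("G'", v), ("G'", u), c)             # reversed net of G'
    for v in nodes - {s, t}:
        add(("G", v), ("G'", v), INF)            # v_i -> v_i'
    add("s*", ("G", s), INF)
    add("s*", ("G'", s), INF)
    add(("G", t), "t*", INF)
    add(("G'", t), "t*", INF)
    value, reach = max_flow_min_cut(cap, "s*", "t*")
    v1 = {v for v in nodes if ("G", v) in reach}         # V1 = X
    v2 = {v for v in nodes if ("G'", v) not in reach}    # V2 from the cut in V'
    return value, v1, v2, nodes - v1 - v2

# Hypothetical circuit: module "a" has cheap inputs and expensive outputs.
edges = {("s", "a"): 1, ("a", "s"): 5, ("a", "t"): 5, ("t", "a"): 1, ("s", "t"): 1}
value, v1, v2, r = replication_cut(edges, "s", "t")
```

In this small example, replicating module "a" gives a cut of 3, whereas assigning it to either side alone costs 7.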
5.4.5. Heuristic Flow Algorithms
We introduce the heuristic approaches that accelerate the flow calculation and take advantage of the optimality properties of the flow methods. We first introduce an approach that utilizes the maximum flow minimum cut method for the min cut with size constraints. We then explain a shortest path method for multiple commodity flow calculation.
(i) Usage of Maximum Flow Minimum Cut We adopt a heuristic approach [113] to get around the unbalanced partition of the maximum flow and minimum cut method. First, we find two seeds as the source and the sink modules, $v_s$, $v_t$. We then use the maximum flow and minimum cut method to find partition $(V_1, V_2)$ with $v_s \in V_1$ and $v_t \in V_2$. Suppose the size $S(V_1)$ of $V_1$ is larger than the size $S(V_2)$ of $V_2$, we find from $V_1$ a module $v_i$ to merge with $V_2$ and shrink set $V_2$ as a new sink module. Otherwise, we find from $V_2$ a module $v_i$ to merge with $V_1$ and shrink set $V_1$ as a new source module. We repeat the maximum flow minimum cut process on the graph with new source or sink module until the size of the partition fits the size constraint.
Two Way Partitioning using Maximum Flow Minimum Cut
1. Find two seeds as $v_s$ and $v_t$.
2. Call Maximum Flow Minimum Cut to find partition $(V_1, V_2)$.
3. If $S(V_1) > S(V_2)$, find a seed $v_i \in V_1$, merge $\{v_i\} \cup V_2$ into a new sink module $v_t$.
4. Else find a seed $v_i \in V_2$, merge $\{v_i\} \cup V_1$ into a new source module $v_s$.
5. Repeat Steps 1-4, until $S_l \le S(V_1) \le S_u$ and $S_l \le S(V_2) \le S_u$.
We can apply the parametric flow approach to the repeated maximum flow minimum cut problems (Step 2). The total complexity is then equivalent to that of a single maximum flow minimum cut.
The seeds are chosen according to their connectivity to the vertex set on the other side. The result is sensitive to the choice of the seeds. We can make multiple trials and choose the best results. Other methods such as the programming approach can serve as a guideline on the choice of the seeds [79, 80]. The method has been shown to derive excellent results with reasonable running time.
(ii) Approximation of Multiple Commodity Flow Based on the multicommodity flow formulation [103], we try to solve a multiple way partitioning by deriving approximate multiple commodity flow with a stochastic process [13, 55, 114, 117].
Given a circuit $H(V, E)$, the flow increment $\Delta$, and the distance coefficient $\alpha$, the algorithm starts with procedure Saturate-Network to saturate the circuit with flows. A stochastic flow injection algorithm is adopted to reduce the computational complexity. Then, Select-Cut is activated to select a set of nets by the flow values to constitute a cut. The conversion from weighted ratio cut to cluster ratio cut is performed by a Select-Cut routine which selects the subset of the cut derived from Saturate-Network with a greedy approach.
Multiple Commodity Flow Approximation $(H, \Delta, \alpha)$
1. Iterate the following procedures
1.1. Saturate-Network $(H, \Delta, \alpha)$.
1.2. Select-Cut $(H)$
until the clustering results are satisfactory.
2. Output clustering result.
Procedure Saturate-Network $(H, \Delta, \alpha)$
1. Set the distance of each net $e$ to be one.
2. While ($H$ is connected) do Steps 2.1 to 2.3.
2.1. Randomly pick two distinct modules $v_s$ and $v_t$.
2.2. Find the shortest path between $v_s$ and $v_t$.
2.3. For each net $e$ on the shortest path, let $f(e)$ and $d_e$ be the flow and distance of net $e$.
2.3.1. If $e$ is not saturated, increase $f(e)$ by $\Delta$ and set $d_e = \exp((\alpha \times f(e))/c_e)$.
2.3.2. If $e$ is saturated, set $d_e$ to be $\infty$.
3. Output $E$ with flow information.
The initial distance of each net is one since there is no flow being injected (see the distance formulation in Step 2.3.1). Step 2.1 uses a random process with even distribution over all modules to pick two distinct modules, and Steps 2.2-2.3 inject $\Delta$ amount of flows along the shortest path between the modules. In Steps 2.3.1-2.3.2, the distances of the nets whose flow has been increased are recomputed using an exponential function $d_e = \exp((\alpha \times f(e))/c_e)$ to penalize the congested nets, where $d_e$ and $f(e)$ are the distance and flow of net $e$, respectively. Steps 2.1-2.3 are iteratively executed until a pair of modules are chosen where all possible paths between them are saturated by flows. These saturated nets identify a partition of the circuit.
Figure 28 shows a sample circuit saturated by flows after executing Saturate-Network with $\Delta = 0.01$ and $\alpha = 10$. The flow values are shown by the numbers right beside each net. The dashed lines indicate the cut lines along the set of saturated nets to form the three clusters. These saturated nets define an approximate weighted cluster ratio cut, which is a potential set of nets for a selection of cluster ratio cut.
5.5. Programming Approaches
For programming approaches [7, 18, 35, 41, 44, 46], we adopt two way minimum cut with size constraints as the target problem. We assume that the nets are two pin nets and thus, the circuit can be described as a graph $G(V, E)$. We also assume the modules are of unit size, i.e., $s_i = 1$.
The two way partition $(V_1, V_2)$ is represented by a linear placement with only two slots at coordinates $-1$ and $1$. For an even sized partition, half of the modules are assigned to each slot. Let $x_i$ denote the coordinate of module $v_i$. If $v_i \in V_1$, $x_i = -1$, else $x_i = 1$ for $v_i \in V_2$. The cut count can be expressed as follows.
$C(V_1, V_2) = \frac{1}{4} X^T B X$, (82)
where $X$ is a vector of $x_i$, and $X^T$ is the transpose of vector $X$. Matrix $B$ has its entry $b_{ij} = -c_{ij}$ if
[Figure 28: sample circuit saturated by flows; legend: node, net, cut line; numbers beside nets are flow values]
$\mathbf{1}^T X = 0$, (83)
$X^T X = |V|$. (84)
Matrix $B$ is symmetric and diagonally semidominant. Thus, it is positive semidefinite, i.e., all eigenvalues are nonnegative. And its eigenvectors are orthogonal. Let us order its eigenvalues from small to large, i.e., $\lambda_0 \le \lambda_1 \le \cdots \le \lambda_{|V|-1}$. The smallest eigenvalue $\lambda_0 = 0$ with its eigenvector $X_0 = \mathbf{1}$. The second eigenvalue $\lambda_1$ is nonnegative with its
C(Vl, V2)
where matrix
ii
--
To push for a higher lower bound, we can
adjust the diagonal term of matrix B by adding
di. Let
C(Vl, V2)
4 2----" di
l<i<lvl
x-x
has its entry
bii + di. Either xi
l<i<lvl
di x X2i
i- b if C j, else
1, the last two
or xi
terms cancel each other. The modification thus
does not alter the optimal partition solution.
The new nonlinear programming problem is to
find the assignment of d to maximize the objec-
(86)
--
tor X is an optimal solution to objective function tive function [11]:
(82) with constraints (83) [46]. Since X-rX=IV
Eq. (84) the solution
4
l<i<lvI
X-[BX1 /1 ?
X X ,,l x IVl, (85)
where /1 is the second smallest eigenvalue of
which is a lower bound of the min-cut problem. matrix l. The solution is an upper bound of the
partition. It is larger than $\lambda_1$ in the sense that $\lambda_1$ can serve as an initial feasible solution to maximize expression (87).
Remarks The programming approach finds a global view of the problem [9, 79, 80, 118]. However, the formulation is very restricted. The extension to multiple pin nets and the incorporation of fixed modules will destroy the nice structure based on which we have the eigenvalue and eigenvector as optimal solutions. Therefore, it is difficult to utilize the approach recursively.
For a general case, we can view the problem as nonlinear programming with Boolean quadratic objective function. Nonlinear programming techniques are adopted to derive the results [16, 107].
Lagrange multiplier is one useful tool for performance optimization. In this section, we demonstrate the usage of Lagrange multiplier for performance driven partitioning. The problem is to optimize the performance of a two-way partition $(V_1, V_2)$ with retiming [86].
We first introduce a vector of binary variables to represent a partition. The performance-driven partitioning problem is thus represented by a Boolean quadratic programming formulation with nonlinear constraints. We then absorb the nonlinear constraints into the objective function as a Lagrangian. We use primal and dual subproblems to decompose the Lagrangian and derive the
size. The two-way partition is described by a vector $x = (x_{1,1}, \ldots, x_{1,n}, x_{2,1}, \ldots, x_{2,n})$, where $x_{b,i}$ is 1 if module $v_i$ is assigned to vertex set $V_b$, otherwise $x_{b,i}$ is 0. If modules $v_i$ and $v_j$ are in different vertex sets, the value of the term $x_{1,i} x_{2,j} + x_{2,i} x_{1,j}$ is equal to 1. This contributes one interpartition delay $\delta$ into the delay of the net $e_{ij}$. Let $g_l(x)$ denote the delay to register ratio of loop $l$. Delay ratio $g_l(x)$ can be written as the following formula:
$g_l(x) = \frac{d_l + \sum_{e_{ij} \in l} \delta \times (x_{1,i} x_{2,j} + x_{2,i} x_{1,j})}{r_l}$ (88)
Given a path $p$, the total delay $h_p(x)$ of $p$ is as follows:
$h_p(x) = d_p + \sum_{e_{ij} \in p} \delta \times (x_{1,i} x_{2,j} + x_{2,i} x_{1,j})$ (89)
$\min \sum_{e_{ij} \in E} c_{ij} (x_{1,i} x_{2,j} + x_{2,i} x_{1,j})$, (90)
subject to the following constraints:
C1 (Size Constraints)
$\sum_{i=1}^{|V|} x_{b,i} s_i \le S_b, \quad \forall b \in \{1, 2\}$. (91)
C2 (Variable Assignment Constraints)
$\sum_{b=1}^{2} x_{b,i} = 1, \quad \forall v_i \in V$. (92)
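The quantities in (88) to (90) and the assignment constraint (92) can be evaluated directly from the binary vector, as the sketch below shows. It is illustrative only: the helper names, the layout of the vector as two rows (row 0 for $V_1$, row 1 for $V_2$), and the four module example are assumptions made for the illustration.

```python
def crossing(x, i, j):
    """x[b][i] is 1 when module i is in side b; returns 1 if net (i, j) is cut."""
    return x[0][i] * x[1][j] + x[1][i] * x[0][j]

def cut_cost(nets, x):
    """Objective (90): total weight of the nets crossing the partition."""
    return sum(c * crossing(x, i, j) for (i, j), c in nets.items())

def loop_ratio(loop, d_l, r_l, delta, x):
    """Expression (88): delay to register ratio of a loop given as a list of nets."""
    return (d_l + sum(delta * crossing(x, i, j) for i, j in loop)) / r_l

def path_delay(path, d_p, delta, x):
    """Expression (89): total delay of a path given as a list of nets."""
    return d_p + sum(delta * crossing(x, i, j) for i, j in path)

def satisfies_c2(x):
    """Constraint (92): every module is assigned to exactly one side."""
    return all(x[0][i] + x[1][i] == 1 for i in range(len(x[0])))

# Hypothetical example: modules 0 and 1 in V1, modules 2 and 3 in V2.
x = [[1, 1, 0, 0], [0, 0, 1, 1]]
nets = {(0, 1): 2, (1, 2): 3, (2, 3): 1, (3, 0): 4}
loop = [(0, 1), (1, 2), (2, 3), (3, 0)]
cost = cut_cost(nets, x)                       # nets (1, 2) and (3, 0) are cut
ratio = loop_ratio(loop, 10, 2, 3, x)          # two crossings on the loop
delay = path_delay([(0, 1), (1, 2)], 5, 3, x)  # one crossing on the path
```

Each crossing adds one interpartition delay, so the loop ratio and the path delay both grow linearly in the number of cut nets they traverse.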
LEMMA Given a number $J$, if $g_l(x)$ is less than or equal to $J$ for every simple loop $l$, then $g_l(x)$ is less than or equal to $J$ for all loops $l$.

Let $\pi_c$ and $\pi_p$ represent the number of the simple loops and the number of I/O-critical paths, respectively. Let $\lambda$ denote the vector $(\lambda_{g_1}, \ldots, \lambda_{g_{\pi_c}}, \lambda_{h_1}, \ldots, \lambda_{h_{\pi_p}})$. Using Lagrangian Relaxation [104], we absorb the constraints (93) and (94) into the objective function (90). The Lagrangian-relaxed problem is as follows.

$$\max_{\lambda \ge 0} \min_{x} L(x, \lambda) \tag{95}$$

subject to constraints C1 and C2, where

$$L(x, \lambda) = \sum_{e_{ij} \in E} c_{ij}(x_{1,i}x_{2,j} + x_{2,i}x_{1,j}) + \sum_{\forall\, \text{simple loop } l} \lambda_{g_l}\,(g_l(x) - J) + \sum_{\forall\, \text{I/O-critical path } p} \lambda_{h_p}\,(h_p(x) - \ldots) \tag{96}$$

(i) The Dual Problem Given vector $x$, we can represent (96) as a function of variable $\lambda$, i.e., $L_x(\lambda)$. Thus, the dual problem can be written as:

$$\max_{\lambda \ge 0} L_x(\lambda) \tag{97}$$

(ii) The Primal Problem Let $F_{ij}$ and $Q_{ij}$ denote the sets of the simple loops and I/O-critical paths passing the net $e_{ij}$. The cost $a_{ij}$ of net $e_{ij}$ is composed of connectivity $c_{ij}$ and the penalty of the timing constraints.

$$a_{ij} = c_{ij} + \sum_{l \in F_{ij}} \frac{\delta}{r_l} \lambda_{g_l} + \sum_{p \in Q_{ij}} \delta \lambda_{h_p} \tag{98}$$

... subject to C1 and C2, where $\beta$ represents the constant contributed by $\lambda$.

5.6.2. Subgradient Method using Cycle Mean Method

We solve the partitioning problem through primal and dual iterations on the Lagrangian. Quadratic Boolean Programming (QBP) [16] is used to solve the primal problem and generate a solution $x$ (Step 2).

For the dual problem based on $x$, we select the set of loops and paths that violate the timing constraints as active loops and paths. The nets contained in the active loops or paths are termed active nets.

Active Loops and Paths Given a solution $x$, a loop $l$ is called active if $g_l(x)$ is not less than $J$. A path $p$ is called active if $h_p(x)$ is not less than its given limit.

Active Nets Given a net $e$, we define $e$ to be an active net if net $e$ is covered by an active loop or an active path.

We call a minimum cycle mean algorithm [57] and an all-pairs shortest-paths algorithm to mark all the nets on active loops and paths, respectively (Step 3). For every net $e_{ij}$ on active paths, we record $q_{ij}$: the maximum path delay among all paths passing through $e_{ij}$. For every net $e_{ij}$ on active loops, we record $p_{ij}$: the maximum delay-to-register ratio among all loops passing through $e_{ij}$. We then calculate the subgradient on the marked nets and update the constants $a_{ij}$ for the next primal dual iteration (Steps 4-5). We increase the costs of active nets using the subgradient approach [104]. The iteration proceeds until the bounds of all loops and paths are within the given limits.
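The marking and cost update described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: loops and paths are assumed to be enumerated explicitly with their $g_l(x)$ and $h_p(x)$ values already computed (the text instead uses a minimum cycle mean algorithm and an all-pairs shortest-paths algorithm), and the fixed `step` parameter and the function name `update_net_costs` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Net = Tuple[int, int]
Timed = Tuple[Sequence[Net], float]  # (nets on the loop or path, its g or h value)


def update_net_costs(
    a: Dict[Net, float],
    loops: List[Timed],
    paths: List[Timed],
    loop_bound: float,
    path_bound: float,
    step: float,
) -> Tuple[Dict[Net, float], Dict[Net, float], Dict[Net, float]]:
    """Mark active nets, record p_ij and q_ij, and raise the costs of active nets."""
    p: Dict[Net, float] = {}  # max delay-to-register ratio over active loops per net
    q: Dict[Net, float] = {}  # max path delay over active paths per net
    for nets, g in loops:
        if g >= loop_bound:  # active loop: g_l(x) is not less than the bound
            for e in nets:
                p[e] = max(p.get(e, g), g)
    for nets, h in paths:
        if h >= path_bound:  # active path: h_p(x) is not less than the limit
            for e in nets:
                q[e] = max(q.get(e, h), h)
    new_a = dict(a)
    # Assumption: a net on both an active loop and an active path gets both increments.
    for e, ratio in p.items():
        new_a[e] = new_a.get(e, 0.0) + step * ratio
    for e, delay in q.items():
        new_a[e] = new_a.get(e, 0.0) + step * delay
    return new_a, p, q


# Hypothetical data: one violating loop, one satisfied loop, one violating path.
a0 = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 1.0}
loops = [([(0, 1), (1, 2)], 5.0), ([(1, 2), (2, 3)], 3.0)]
paths = [([(2, 3)], 7.0)]
a1, p, q = update_net_costs(a0, loops, paths, loop_bound=4.0, path_bound=6.0, step=0.5)
```

Only nets on violating loops or paths become more expensive, so the next primal solve is discouraged from cutting them.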
3. Calculate the iteration and latency bounds of the partition $(V_1^{(k)}, V_2^{(k)})$, respectively. Stop if timing constraints are satisfied. Otherwise, revise $p_{ij}$ and $q_{ij}$ for all nets $e_{ij}$.
4. Compute

$$t_k = \frac{C(V_1^{(k)}, V_2^{(k)})}{\sum_{e_{ij} \in E} (p_{ij})^2 + \sum_{e_{ij} \in E} (q_{ij})^2}$$

5. Revise shadow price $a_{ij}$ for all nets $e_{ij} \in E$: $a_{ij}^{(k+1)} = a_{ij}^{(k)}$;
if net $e_{ij}$ is in an active loop, then $a_{ij}^{(k+1)} = a_{ij}^{(k)} + t_k\, p_{ij}$;
if net $e_{ij}$ is in an active path, then $a_{ij}^{(k+1)} = a_{ij}^{(k)} + t_k\, q_{ij}$.
6. While $k \le$ MaxNumIter, set $k = k + 1$ and go to Step 2.

5.7. Clustering Heuristics

We first discuss the usage of clustering heuristics. We then discuss top down clustering and bottom up clustering approaches. Finally, we discuss some variations of clustering metrics.

5.7.1. Usage of Clustering Heuristics

The usage of clustering heuristics plays an important role in determining the quality of the final results. In the following, we discuss the issue in different topics. We use a two-way partitioning with size constraints as the target problem.

1. Top Down Clustering versus Bottom Up Clustering: Top down clustering approach provides a global view of the solution. The operations are consistent with the target problem. However, it is more time consuming because the clustering operates on the whole circuit [29]. Bottom up clustering is efficient. However, because the process operates locally, the target solution is sensitive to the clustering heuristics [59].
2. The Level of the Clustering: Suppose we represent the clustering results with a hierarchical tree structure. Let the root correspond to the whole circuit, the leaves correspond to the smallest clusters, and the internal nodes correspond to the intermediate clusters. Hence, the size of the clusters grows with the level of the nodes. Top down clustering creates clusters corresponding to nodes in high levels, while bottom up clustering creates clusters corresponding to nodes in low levels.
For example, in [60], Kernighan and Lin proposed a top down clustering approach, which divides the whole circuit into four clusters only. In [59], Karypis et al. used a bottom up clustering which starts with clusters of two modules or a net. If we continue the application of bottom up clustering on intermediate clusters, the quality of the clusters degenerates as the size of the clusters grows bigger.
3. Iteration of Clustering and Unclustering: We go through the iterations of clustering and unclustering to improve the quality of the results. At each level of the hierarchical tree, we derive an intermediate target solution, e.g., a two-way partition. In unclustering, we go down the level of tree hierarchy to find an expanded circuit with more modules. In clustering, we go up the level of tree hierarchy with a circuit of a smaller number of modules. The previous partitioning result becomes the initial solution of the new partitioning problem. Note that the hierarchical tree is constructed dynamically. For each clustering, the modules can be grouped based on the current partitioning configuration.
4. The Clustering Operations and the Target Solution: The clustering operation has to be consistent with the target solution. For example, suppose the target is finding a two-way min-cut with size constraints. Then, it is natural to cluster modules based on net connectivity because the probability that a net is in an optimal cut set is small (see the subsection of min-cut with size constraints in problem formulations). Moreover, it is important that the clustering follows the current partitioning
results, i.e., only modules in the same partition are clustered.

5.7.2. Top Down Clustering Approach for Partitioning

We use an application to two-way cut with size constraints to illustrate the top down clustering approach [24, 29]. The partitioning of huge designs is complicated and the results can be erratic. Our strategy (Fig. 29) is to reduce the circuit complexity by constructing a contracted hypergraph. The clusters for the contracted hypergraph are searched via a recursive top down partitioning method. The number of modules is much reduced after we contract the clusters. Hence, a group migration approach can derive excellent two-way cut results on the contracted hypergraph with much efficiency. Furthermore, since the clusters are grouped via a top down partitioning, conceptually a minimum cut on the hypergraph can take advantage of the previous results and generate better solutions.

In this section, we describe a top down clustering algorithm. A ratio cut is adopted to perform the top down clustering process. Other partition approaches can also be used to replace the ratio cut. A group migration method is used to find a minimum cut of the contracted hypergraph with size constraint. Finally, we apply a last run of the group migration algorithm to the original circuit to fine tune the result.

Input a hypergraph $H(V, E)$, an integer $k$ for the number of expected clusters, an integer num_of_reps for repetition, and $S_l$, $S_u$ for the size constraints of two resultant subsets.

1. Initialize $\Gamma = \{V\}$ and $V^* = V$.
2. Apply ratio cut [109] to obtain a partition $(A, A')$ of $V^* = A \cup A'$.
3. Set $\Gamma = (\Gamma - \{V^*\}) \cup \{A, A'\}$. Set $V^*$ to be a vertex set in $\Gamma$ such that $S(V^*) = \max_{V_i \in \Gamma} S(V_i)$.
4. While $S(V^*) > S(V)/k$, repeat Steps 2, 3.
5. Construct a contracted hypergraph $H_\Gamma(V_\Gamma, E_\Gamma)$.
6. Apply a group migration algorithm num_of_reps times to $H_\Gamma$ with the size constraints $S_l$, $S_u$.
7. Use the best result from Step 6 as an initial partition of the circuit $H$. Apply a group migration algorithm once to $H$ with the size constraints $S_l$, $S_u$.
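Steps 1-5 of the top down clustering algorithm can be sketched as below. This is only an illustration under stated assumptions: the ratio cut of Step 2 is replaced by a trivial hypothetical splitter (`halve`) so that the example is self-contained, module names are plain strings, and the group migration passes of Steps 6-7 are omitted. The names `top_down_clusters` and `contract` are not from the source.

```python
from typing import Callable, Dict, FrozenSet, Iterable, List, Sequence, Tuple

Cluster = FrozenSet[str]


def halve(cluster: Cluster) -> Tuple[Cluster, Cluster]:
    """Hypothetical stand-in for ratio cut [109]: split a sorted cluster in half."""
    members = sorted(cluster)
    mid = len(members) // 2
    return frozenset(members[:mid]), frozenset(members[mid:])


def top_down_clusters(
    modules: Iterable[str],
    size: Dict[str, float],
    k: int,
    bipartition: Callable[[Cluster], Tuple[Cluster, Cluster]] = halve,
) -> List[Cluster]:
    """Steps 1-4: keep splitting the largest cluster until S(V*) <= S(V)/k."""

    def s(cluster: Cluster) -> float:
        return sum(size[v] for v in cluster)

    clusters: List[Cluster] = [frozenset(modules)]  # Step 1
    total = s(clusters[0])
    largest = clusters[0]
    while len(largest) > 1:                          # a single module cannot be split
        a, b = bipartition(largest)                  # Step 2
        clusters.remove(largest)                     # Step 3
        clusters.extend([a, b])
        largest = max(clusters, key=s)
        if s(largest) <= total / k:                  # Step 4 stopping test
            break
    return clusters


def contract(nets: Iterable[Sequence[str]], clusters: List[Cluster]) -> List[FrozenSet[int]]:
    """Step 5: map every net onto cluster indices and drop nets inside one cluster."""
    owner = {v: idx for idx, cluster in enumerate(clusters) for v in cluster}
    contracted = []
    for net in nets:
        pins = frozenset(owner[v] for v in net)
        if len(pins) > 1:
            contracted.append(pins)
    return contracted


# Hypothetical eight-module circuit with unit sizes and k = 4 expected clusters.
modules = list("abcdefgh")
size = {v: 1 for v in modules}
clusters = top_down_clusters(modules, size, k=4)
contracted_nets = contract([("a", "b"), ("b", "c"), ("d", "e", "h")], clusters)
```

Any bipartitioner with the same signature can be passed as `bipartition`, which mirrors the remark that other partition approaches can replace the ratio cut.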
[Figure 29: $H(V, E)$ -> Construct -> $H_\Gamma(V_\Gamma, E_\Gamma)$ -> Partition]
The choice of cluster number k It was shown [24] that the cut count versus cluster number k is a concave curve. When k is small, the quality is not as good because the cluster is too coarse. When k is large, there are too many clusters. We lose the benefit of the clustering.

For the case that the circuit is large, we may need to adopt multiple levels of clustering to push for the performance and efficiency [58, 66].

5.7.3. Bottom Up Clustering Approaches

In this section, we discuss bottom up clustering [90] with two applications: linear placement and performance driven designs. We then show two strategies to perform the clustering: maximum matching and maximum pairing. We will demonstrate via examples the advantage of maximum pairing over maximum matching.

(i) Linear Placement For linear placement, we reduce the complexity of the problem by a bottom up clustering approach [53, 96, 100]. The clustering is based on the result of a tentative placement. We adopt a heuristic approach to generate tentative placements throughout iterations. In each iteration, we cluster modules only when they are in consecutive order of the placement. We then construct a contracted hypergraph. In the next iteration, the heuristic approach generates the placement of the contracted hypergraph. For each iteration, we either grow the size of the clusters or construct new clusters adaptively.

Inspired by the property of the minimum cut separating two modules (Theorem 3.1), we use a density as a measure to find the cluster. A density $d(i)$ at a slot $i$ of a linear placement is the total connectivity of nets connecting modules on the different sides of the slot. The following algorithm describes the clustering using a given placement. Each cluster size is between L and U.

Input placement P, two parameters L and U.

1. Initialize cluster boundary at slot $p = 1$.
2. Scan placement P from slot $p$ toward the right end. Find slot $i$ such that $p + L \le i \le p + U$ and density $d(i)$ is minimum among $d(p + L), \ldots, d(p + U)$.
3. Cluster modules between slots $p$ and $i$. Set $p = i + 1$.
4. Repeat Steps 2, 3 until the scan reaches the right end.

Remark The proposed clustering process and the criteria are consistent with the target linear placement application. The whole process depends on an efficient and effective linear placement.

(ii) Performance Driven Clustering For performance driven clustering [31, 112], nets which contribute to the longest delay are termed critical nets. Pins of the critical net are merged to form clusters.

For a special case that the circuit is a directed tree, we can find an optimal solution in polynomial time. Let us assume the tree has its leaves at the input and its root at the output. We use a dynamic programming approach to trace from the leaves toward the root. Each module is not traced until all its input modules are processed. For each module, we treat it as a root of a subtree and find the optimal clustering of the subtree. Since all the modules in the subtree except its root have been processed, we can derive an optimal solution of the root in polynomial time.

(iii) Maximum Matching The maximum matching pairs all modules into $|V|/2$ groups simultaneously. Given a measurement of pairing modules, we can find a matching that maximizes the total pairing measurement in polynomial time.

We can call maximum matching recursively to create clusters of equal sizes. However, this strategy may enforce unrelated pairs to merge. The enforcement will sacrifice the quality of final clustering results.

Example Figure 30 illustrates the clustering behavior of maximum matching. The circuit contains twelve modules of equal size. The first level maximum matching pairs modules (a, b), (d, e), (g, h), (j, k), (c, l), and (f, i). Modules in the first four pairs are strongly connected with their
partners. However, the last two are not. Modules c and l have no common nets but are merged because their choices are taken by others.

Furthermore, as we proceed to the next level maximum matching, the merge of pairs (c, l) and (f, i) will enforce grouping modules into cluster {a, b, c, j, k, l} and cluster {d, e, f, g, h, i}. If we measure the quality of the results with cluster cost (expression (26)), the cost of the two clusters is 4/12 + 4/12 = 2/3. For this case, we can find a better solution of clusters {a, b, c, d, e, f} and {g, h, i, j, k, l} of which the cluster cost is equal to zero.

Figure 31 shows another example of twelve modules with connectivities attached to the nets. The connectivity is 1 if not specified. Figure 31(a) shows an optimum cut with cut count 6.6. If a maximum matching [61] criterion is adopted in the bottom up clustering approach, then modules with a net of weight 1.1 between them will be merged. A minimum cut on the merged modules yields a cut count of 18 (Fig. 31(b)). In general, a 2n module circuit having a symmetric configuration as in Figure 31 will have a cut count of $n^2/2$ if the maximum matching criterion is applied to perform the clustering, while the optimum solution will have a cut weight of $1.1 \times n$. From this extreme case, we can claim the following theorem:

THEOREM 5.4 There is no constant factor of error bound of the cut count generated by the maximum matching approach, from the cut count of a minimum cut.

Proof As shown in the above example, the factor of error bound is $(n^2/2)/(1.1 \times n) = n/2.2$, which is not a constant. Q.E.D.

FIGURE 31 A twelve module example to demonstrate maximum matching: (a) cut weight 6.6; (b) cut weight 18.

(iv) Maximum Pairing The maximum pairing is similar to maximum matching, except that it does not enforce the matching of all modules. Only the top q percent of the modules are paired. Thus, we can avoid the enforced pairing of unrelated modules.

However, this strategy may cause certain modules to keep on growing and produce very uneven cluster results. Thus, we need to choose a proper cost function that discourages unlimited growth of the cluster size, e.g., cost function (26).
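The difference between pairing only the top q percent and forcing every module into a pair can be seen in a small sketch. The text does not specify how the top q percent are selected, so the greedy selection by descending pairing measurement used here is an assumption, as are the function name `max_pairing` and the sample weights.

```python
from typing import Dict, List, Sequence, Set, Tuple

Pair = Tuple[str, str]


def max_pairing(weights: Dict[Pair, float], modules: Sequence[str], q: float) -> List[Pair]:
    """Greedily pair modules by descending weight until q percent of them are paired."""
    limit = int(len(modules) * q / 100)  # maximum number of modules allowed in pairs
    paired: Set[str] = set()
    pairs: List[Pair] = []
    for (u, v), _ in sorted(weights.items(), key=lambda item: -item[1]):
        if len(paired) + 2 > limit:
            break                        # the quota of paired modules is used up
        if u in paired or v in paired:
            continue                     # each module joins at most one pair
        pairs.append((u, v))
        paired.update((u, v))
    return pairs


# Hypothetical six-module circuit: e and f are only weakly connected.
modules = list("abcdef")
weights = {("a", "b"): 5.0, ("c", "d"): 4.0, ("b", "c"): 3.0, ("e", "f"): 0.1}
partial = max_pairing(weights, modules, q=70)   # leaves e and f unpaired
forced = max_pairing(weights, modules, q=100)   # behaves like a full matching here
```

With q = 100 the weak pair (e, f) is merged anyway, which is the enforced pairing of unrelated modules that maximum pairing is meant to avoid.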
[4] Alpert, C. J., Huang, J. H. and Kahng, A. B., "Multilevel circuit partitioning", In: Proc. ACM/IEEE Design Automation Conf., June, 1997, pp. 530-533.
[5] Alpert, C. J. and Kahng, A. B., "Recent directions in netlist partitioning: a survey", Integration: The VLSI J., 19(1), 1-81, August, 1995.
[6] Alpert, C. J. and Kahng, A. B., "A general framework for vertex orderings with applications to circuit clustering", IEEE Trans. VLSI Syst., 4(2), 240-246, June, 1996.
[7] Alpert, C. J. and Yao, S. Z., "Spectral partitioning: the more eigenvectors, the better", In: Proc. ACM/IEEE Design Automation Conf., June, 1995, pp. 195-200.
[8] Bakoglu, H. B., Circuits, Interconnections, and Packaging for VLSI, MA: Addison-Wesley, 1990.
[9] Blanks, J. (1989). "Partitioning by Probability Condensation", ACM/IEEE 26th Design Automation Conf., pp. 758-761.
[10] Bollobas, B. (1985). Random Graphs, Academic Press Inc., pp. 31-53.
[11] Boppana, R. B. (1987). "Eigenvalues and Graph Bisection: An Average Case Analysis", Annual Symp. on Foundations in Computer Science, pp. 280-285.
[12] Breuer, M. A., Design Automation of Digital Systems, Prentice-Hall, NY, 1972.
[13] Bui, T., Chaudhuri, S., Jones, C., Leighton, T. and Sipser, M. (1987). "Graph bisection algorithms with good average case behavior", Combinatorica, 7(2), 171-191.
[14] Bui, T., Heigham, C., Jones, C. and Leighton, T., "Improving the performance of the Kernighan-Lin and simulated annealing graph bisection algorithms", In: Proc. ACM/IEEE Design Automation Conf., June, 1989, pp. 775-778.
[15] Buntine, W. L., Su, L., Newton, A. R. and Mayer, A., "Adaptive methods for netlist partitioning", In: Proc. IEEE Int. Conf. Computer-Aided Design, November, 1997, pp. 356-363.
[16] Burkard, R. E. and Bonniger, T. (1983). "A Heuristic for Quadratic Boolean Programs with Applications to Quadratic Assignment Problems", European Journal of Operational Research, 13, 372-386.
[17] Camposano, R. and Brayton, R. K. (1987). "Partitioning Before Logic Synthesis", Int. Conf. on Computer-Aided Design, pp. 324-326.
[18] Chan, P. K., Schlag, D. F. and Zien, J. Y., "Spectral k-way ratio-cut partitioning and clustering", IEEE Trans. Computer-Aided Design, 13(9), 1088-1096, September, 1994.
[19] Charney, H. R. and Plato, D. L., "Efficient Partitioning of Components", IEEE Design Automation Workshop, July, 1968, pp. 16.0-16.21.
[20] Chatterjee, A. C. and Hartley, R., "A new Simultaneous Circuit Partitioning and Chip Placement Approach based on Simulated Annealing", In: Proc. ACM/IEEE Design Automation Conf., June, 1990, pp. 36-39.
[21] Cheng, C. K. and Kuh, E. S., "Module Placement Based on Resistive Network Optimization", IEEE Trans. on Computer-Aided Design, CAD-3, 218-225, July, 1984.
[22] Cheng, C. K., "Linear Placement Algorithms and Applications to VLSI Design", Networks, 17, 439-464, Winter, 1987.
[23] Cheng, C. K. and Hu, T. C., "Ancestor Tree for Arbitrary Multi-Terminal Cut Functions", Proc. Integer Programming/Combinatorial Optimization Conf., Univ. of Waterloo, May, 1990, pp. 115-127.
[24] Cheng, C. K. and Wei, Y. C. (1991). "An Improved Two-Way Partitioning Algorithm with Stable Performance", IEEE Trans. on Computer Aided Design, 10(12), 1502-1511.
[25] Cheng, C. K. (1992). "The Optimal Partitioning of Networks", Networks, 22, 297-315.
[26] Cherng, J. S. and Chen, S. J., "A Stable Partitioning Algorithm for VLSI Circuits", In: Proc. IEEE Custom Integrated Circuits Conf., May, 1996, pp. 9.1.1-9.1.4.
[27] Cherng, J. S., Chen, S. J. and Ho, J. M., "Efficient Bipartitioning Algorithm for Size-Constrained Circuits", IEE Proceedings-Computers and Digital Techniques, 145(1), 37-45, January, 1998.
[28] Cheng, C. K. and Hu, T. C. (1992). "Maximum Concurrent Flow and Minimum Ratio Cut", Algorithmica, 8, 233-249.
[29] Chou, N. C., Liu, L. T., Cheng, C. K., Dai, W. J. and Lindelof, R., "Local Ratio Cut and Set Covering Partitioning for Huge Logic Emulation Systems", IEEE Trans. Computer-Aided Design, pp. 1085-1092, September, 1995.
[30] Chvatal, V. (1983). Linear Programming, W. H. Freeman and Company.
[31] Cong, J. and Ding, Y., "FlowMap: An Optimal Technology Mapping Algorithm for Delay Optimization in Lookup-Table Based FPGA Designs", IEEE Trans. Computer-Aided Design, January, 1994, 13, 1-12.
[32] Cong, J., Labio, W. and Shivakumar, N., "Multi-way VLSI circuit partitioning based on dual net representation", In: Proc. IEEE Int. Conf. Computer-Aided Design, November, 1994, pp. 56-62.
[33] Cong, J., Li, H. P., Lim, S. K., Shibuya, T. and Xu, D., "Large scale circuit partitioning with loose/stable net removal and signal flow based clustering", In: Proc. IEEE Int. Conf. Computer-Aided Design, November, 1997, pp. 441-446.
[34] Donath, W. E. and Hoffman, A. J. (1973). "Lower Bounds for the Partitioning of Graphs", IBM J. Res. Dev., pp. 420-425.
[35] Donath, W. E. and Hoffman, A. J. (1972). "Algorithms for partitioning of graphs and computer logic based on eigenvectors of connection matrices", IBM Technical Disclosure Bulletin 15, pp. 938-944.
[36] Donath, W. E. (1988). "Logic partitioning", In: Physical Design Automation of VLSI Systems, Preas, B. and Lorenzetti, M. (Eds.) Menlo Park, CA: Benjamin/Cummings, pp. 65-86.
[37] Dutt, S. and Deng, W., "A Probability-based Approach to VLSI Circuit Partitioning", In: Proc. ACM/IEEE Design Automation Conf., June, 1996, pp. 100-105.
[38] Dutt, S. and Deng, W., "VLSI Circuit Partitioning by Cluster-Removal Using Iterative Improvement Techniques", In: Proc. IEEE Int. Conf. Computer-Aided Design, November, 1996, pp. 194-200.
[39] Enos, M., Hauck, S. and Sarrafzadeh, M., "Evaluation and optimization of Replication Algorithms for logic Bipartitioning", IEEE Trans. on Computer-Aided Design, September, 1999, 18, 1237-48.
[40] Fiduccia, C. M. and Mattheyses, R. M., "A Linear-Time Heuristic for Improving Network Partitions", In: Proc. ACM/IEEE Design Automation Conf., June, 1982, pp. 175-181.
[41] Frankle, J. and Karp, R. M. (1986). "Circuit Placement and Cost Bounds by Eigenvector Decomposition", Proc. Int. Conf. on Computer-Aided Design, pp. 414-417.
[42] Garbers, J., Promel, H. J. and Steger, A. (1990). "Finding clusters in VLSI circuits", In: Proc. IEEE Int. Conf. Computer-Aided Design, pp. 520-523.
[43] Garey, M. R. and Johnson, D. S., Computers and Intractability: A Guide to the Theory of NP-Completeness, W.H. Freeman, San Francisco, CA, 1979.
[44] Hagen, L. and Kahng, A. B., "New spectral methods for ratio cut partitioning and clustering", IEEE Trans. Computer-Aided Design, 11(9), 1074-1085, September, 1992.
[45] Hagen, L. and Kahng, A. B., "Combining problem reduction and adaptive multistart: a new technique for superior iterative partitioning", IEEE Trans. Computer-Aided Design, 16(7), 709-717, July, 1997.
[46] Hall, K. M., "An r-dimensional Quadratic Placement Algorithm", Management Science, 17(3), 219-229, November, 1970.
[47] Hamada, T., Cheng, C. K. and Chau, P., "An Efficient Multi-Level Placement Technique Using Hierarchical Partitioning", IEEE Trans. Circuits and Systems, 39, 432-439, June, 1992.
[48] Hennessy, J. (1983). "Partitioning Programmable Logic Arrays Summary", Int. Conf. on Computer-Aided Design, pp. 180-181.
[49] Hoffmann, A. G., "The Dynamic Locking Heuristic A New Graph Partitioning Algorithm", In: Proc. IEEE Int. Symp. Circuits and Systems, May, 1994, pp. 173-176.
[50] Adolphson, D. and Hu, T. C., "Optimal Linear Ordering", SIAM J. Appl. Math., 25(3), 403-423, November, 1973.
[51] Hu, T. C., "Decomposition Algorithm", pp. 17-22, In: Combinatorial Algorithms, Addison Wesley, 1982.
[52] Hu, T. C. and Moerder, K., "Multiterminal flows in a hypergraph", In: VLSI Circuit Layout: Theory and Design, Hu, T. C. and Kuh, E. (Eds.) NY: IEEE Press, 1985, pp. 87-93.
[53] Hur, S. W. and Lillis, J. (1999). "Relaxation and Clustering in a Local Search Framework: Application to Linear Placement", Design Automation Conference, pp. 360-366.
[54] Hwang, J. and Gamal, A. E., "Optimal Replication for Min-Cut Partitioning", Proc. IEEE/ACM Intl. Conf. Computer-Aided Design, November, 1992, pp. 432-435.
[55] Iman, S., Pedram, M., Fabian, C. and Cong, J., "Finding uni-directional cuts based on physical partitioning and logic restructuring", In: Proc. ACM/SIGDA Physical Design Workshop, May, 1993, pp. 187-198.
[56] Johnson, D. S., Aragon, C. R., McGeoch, L. A. and Schevon, C. (1989). "Optimization by Simulated Annealing: an Experimental Evaluation, Part I, Graph Partitioning", Operations Research, 37(5), 865-892.
[57] Karp, R. M. (1978). "A Characterization of The Minimum Cycle Mean in A Digraph", Discrete Mathematics, 23, 309-311.
[58] Karypis, G., Aggarwal, R., Kumar, V. and Shekhar, S., "Multilevel Hypergraph Partitioning: Application in VLSI Domain", In: Proc. ACM/IEEE Design Automation Conf., June, 1997, pp. 526-529.
[59] Karypis, G., Aggarwal, R., Kumar, V. and Shekhar, S. (1998). "Multilevel Hypergraph Partitioning: Application in VLSI Domain", Manuscript of CS Dept., Univ. of Minnesota, pp. 1-25 (http://www.users.cs.umn.edu/karypis/metis/publications/).
[60] Kernighan, B. W. and Lin, S., "An Efficient Heuristic Procedure for Partitioning Graphs", Bell Syst. Tech. J., 49(2), 291-307, February, 1970.
[61] Khellaf, M., "On The Partitioning of Graphs and Hypergraphs", Ph.D. Dissertation, Indus. Engineering and Operations Research, Univ. of California, Berkeley, 1987.
[62] Kirkpatrick, S., Gelatt, C. and Vecchi, M., "Optimization by Simulated Annealing", Science, 220(4598), 671-680, May, 1983.
[63] Knuth, D. E., The Art of Computer Programming, Addison Wesley, 1997.
[64] Kring, C. and Newton, A. R. (1991). "A Cell-Replicating Approach to Mincut Based Circuit Partitioning", Proc. IEEE Int. Conf. on Computer-Aided Design, pp. 2-5.
[65] Krishnamurthy, B., "An Improved Min-Cut Algorithm for Partitioning VLSI Networks", IEEE Trans. Computers, C-33(5), 438-446, May, 1984.
[66] Krupnova, H., Abbara, A. and Saucier, G. (1997). "A Hierarchy-Driven FPGA Partitioning Method", Design Automation Conf., pp. 522-525.
[67] Kuo, M. T. and Cheng, C. K., "A New Network Flow Approach for Hierarchical Tree Partitioning", In: Proc. ACM/IEEE Design Automation Conf., June, 1997, pp. 512-517.
[68] Kuo, M. T., Liu, L. T. and Cheng, C. K., "Network Partitioning into Tree Hierarchies", In: Proc. ACM/IEEE Design Automation Conf., June, 1996, pp. 477-482.
[69] Kuo, M. T., Liu, L. T. and Cheng, C. K., "Finite State Machine Decomposition for I/O Minimization", In: Proc. IEEE Int. Symp. on Circuits and Systems, May, 1995, pp. 1061-1064.
[70] Kuo, M. T., Wang, Y., Cheng, C. K. and Fujita, M., "BDD-Based Logic Partitioning for Sequential Circuits", In: Proc. ASP/DAC, Chiba, Japan, January, 1997, pp. 607-612.
[71] Lomonosov, M. V. (1985). "Combinatorial Approaches to Multiflow Problems", Discrete Applied Mathematics, 11(1), 1-94.
[72] Landman, B. S. and Russo, R. L., "On a Pin Versus Block Relationship for Partitioning of Logic Graphs", IEEE Trans. on Computers, C-20, 1469-1479, December, 1971.
[73] Lawler, E. L., Combinatorial Optimization: Networks and Matroids, Holt, Rinehart and Winston, New York, 1976.
[74] Leighton, T. and Rao, S. (1988). "An Approximate Max-Flow Min-cut Theorem for Uniform Multicommodity Flow Problems with Applications to Approximation Algorithms", IEEE Symp. on Foundations of Computer Science, pp. 422-431.
[75] Leighton, T., Makedon, F., Plotkin, S., Stein, C., Tardos, E. and Tragoudas, S., "Fast Approximation Algorithms for Multicommodity Flow Problems", Tech. report no. STAN-CS-91-1375, Dept. of Computer Science, Stanford University.
[76] Leiserson, C. E. and Saxe, J. B. (1991). "Retiming Synchronous Circuitry", Algorithmica, 6(1), 5-35.
[77] Lengauer, T. and Muller, R. (1988). "Linear Arrangement Problems on Recursively Partitioned Graphs", Zeitschrift fur Operations Research, 32, 213-230.
[78] Lengauer, T., Combinatorial Algorithms for Integrated Circuit Layout, Wiley, 1990.
[79] Li, J., Lillis, J. and Cheng, C. K., "Linear decomposition algorithm for VLSI design applications", In: Proc. IEEE Int. Conf. Computer-Aided Design, November, 1995, pp. 223-228.
[80] Li, J., Lillis, J., Liu, L. T. and Cheng, C. K., "New Spectral Linear Placement and Clustering Approach", In: Proc. ACM/IEEE Design Automation Conf., June, 1996, pp. 88-93.
[81] Liou, H. Y., Lin, T. T., Liu, L. T. and Cheng, C. K., "Circuit Partitioning for Pipelined Pseudo-Exhaustive Testing Using Simulated Annealing", In: Proc. IEEE Custom Integrated Circuits Conf., May, 1994, pp. 417-420.
[82] Liu, L. T., Kuo, M. T., Cheng, C. K. and Hu, T. C., "A Replication Cut for Two-Way Partitioning", IEEE Trans. Computer-Aided Design, May, 1995, pp. 623-630.
[83] Liu, L. T., Kuo, M. T., Cheng, C. K. and Hu, T. C., "Performance-Driven Partitioning Using a Replication Graph Approach", In: Proc. ACM/IEEE Design Automation Conf., June, 1995, pp. 206-210.
[84] Liu, L. T., Kuo, M. T., Huang, S. C. and Cheng, C. K., "A gradient method on the initial partition of Fiduccia-Mattheyses algorithm", In: Proc. IEEE Int. Conf. Computer-Aided Design, November, 1993, pp. 229-234.
[85] Liu, L. T., Shih, M., Chou, N. C., Cheng, C. K. and Ku, W., "Performance-Driven Partitioning Using Retiming and Replication", In: Proc. IEEE Int. Conf. Computer-Aided Design, November, 1993, pp. 296-299.
[86] Liu, L. T., Shih, M. and Cheng, C. K., "Data Flow Partitioning for Clock Period and Latency Minimization", In: Proc. ACM/IEEE Design Automation Conf., June, 1994, pp. 658-663.
[87] Matula, D. W. and Shahrokhi, F., "The Maximum Concurrent Flow Problem and Sparsest Cuts", Tech. Report, Southern Methodist Univ., 1986.
[88] McFarland, M. C., "Computer-aided partitioning of behavioral hardware descriptions", In: Proc. ACM/IEEE Design Automation Conf., June, 1983, pp. 472-478.
[89] Motwani, R. and Raghavan, P. (1995). Randomized Algorithms, Cambridge University Press.
[90] Ng, T. K., Oldfield, J. and Pitchumani, V., "Improvements of a mincut partition algorithms", In: Proc. IEEE Int. Conf. Computer-Aided Design, November, 1987, pp. 470-473.
[91] Nijssen, R. X. T., Jess, J. A. G. and Eindhoven, T. U., "Two-Dimensional Datapath Regularity Extraction", Physical Design Workshop, April, 1996, pp. 111-117.
[92] Parhi, K. K. and Messerschmitt, D. G. (1991). "Static Rate-Optimal Scheduling of Iterative Data-Flow Programs via Optimum Unfolding", IEEE Trans. on Computers, 40(2), 178-195.
[93] Riess, B. M., Doll, K. and Johannes, F. M., "Partitioning very large circuits using analytical placement techniques", In: Proc. ACM/IEEE Design Automation Conf., June, 1994, pp. 646-651.
[94] Roy, K. and Sechen, C., "A Timing Driven N-Way Chip and Multi-Chip Partitioner", Proc. IEEE/ACM Int. Conf. on Computer-Aided Design, pp. 240-247, November, 1993.
[95] Russo, R. L., Oden, P. H. and Wolff, P. K. Sr., "A heuristic procedure for the partitioning and mapping of computer logic graphs", IEEE Trans. on Computers, C-20, 1455-1462, December, 1971.
[96] Saab, Y., "A fast and robust network bisection algorithm", IEEE Trans. Computers, 44(7), 903-913, July, 1995.
[97] Saab, Y. and Rao, V. (1989). "An Evolution-Based Approach to Partitioning ASIC Systems", ACM/IEEE 26th Design Automation Conf., pp. 767-770.
[98] Sanchis, L. A., "Multiple-Way Network Partitioning", IEEE Trans. Computers, 38(1), 62-81, January, 1989.
[99] Sanchis, L. A., "Multiple-Way Network Partitioning with Different Cost Functions", IEEE Trans. on Computers, pp. 1500-1504, December, 1993.
[100] Schuler, D. M. and Ulrich, E. G. (1972). "Clustering and Linear Placement", Proc. 9th Design Automation Workshop, pp. 50-56.
[101] Schweikert, D. G. and Kernighan, B. W. (1972). "A Proper Model for the Partitioning of Electrical Circuits", Proc. 9th Design Automation Workshop, pp. 57-62.
[102] Sechen, C. and Chen, D. (1988). "An Improved Objective Function for Mincut Circuit Partitioning", Proc. Int. Conf. on Computer-Aided Design, pp. 502-505.
[103] Shahrokhi, F. and Matula, D. W., "The Maximum Concurrent Flow Problem", Journal of the ACM, 37(2), 318-334, April, 1990.
[104] Shapiro, J. F., Mathematical Programming: Structures and Algorithms, Wiley, New York (1979).
[105] Sherwani, N. A., Algorithms for VLSI Physical Design Automation, 3rd edn., Kluwer Academic (1999).
[106] Shih, M., Kuh, E. S. and Tsay, R.-S. (1992). "Performance-Driven System Partitioning on Multi-Chip Modules", Proc. 29th ACM/IEEE Design Automation Conf., pp. 53-56.
[107] Shih, M. and Kuh, E. S. (1993). "Quadratic Boolean Programming for Performance-Driven System Partitioning", Proc. 30th ACM/IEEE Design Automation Conf., pp. 761-765.
[108] Shin, H. and Kim, C., "A Simple Yet Effective Technique for Partitioning", IEEE Trans. on Very Large Scale Integration Systems, pp. 380-386, September, 1993.
[109] Wei, Y. C. and Cheng, C. K. (1991). "Ratio Cut Partitioning for Hierarchical Designs", IEEE Trans. on Computer-Aided Design, 10(7), 911-921.
[110] Wei, Y. C., Cheng, C. K. and Wurman, Z., "Multiple Level Partitioning: An Application to the Very Large Scale Hardware Simulators", IEEE Journal of Solid State Circuits, 26, 706-716, May, 1991.
[111] Woo, N. S. and Kim, J. (1993). "An Efficient Method of Partitioning Circuits for Multiple-FPGA Implementation", Proc. ACM/IEEE Design Automation Conf., pp. 202-207.
[112] Yang, H. and Wong, D. F. (1994). "Edge-Map: Optimal Performance Driven Technology Mapping for Iterative LUT Based FPGA Designs", Int. Conf. on Computer-Aided Design, pp. 150-155.
[113] Yang, H. and Wong, D. F., "Efficient Network Flow based Min-Cut Balanced Partitioning", In: Proc. IEEE Int. Conf. Computer-Aided Design, November, 1994, pp. 50-55.
[114] Yeh, C. W., "On the Acceleration of Flow-Oriented Circuit Clustering", IEEE Trans. Computer-Aided Design, 14(10), 1305-1308, October, 1995.
[115] Yeh, C. W., Cheng, C. K. and Lin, T. T. Y., "A general purpose, multiple-way partitioning algorithm", IEEE Trans. Computer-Aided Design, 13(12), 1480-1488, December, 1994.
[116] Yeh, C. W., Cheng, C. K. and Lin, T. T. Y., "Optimization by iterative improvement: an experimental evaluation on two-way partitioning", IEEE Trans. Computer-Aided Design, 14(2), 145-153, February, 1995.
[117] Yeh, C. W., Cheng, C. K. and Lin, T. T. Y., "Circuit clustering using a stochastic flow injection method", IEEE Trans. Computer-Aided Design, 14(2), 154-162, February, 1995.
[118] Zien, J. Y., Chan, P. K. and Schlag, M., "Hybrid spectral/iterative partitioning", In: Proc. IEEE Int. Conf. Computer-Aided Design, November, 1997, pp. 436-440.

Authors’ Biographies

Sao-Jie Chen has been a member of the faculty in the Department of Electrical Engineering, National Taiwan University since 1982, where he is currently a full professor. During the fall of 1999, he held a visiting appointment at the Department of Computer Science and Engineering, University of California, San Diego. His current research interests include VLSI circuit design, VLSI physical design automation, and object-oriented software engineering. Dr. Chen is a member of the Association for Computing Machinery, the IEEE, and the IEEE Computer Society.

Chung-Kuan Cheng received the B.S. and M.S. degrees in electrical engineering from National Taiwan University, and the Ph.D. degree in electrical engineering and computer sciences from the University of California, Berkeley in 1984. From 1984 to 1986 he was a senior CAD engineer at Advanced Micro Devices Inc. In 1986, he joined the University of California, San Diego, where he is a Professor in the Computer Science and Engineering Department and an Adjunct Professor in the Electrical and Computer Engineering Department. He served as a chief scientist at Mentor Graphics in 1999. He has been an associate editor of IEEE Trans. on Computer-Aided Design since 1994. He is a recipient of the best paper award, IEEE Trans. on Computer-Aided Design, 1997, and the NCR excellence in teaching award, School of Engineering, UCSD, 1991. His research interests include network optimization and design automation on microelectronic circuits.