Carry Save Adder Trees in Multipliers: ECEN 6263 Advanced VLSI Design

E C E N 6 2 6 3 A d v a n c e d V L S I D e s i g n
Carry Save Adder Trees in Multipliers

A naive organization for adding the partial products together in an array multiplier is to
accumulate the partial products in each row and pass the accumulated partial products
(using carry save adders) to the next row
PPG
PPG
CSA
PPG
CSA
PPG
CSA
This makes the total delay going through the carry save proportional to N, the number of
bits in the data words (there are N rows or N/2 rows if Booth recoding is used.) This delay
is too long if we use a fast adder for the last row which as delay ~ log N.
The easiest way to reduce the CSA delay is to make a tree structure instead of the linearly
connected naive design. In general, we want an efficient way to add N (or N/2) partial
products. Let us examine the problem of adding 1 bit of each of the N products (this is
called a “bit-slice”.
Example 4 partial products (8x8 Booth)

PP4i PP3i PP2i PP1i
CSA
CSA
to fast adder
Carry Save Adder Trees in Multipliers November 3, 1999 page 1 of 8

The horizontal lines indicate that the carry out must be passed to the next most significant
bit slice. Since we make each bit slice the same, there will also be a corresponding carry-
in. A tree of carry save adders is called a “Wallace tree”. Note that only 2 levels of CSA’
are needed in the Wallace tree for 4 partial products whereas the linear structure requires 3
levels. There is an even greater savings for larger multipliers. For example 8 partial prod-
ucts (16x16 Booth)
PP8i PP7i PP6i PP5i PP4i PP3i PP2i PP1i
3,2 3,2
3,2 3,2
4,2
to fast adder
We have now used a notation where the CSA is regarded as a general element adding three
inputs and putting out two binary weighted outputs. Thus, the (4,2) element is simply the
design we already did for the 4 partial product example with four equally weighted inputs
and two binary weighted outputs. The (4,2) element has recently become a popular build-
ing block for CSA trees. This is because there are 1/2 as many outputs as inputs to the
(4,2) element. Thus, binary trees may be constructed which make the layout problem
somewhat easier.

Example (8,2) out of (4,2) blocks
PP8i PP7i PP6i PP5i PP4i PP3i PP2i PP1i
4,2
4,2
4,2
If we decompose each (4,2) block into (3,2) elements, there are 2 stages of (3,2) elements
in each (4,2) block. This gives a total of 4 stages for 8 partial products in both designs
(same delay) which is much better than the 7 required for the linear structure. The layout
for the tree with (4,2) building blocks is probably more straightforward than using (3,2)
elements directly as building blocks. Remember, each bit slice of the tree must be
repeated N times. Thus, the layout of the (8,2) structure from (4,2) building blocks might
look like:

PP1
PP2
(4,2)
PP3
PP4
(4,2)
PP5
PP6
(4,2)
PP7
PP8
Note that the tree interconnect must be provided for each bit slice. Note also that the car-
ries do not ripple through the (4,2) blocks in the horizontal direction (the horizontal carry
out is independent of the horizontal carry-in). Rather, the critical delay path follows the
vertical tree structure interconnect. Finally, note that although there are 3 (4,2) blocks, the
delay from any PP to an output goes through only 2 of the (4,2) blocks. If we extend this
design to larger numbers or partial products, the delay is proportional to log2 (no. of PP’s).
This is the basis for the high speed of tree-based designs.
Higher Order Building Blocks
The carry save adder can be regarded as the most general circuit that takes some number
of inputs and represents the sum of the inputs in 2 binary weighted output bits. We can
extend this idea into larger building blocks that represent binary weighted sums in 3, 4 or
more bits.
(3,2): most inputs in 2 output bits
These high order building blocks can greatly simplify global routing.

For example (16,2) out of (7,3) blocks.
PP16i PP15i PP14i PP13i PP12i PP11i PP10i PP9i PP8i PP7i PP6i PP5i PP4i PP3i PP2i PP1i
7,3
7,3
4,2
7,3











shift shift
shift 2 shift 1
2 1
Such a diagram is much harder to draw out of (3,2) blocks, and the routing is correspond-
ingly more complicated. Use of the (7,3) block organizes the routing and makes it more
efficient even if the (7,3) block is constructed from (3,2) blocks! (More on this later).
A similar organization is obtained using (15,4) as a building block. However, it is difficult

to easily route the 15 inputs to this single block (it requires numerous feed throughs in the
partial product generator!). The number of inputs can be reduced by making some of the
inputs have higher binary weight. For example, a (5,5,4) block is an alternative to the
(15,4) block. The (5,5,4) block requires 50% less feed throughs than the (15,4) block.
weight 1
weight 2 weight 1

















15,4 5,5,4
wt8 wt4 wt2 wt1 wt8 wt4 wt2 wt1
One might try to implement the large building blocks by developing logic equations for
each output and synthesizing the corresponding logic gates. This results is complex logic
functions with a high fan out from the circuit inputs (the same input occurs in many differ-
ent terms of the logic equation). This causes the implementation to be too slow and take
up too much area. The estimated the no. of FET’s ~ (no. of inputs)4! It seems more effi-
cient to implement the larger building blocks out of smaller building blocks. If this pro-
cess is repeated for the smaller blocks, then ultimately all the blocks can be made from the
(3,2) element. For example
3,2 3,2
(2) (1) (2) (1) Internal wiring
keeps track of
proper binary weight
3,2
(2) (1)
3,2
(4) (2)
wt4 wt2 wt1 (7,3)

We still gain efficiency in the design by using the (7,3) block since it
1. simplifies global routing

2. allows re-use of the layout and routing already done for the (7,3) when multiple
(7,3) are used.
The higher order building blocks discussed so far do not have any “hidden” carries like the
(4,2) block discussed earlier. It is also possible to make higher order blocks with “hidden”
carry-outs that do not depend on the “hidden” carry-ins.
A (6,2) block and a (9,2) block can be made.
3,2 3,2
4,2
(6,2)
3,2 3,2 3,2
6,2
(9,2)
This idea can be extended to arbitrarily large blocks. Note that none of the carries “ripple”
across from right to left. Rather the carries propagate down and to the left.

One should try to choose a block size or hierarchy of block sizes that fits the intended
design. For example, consider a design for the 53-bit by 53-bit multiplier for the mantissa
of the IEEE 744 floating point standard. First, Booth recording is used to reduce the num-
ber of partial products to 27. The needed (27,2) tree can be implemented as follows.
9,2
9,2
9,2
6,2
Since 27 is a multiple of 9, the (9,2) building block gives a very simple global routing
structure for this multiplier. But it would not be appropriate for a 16 or 32 bit multiplier
for example.

Carry Save Adder Trees in Multipliers: ECEN 6263 Advanced VLSI Design

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Carry Save Adder Trees in Multipliers: ECEN 6263 Advanced VLSI Design

Uploaded by

Copyright:

Available Formats

E C E N 6 2 6 3 A d v a n c e d V L S I D e s i g n

Carry Save Adder Trees in Multipliers

Example 4 partial products (8x8 Booth)

Carry Save Adder Trees in Multipliers November 3, 1999 page 1 of 8

PP8i PP7i PP6i PP5i PP4i PP3i PP2i PP1i

Carry Save Adder Trees in Multipliers November 3, 1999 page 2 of 8

Example (8,2) out of (4,2) blocks

PP8i PP7i PP6i PP5i PP4i PP3i PP2i PP1i

Carry Save Adder Trees in Multipliers November 3, 1999 page 3 of 8

Higher Order Building Blocks

(3,2): most inputs in 2 output bits

(7,3): most inputs in 3 output bits

(15,4): most inputs in 4 output bits

Carry Save Adder Trees in Multipliers November 3, 1999 page 4 of 8

A similar organization is obtained using (15,4) as a building block. However, it is difficult

wt8 wt4 wt2 wt1 wt8 wt4 wt2 wt1

wt4 wt2 wt1 (7,3)

Carry Save Adder Trees in Multipliers November 3, 1999 page 6 of 8

1. simplifies global routing

A (6,2) block and a (9,2) block can be made.

3,2 3,2 3,2

Carry Save Adder Trees in Multipliers November 3, 1999 page 7 of 8

Carry Save Adder Trees in Multipliers November 3, 1999 page 8 of 8

You might also like