HIGH-PERFORMANCE ENERGY-EFFICIENT
DUAL-SUPPLY ALU DESIGN
Sanu K. Mathew, Mark A. Anders, and Ram K. Krishnamurthy
Circuits Research Laboratories, Intel Corporation, Hillsboro, OR, USA
Abstract:
This chapter describes the design of a single-cycle 64-bit integer execution ALU
fabricated in 90 nm dual-Vt CMOS technology, operating at 4 GHz in the 64-bit mode with a 32-bit mode latency of 7 GHz (measured at 1.3 V, 25 °C). The
lower- and upper-order 32-bit domains operate on separate off-chip supply voltages, enabling conditional turn-on/off of the 64-bit ALU mode operation and
efficient power-performance optimization. High-speed single-rail dynamic circuit techniques and a sparse-tree semi-dynamic adder core enable a dense layout
occupying 280 × 260 µm² while simultaneously achieving (i) low carry-merge
fan-outs and inter-stage wiring complexity, (ii) low active leakage and dynamic
power consumption, (iii) high DC noise robustness with maximum low-Vt usage,
(iv) single-rail dynamic-compatible ALU write-back bus, (v) simple 2φ 50%
duty-cycle timing plan with seamless time-borrowing across phases, (vi) scalable
64-bit ALU performance up to 7 GHz measured at 2.1 V, 25 °C, and (vii) scalable
32-bit ALU performance up to 9 GHz measured at 1.68 V, 25 °C.
Key words:
1. Introduction
Fast 32-bit and 64-bit arithmetic and logic units (ALUs) with single-cycle
latency and throughput are essential ingredients of high-performance superscalar integer and floating-point execution cores. Furthermore, in a typical
ALU operation, the lower-order 32 bits of the ALU output are required
early for address generation and rapid back-to-back operations [1]. These
constraints require a high-performance ALU with a compact layout footprint that minimizes interconnect delays in the core. ALUs are also among
the highest power-density locations on the processor, resulting in
thermal hotspots and sharp temperature gradients within the execution core.
The presence of multiple execution engines in current-day processors [1]
further aggravates the problem, severely impacting circuit reliability and
increasing cooling costs. This strongly motivates energy-efficient
ALU designs that satisfy the high-performance requirements while reducing peak and average power dissipation. Traditional dense-tree adder architectures such as Kogge-Stone [2] use full binary carry-merge trees that
result in large transistor sizes because of their high fan-outs and require
wide routing channels for inter-stage wiring. Adder architectures such as
Ladner-Fischer [3] address the wiring problem by reducing the number of inter-stage interconnects at the expense of exponentially increasing carry-merge
fan-outs.
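The wiring versus fan-out trade-off described above can be made concrete with a small Python sketch. It is an illustration only, not part of the fabricated design: the function names are hypothetical, a Sklansky-style divide-and-conquer network is used as a stand-in for the high-fan-out class that the text attributes to Ladner-Fischer, and "lateral fan-out" here counts only how many carry-merge gates in a level read the same lower-order node.

```python
from collections import Counter

def kogge_stone_pairs(n):
    """Per level, (i, j) pairs: node i merges with lower-order node j."""
    levels, d = [], 1
    while d < n:
        levels.append([(i, i - d) for i in range(d, n)])
        d *= 2
    return levels

def sklansky_pairs(n):
    """Divide-and-conquer network (assumed stand-in for the high-fan-out class)."""
    levels, d = [], 1
    while d < n:
        levels.append([(i, (i // d) * d - 1)
                       for i in range(n) if (i // d) % 2 == 1])
        d *= 2
    return levels

def max_lateral_fanout(levels):
    """Largest number of merge gates driven by one lower-order node, per level."""
    return [max(Counter(j for _, j in lvl).values()) for lvl in levels]

def merges_per_level(levels):
    """Number of carry-merge gates (and lateral wires) in each level."""
    return [len(lvl) for lvl in levels]

def prefix_carries(a, b, n, levels):
    """Evaluate a prefix network; result[i] is the carry out of bit i (carry-in 0)."""
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]
    p = [((a >> i) | (b >> i)) & 1 for i in range(n)]
    for lvl in levels:
        ng, np_ = g[:], p[:]
        for i, j in lvl:
            ng[i] = g[i] | (p[i] & g[j])   # carry-merge: G = Ghi + Phi.Glo
            np_[i] = p[i] & p[j]
        g, p = ng, np_
    return g

if __name__ == "__main__":
    for name, build in (("Kogge-Stone", kogge_stone_pairs),
                        ("Sklansky-style", sklansky_pairs)):
        lv = build(32)
        print(name, "fan-out:", max_lateral_fanout(lv),
              "merges:", merges_per_level(lv))
```

Under these assumptions, the 32-bit Kogge-Stone network keeps lateral fan-out at one per level but needs up to 31 lateral connections in a level, while the divide-and-conquer network needs only 16 per level at the cost of fan-out doubling at every level.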
In this chapter a single-cycle 64-bit integer execution ALU [4] fabricated in
90 nm dual-Vt CMOS technology [5] is described. A sparse-tree adder architecture is employed to address the high fan-out issue mentioned above, as well as to
reduce the inter-stage wiring complexity by up to 80% [6]. High-speed single-rail dynamic circuit techniques and migration of non-critical paths to fully
static CMOS enable low carry-merge fan-outs, low active leakage and dynamic
power consumption, high DC noise robustness, and a dense layout. The complete 64-bit ALU operates at 4 GHz in the 64-bit mode measured at 1.3 V, 25 °C
and consumes 300 mW total power. The corresponding 32-bit mode latency
is 7 GHz, also measured at 1.3 V, 25 °C. The lower- and upper-order 32-bit
domains operate on separate off-chip supply voltages, enabling conditional
turn-on/off of the 64-bit ALU mode operation and efficient power-performance
optimization, resulting in up to 22% power savings. The 64-bit ALU performance is scalable up to 7 GHz measured at 2.1 V, 25 °C, and the 32-bit ALU
performance is scalable up to 9 GHz measured at 1.68 V, 25 °C. Burn-in tolerant
conditional keepers are inserted on all dynamic gates to enable full functionality under worst-case noise conditions and elevated supply/temperature stress
tests [7].
The remainder of this chapter is organized as follows: Section 2 describes
the organization of the ALU; Sections 3 and 4 present the key circuits that
enable an energy-efficient single-rail ALU design; the ALU clocking scheme
and timing plan are described in Section 5; Section 6 discusses the benefits of
this design over a conventional dual-rail dynamic implementation; Section 7
presents the 90 nm dual-Vt CMOS implementation and silicon measurement
results; the dual-supply voltage operation of the ALU is described in Section 8.
Finally, the chapter is summarized in Section 9.
2. ALU Organization
The integer execution ALU operates on inputs that originate from the integer
register file, the L0 cache, bypass cache, and ALU write-back bus results. A 5:1
source multiplexer selects a pair of these operands and delivers them as inputs to
the single-rail adder/logical unit core. The output of the adder shares a bus driver
with the logical unit, and drives a 110 µm write-back bus that sends the ALU
outputs back to its inputs. The ALU is designed for operation in both 32-bit and
64-bit modes. This organization (Figure 1) enables single-cycle, back-to-back
execution of 32/64-bit Add, Subtract, Accumulate and Logic instructions.
3.
required to prevent the turned-on NMOS device (N0) from falsely discharging
the output node. Careful sizing of the cross-coupled PMOS keepers (P0 and
P1) limits the switching noise on the output node. Figure 3 shows the impact
of clock skew on peak noise-glitch at the non-switching dynamic output node.
Noise constraints of this design require the maximum noise-glitch to be less
than 75 mV. Simulations in 90 nm CMOS technology show that an 18 ps setup
time at the φ1 boundary enables this circuit to tolerate up to 15 ps clock
variation about the nominal point, while adequately meeting all propagated
noise constraints. The single-to-dual-rail converting source multiplexer circuit
eliminates the need for a differential write-back bus, thereby contributing to
the energy efficiency and compact layout of this design.
4.
4.1.
Figure 3. Source-multiplexer peak output noise vs. input data arrival time trade-off.
NAND Bi#) and generate (Gi = Ai# NOR Bi#) signals from the adder inputs
Ai# and Bi#. Complementary versions of all ALU inputs, except for the write-back bus, are available at the front-end multiplexer. As described in Section 3,
the complementary version of the write-back bus result can be easily generated using the single-to-dual-rail converter circuit. The use of complementary
adder inputs results in single-transistor pull-up (GPi# = Pi NAND Pi−1) and
pull-down (GPi = Pi# NOR Pi−1#) evaluation paths in the static and dynamic
4.2.
the sparse-tree (C3, C7, ..., C23, and C27) then select the appropriate four-bit
conditional sums using a 2:1 multiplexer, delivering the final 32-bit sum. In
this way logic traditionally implemented in the main carry-tree using expensive
parallel prefix logic is implemented in the sparse-tree design using an energy-efficient architecture. Such an approach results in smaller area, reduced energy
consumption and lower leakage.
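The selection scheme just described can be sketched as a functional Python model. This is a behavioral illustration under stated assumptions, not the circuit itself: the function names are hypothetical, propagate and generate are taken as P = A OR B and G = A AND B (the true-polarity equivalents of the Section 4.1 expressions), and the every-fourth-bit carries are produced by a simple loop that stands in for the logarithmic-depth sparse carry-merge tree.

```python
MASK32 = (1 << 32) - 1

def pg_bits(a, b, n=32):
    """Bitwise propagate (A OR B) and generate (A AND B) signals."""
    p = [((a >> i) | (b >> i)) & 1 for i in range(n)]
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]
    return p, g

def sparse_carries(p, g, n=32, stride=4):
    """Carries out of bits 3, 7, ..., 27 only (stand-in for the sparse tree)."""
    carries, c = {}, 0
    for i in range(n):
        c = g[i] | (p[i] & c)
        if i % stride == stride - 1 and i < n - 1:
            carries[i] = c
    return carries

def conditional_sum(p, g, base, cin, width=4):
    """Four-bit sum for an assumed carry-in, rippled inside the group."""
    s, c = 0, cin
    for k in range(width):
        i = base + k
        s |= ((p[i] ^ g[i]) ^ c) << k   # P xor G equals A xor B here
        c = g[i] | (p[i] & c)
    return s

def sparse_tree_add32(a, b):
    """32-bit add: sparse carries select between precomputed conditional sums."""
    a &= MASK32
    b &= MASK32
    p, g = pg_bits(a, b)
    carries = sparse_carries(p, g)
    result = 0
    for base in range(0, 32, 4):
        sum0 = conditional_sum(p, g, base, 0)
        sum1 = conditional_sum(p, g, base, 1)
        cin = 0 if base == 0 else carries[base - 1]
        result |= (sum1 if cin else sum0) << base   # 2:1 multiplexer
    return result

if __name__ == "__main__":
    print(hex(sparse_tree_add32(0x12345678, 0x0FEDCBA8)))
```

Only seven carries cross group boundaries in this model; everything else is local to a four-bit group, which is the property the text credits for the smaller area and lower energy.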
The conditional sum-generator block can be further optimized by exploiting
correlation of the inputs to the first-level conditional carry gates. As shown in
Figure 6, both first-level conditional carry-merge gates reduce to inverters,
which can be merged with the next gate in the ripple carry chain, reducing
the number of stages in the non-critical path from five to four. This additional
slack between the critical and non-critical paths may be further exploited to
save power.
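One way to see why the first-level gates collapse is the exhaustive check below. It is a sketch under an assumption that is not taken from Figure 6: the first-level conditional carry gate is modeled with the usual carry-merge form C = G + P·Cin, with Cin tied to 0 or 1 and P = A OR B. The function names are hypothetical.

```python
def first_level_conditional_carries(a_bit, b_bit):
    """Carry out of the first bit of a group for both assumed carry-ins."""
    p = a_bit | b_bit              # propagate (assumed A OR B)
    g = a_bit & b_bit              # generate
    carry_if_cin0 = g | (p & 0)    # carry-merge with carry-in tied to 0
    carry_if_cin1 = g | (p & 1)    # carry-merge with carry-in tied to 1
    return p, g, carry_if_cin0, carry_if_cin1

def reduces_to_single_input():
    """True if each conditional carry depends on one signal only."""
    for a_bit in (0, 1):
        for b_bit in (0, 1):
            p, g, c0, c1 = first_level_conditional_carries(a_bit, b_bit)
            if c0 != g or c1 != p:
                return False
    return True

if __name__ == "__main__":
    print(reduces_to_single_input())
```

Under this model the carry for Cin = 0 is simply G and the carry for Cin = 1 is simply P, so with complementary signals available each gate needs only an inversion, which is consistent with the stage-count reduction described above.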
4.3. Semi-dynamic Implementation
The performance criticality of the ALU demands a dynamic adder implementation. Partitioning the carry-merge tree into critical and non-critical sections enables an energy-efficient implementation by leveraging dynamic and
static techniques. Figure 7 shows the critical and non-critical sections of the
adder core. The critical path, implemented in single-rail dynamic logic, begins
with the two-stage source multiplexer (a dynamic 5:1 mux with a static (S)
NAND gate merging the true and complementary signals, as shown in Figure 2),
followed by the PG block that outputs the Pi and Gi signals from the inputs Ai#
and Bi#. At this point the critical and non-critical paths fork, with the five-stage
sparse-tree (CM1-CM5) in the critical path and the four-stage conditional sum
generator (two stages of conditional ripple-carry gates, a sum XOR and an
inverter) in the non-critical path. The final 1-in-4 carry (C27#) from the critical path selects between the two conditional sums (Sum#31_0 and Sum#31_1)
using a 2:1 transmission-gate multiplexer. The one-stage slack enables reduced
transistor sizes in the non-critical path, resulting in additional power savings.
To meet the performance requirement of the ALU, the critical path is implemented in single-rail dynamic logic. Compared to other dual-rail dynamic
implementations [9], this 50% reduction in transistor count and interconnect
complexity results in reduced leakage power and higher performance. The critical sparse-tree path includes the PG block and carry-merge gates implemented
in dynamic logic. The non-criticality of the sum generator allows a completely
static CMOS logic implementation. The low switching activity of static gates
reduces the average power consumption in the ALU.
The inputs (Gi, Pi, and Gi−1) to the conditional carry gate (Figure 8) of
the static sum generator are dynamic signals that go high during precharge. To
prevent this precharge activity from propagating through the sum generators
when the clock goes low, the first gate in the sum generator is converted to a set-dominant latch (Figure 8) by the addition of a clocked footer NMOS device
to the static carry-merge gate, clocked with stclk2, a staggered φ1 clock,
as described in Section 5. This transistor cuts off the discharge path for the
output during the precharge phase, while a full keeper added to the output node
holds state. Thus switching activity in the downstream static blocks is reduced,
resulting in a semi-dynamic design that reduces average power consumption
without impacting performance.
The interface between the static and dynamic signals occurs at the 2:1
transmission-gate multiplexer, with the gates CM4 and CM5 representing the
fourth- and fifth-level carry-merge gates of the sparse tree (Figure 9). The Carry#
signal is a dynamic signal (pre-discharged low) that drives the select node of the
multiplexer. Inputs to the transmission-gate multiplexer are static conditional
sum signals from the side-paths. During precharge, the lower transmission
gate is turned ON, and the output sum is equal to Sum1. During evaluation,
Carry# may remain low or go high in response to activity in the dynamic carry-merge gates. The multiplexer will select either of the conditional sums (Sum#0
or Sum#1) to deliver the final sum, enabling seamless time-borrowing at the
dynamic-static interface with no race conditions between the two paths.
4.4.
In the 64-bit mode the supply to the upper 32 bits of the adder is activated
(Figure 10). To enable layout reuse of the 32-bit adder cores, the upper 32-bit
section uses a similar sparse-tree to generate 32-way group-generates and propagates. The carry-out from the lower-order 32 bits is merged with these signals
in the additional carry-merge stage (CM6), generating every fourth carry of the
upper 32 bits (Figure 11). This architecture reduces design/layout effort and
produces the final 64-bit ALU outputs with only three additional gate stages
while minimally affecting the performance of the lower 32 bits of the adder.
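The role of the extra merge stage can be illustrated with a short functional sketch in Python. It assumes, as an interpretation of the text rather than a transcription of Figure 11, that the upper section produces group generate and propagate signals measured from bit 32 upward and that CM6 applies the standard carry-merge C = G + P·C31 to each of them. The function names are hypothetical, and the group signals are computed by a simple loop rather than a tree.

```python
def group_pg(a, b, lo, hi):
    """Group generate/propagate over bits lo..hi inclusive (P = A OR B assumed)."""
    grp_g, grp_p = 0, 1
    for i in range(lo, hi + 1):
        gi = ((a >> i) & (b >> i)) & 1
        pi = ((a >> i) | (b >> i)) & 1
        grp_g = gi | (pi & grp_g)
        grp_p = pi & grp_p
    return grp_g, grp_p

def upper_carries_via_cm6(a, b):
    """Every fourth carry of the upper 32 bits, merged with the lower carry-out."""
    c31, _ = group_pg(a, b, 0, 31)           # carry out of the lower 32 bits
    carries = {}
    for i in range(35, 64, 4):
        grp_g, grp_p = group_pg(a, b, 32, i)  # upper-half signals, independent of c31
        carries[i] = grp_g | (grp_p & c31)    # CM6-style merge (assumed form)
    return c31, carries

def reference_carry(a, b, i):
    """Carry out of bit i from a plain integer addition."""
    m = (1 << (i + 1)) - 1
    return (((a & m) + (b & m)) >> (i + 1)) & 1

if __name__ == "__main__":
    c31, upper = upper_carries_via_cm6(0x00000000FFFFFFFF, 0x0000000000000001)
    print(c31, upper)
```

Because the upper-half group signals do not depend on the lower carry-out, they can be prepared in parallel with the lower 32-bit tree, and the lower carry enters only at the final merge.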
The timing plan (Figure 13) illustrates the relative positioning of the ALU
clocks. The phase clocks φ1 and φ2 drive footed dynamic gates in the source
multiplexer and the fourth-level carry-merge gate, respectively. The remaining
carry-merge gates are implemented using footless dynamic circuits driven by
locally generated staggered clocks. Generation of these staggered clocks from
the phase clocks limits precharge races. Seamless time-borrowing occurs at all
clock boundaries except at the φ1 boundary, which requires a setup time of
18 ps to minimize noise generated at the single-to-dual-rail converting multiplexer's output.
6.
7. Measurement Results
8.
Figure 15. 32-bit and 64-bit ALU maximum frequency and power measurements.
Depending on the mode of operation, the upper 32 bits of the ALU can be
selectively turned on or turned off (Figure 16). In the 32-bit mode the supply to
the upper section is turned off, thereby reducing active leakage power by 50%.
In the 64-bit mode the three-stage slack present in the upper-order 32-bit section
of the tree is exploited by lowering its supply, resulting in lower dynamic power
consumption, without impacting performance. Low-voltage operation of the
ALU also provides the additional benefit of reducing leakage power. Leakage
current measurements of a 64-bit ALU in 90 nm CMOS technology (Figure 17)
show that a 66% active leakage power reduction is achieved by reducing the
supply voltage from 1.3 V down to 1 V. By operating the upper 32-bit section of
the ALU at this lower supply in the 64-bit mode we can obtain a 33% leakage
Figure 17. 90 nm measured results: scaling of 64-bit ALU active leakage with supply
voltage.
Figure 18. Total power savings and performance penalty vs. supply trade-off for upper-order
32 bits.
9.
Acknowledgments
The authors thank R. Saied, S. Wijeratne, D. Chow, M. Kumashikar,
C. Webb, G. Taylor, I. Young, H. Samarchi, and G. Gerosa for discussions; Desktop
Platforms Group, Folsom, CA for layout support; and S. Borkar, M. Haycock,
J. Rattner and S. Pawlowski for encouragement and support.
References
[1] Sager, D. et al. A 0.18 µm CMOS IA32 microprocessor with a 4 GHz integer execution unit, Dig. Tech. Papers, IEEE Int. Solid-State Circuits Conf., February
2001, 324–325.
[2] Kogge, P.; Stone, H.S. A parallel algorithm for the efficient solution of a general class
of recurrence equations, IEEE Trans. on Computers, 1973, C-22, 786–793.
[3] Knowles, S. A family of adders, Proc. 14th IEEE Int. Symp. on Computer Arithmetic,
April 1999, 277–281.
[4] Mathew, S.; Anders, M.; Bloechel, B.; Nguyen, T.; Krishnamurthy, R.; Borkar, S.
A 4 GHz 300 mW 64-bit integer execution ALU with dual supply voltages in 90 nm
CMOS, Dig. Tech. Papers, IEEE Int. Solid-State Circuits Conf., February 2004,
162–163.
[5] Thompson, S. et al. A 90 nm logic technology featuring 50 nm strained silicon channel
transistor, 7 layers of Cu interconnects, low-k ILD, 1 µm² SRAM cell, IEDM Tech.
Dig., December 2002, 61–64.
[6] Mathew, S.; Anders, M.; Krishnamurthy, R.; Borkar, S. A 4 GHz 130 nm address
generation unit with 32-bit sparse-tree adder core, IEEE J. Solid-State Circuits, 2003,
38, 689–695.
[7] Alvandpour, A.; Krishnamurthy, R.; Borkar, S. A sub-130 nm conditional keeper
technique, IEEE J. Solid-State Circuits, 2002, 37, 633–638.
[8] Anders, M.; Mathew, S.; Bloechel, B. et al. A 6.5 GHz 130 nm single-ended dynamic
ALU and instruction scheduler loop, Dig. Tech. Papers, IEEE Int. Solid-State Circuits
Conf., February 2002, 410–411.
[9] Naffziger, S. A sub-nanosecond 0.5 µm 64-bit adder design, Dig. Tech. Papers,
IEEE Int. Solid-State Circuits Conf., February 1996, 362–363.
[10] Alvandpour, A.; Krishnamurthy, R.; Eckerbert, D.; Apperson, S.; Bloechel, B.;
Borkar, S. A 3.5 GHz 32 mW 150 nm multiphase clock generator for high-performance microprocessors, Dig. Tech. Papers, IEEE Int. Solid-State Circuits Conf.,
February 2003, 112–113.