
A 90-nm CMOS Embedded Low Power SRAM Compiler


Zhao-Yong Zhang, Chia-Cheng Chen*, and Jian-Bin Zheng
Abstract - In this paper, a highly flexible low power single-port
Static Random Access Memory (SRAM) compiler design is presented.
The Divided Word Line (DWL) and Divided Bit Line (DBL) schemes
were implemented to reduce active power. Particular emphasis was
placed on decreasing standby power consumption in the word line
driver. Forced-stack devices were introduced as the pulse
generation element for sensing enable, which guarantees that the
SRAM can work at low voltage without losing design margin. A
test-chip with 17 embedded SRAMs has been fabricated in the UMC
90-nm low leakage CMOS logic process.
We use the self-timing and replica techniques in the SRAM
circuit design for different memory densities, which give the
SRAM compiler both low power and high speed.
The organization of this paper is as follows. In Section II, a
brief overview of the architecture and the replica self-timing
technique is given. Section III discusses the design of selected
SRAM circuits, including the word line driver and sensing
enable pulse generation. Section IV presents the experimental
results on the performance of the test-chip. Section V
concludes the paper.
II. ARCHITECTURE
The SRAM is a synchronous single-port memory. The array core
uses a 6T high threshold voltage SRAM cell with 0.999 µm²
area, which is owned by UMC for its 90-nm low leakage CMOS
logic process. Different combinations of words, bits, and
aspect ratios BM (Block Multiplexing) can be used to
generate the most desirable configuration. Table I shows the
configuration information of the SRAM compiler.
TABLE I
SINGLE-PORT SRAM COMPILER CONFIGURATION
The SRAM can be organized as one to eight banks (1 to 32
sub-arrays) in a memory by utilizing the DWL and DBL
techniques. Each memory array can have a maximum of
256 rows (word lines) and 256 columns (bit lines). Fig. 1
presents an architecture diagram using DWL=2 and
DBL=2 as an example. The control circuit and block selector are
used to select one of two blocks (BS1, BS2). Each block
memory array is divided into left and right halves to decrease
access time [9] as well as the data line (SAOUT) tiling wire
length. The DBL scheme [5], [8] is embedded in the SRAM
compiler and can divide the bit line into one to four partitions
by following the rule listed in Table I. In Fig. 1, the highest bit
of the X address is used to generate the bit line partition signals
(BP1, BP2), which divide the bit line into two partitions. The
SRAM array is thus finally split into eight sub-arrays by the
DWL and DBL techniques. The local bit line multiplexing
circuit (column Mux.) connects the local bit line to the
global bit line, and then to the input of the sense amplifier.
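The bit line partition rule of Table I can be written compactly as a ceiling division by 256 word lines. The sketch below is only an illustration of that rule; the function names are hypothetical, and the even split of rows among partitions in `rows_per_partition` is an assumption that the paper does not state.

```python
import math


def bit_line_partitions(word_lines: int) -> int:
    """Number of bit line partitions (DBL) that Table I assigns to a word-line count."""
    if not 16 <= word_lines <= 1024:
        raise ValueError("Table I only covers 16 <= WL <= 1024")
    # Table I adds one partition per 256 word lines: 1 up to 256, 2 up to 512, ...
    return math.ceil(word_lines / 256)


def rows_per_partition(word_lines: int) -> int:
    """Largest row count in one partition, ASSUMING rows are split evenly."""
    return math.ceil(word_lines / bit_line_partitions(word_lines))


if __name__ == "__main__":
    for wl in (16, 256, 257, 512, 768, 1024):
        print(wl, bit_line_partitions(wl), rows_per_partition(wl))
```

Under the even-split assumption no partition exceeds 256 rows, which is consistent with the stated maximum of 256 word lines per memory array.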
Index Terms - Low power, SRAM compiler, divided word
line, divided bit line, forced-stack device, part power-gating,
replica technique, self-timing.
I. INTRODUCTION
With the scaling of CMOS transistors, a larger fraction of
chip area is devoted to embedded SRAM modules.
At the same time, the need for lighter portable electronic
applications with extended battery life has made low power
memory circuit design increasingly necessary and
important. An SRAM compiler, as a highly flexible
memory generation system, can meet many of the increasing
demands of SOC designers for compact, fully diffused embedded
memories [2], [6], [9].
There are numerous techniques to reduce SRAM power
dissipation [1], [3]-[11]. The DWL [1] and DBL [5], [8]
techniques are employed in our SRAM compiler to reduce
the active power. In this paper a new word line driver circuit
with a part power-gating scheme [10] is presented, which can
greatly decrease the standby current. The performance of an
SRAM generated by a compiler is affected not only by the PVT
(Process, Voltage, and Temperature) conditions but also by
the configuration parameters and density. A circuit designer of
an SRAM compiler must therefore build a reasonable
architecture to deal with those variations.
We use memory cells as replica circuits [4] to minimize the
effect of variability in operating and configuration conditions
on speed and power.
A latch-type sense amplifier is used in our SRAM compiler,
which gives a good result in terms of speed and power
because it is able to amplify a very low bit line swing voltage.
Zhao-Yong Zhang is with the Memory Design Department, AiceStar
Technology Corporation, Suzhou, China (e-mail: brightzhang@aicestar.com).
Chia-Cheng Chen is with the Module Intellectual Property Development,
Faraday Technology Corporation, Hsin-chu City, Taiwan (e-mail:
ccchen@faraday-tech.com).
Jian-Bin Zheng is with the Memory Design Department, AiceStar
Technology Corporation, Suzhou, China (e-mail: rickzheng@aicestar.com).
978-1-4244-3870-9/09/$25.00 2009 IEEE
Parameter                   Ranges
Words                       64b to 32Kb, increment = BM × 16
Bits                        1b to 128b, increment = 1
Bytes                       128b to 1b, decrement = 1
Bit line partitions (DBL)   1 when 16 ≤ WL ≤ 256
                            2 when 256 < WL ≤ 512
                            3 when 512 < WL ≤ 768
                            4 when 768 < WL ≤ 1024
Aspect ratios (DWL)         Block Mux. (BM) 1, 2, 4, 8
The aspect ratio for each configuration can be controlled by
selecting one of four different block multiplexing (BM)
schemes (namely DWL): 1:1, 2:1, 4:1, and 8:1 ratios. The
local bit line multiplexing is 4:1 in all four BM options.
The dual-threshold voltage technique [11] is used to trade
off high speed against low power operation in the SRAM
compiler. Memory cells use high threshold voltage devices and
peripheral circuits use regular threshold voltage devices.
III. CIRCUIT DESIGN
The design of an SRAM consists of three major blocks: the
memory cell, the decoder circuits, and the sense
amplifiers. In the following sections, we give a brief
overview of the row decoders and the sensing enable circuit.
Because the memory cell is owned by UMC, its design is not
discussed here.
A. Row Decoders
Row decoders are used to assert the word lines based on the
input addresses. The decoder structure mainly consists of an
initial pre-decoder circuit and a word line decoder circuit.
Fig. 3 shows an 8 to 256 decoder structure designed for the
global word line decoder.
[Figure 3 diagram: address inputs AX0 - AX7 feed two 3 to 8 pre-decoders and one 2 to 4 pre-decoder (outputs PA0 - PA7, PB0 - PB3, PC0 - PC7); each WLDRV8 drives global word lines GWL0 - GWL7.]
Fig. 3. Block diagram of global word line decoder with part power-gating.
In the pre-decoder stage, the address inputs AX0 - AX2
and the internal clock (ICLK) are combined using a 3 to 8 CMOS
static pre-decoding circuit. ICLK is a self-reset pulse
signal (refer to Fig. 2), which makes the asserted word line
a pulse as well. The other address inputs AX3 - AX7 are used to
generate the pre-decoding signals PB and PC. These two sets of
pre-decoder outputs are then combined to give the outputs which
drive the NAND gates (PBC0 - PBC31) and the part power-gating
PMOS devices (PBC0B - PBC31B). Before a word line is asserted,
PB and PC must first be enabled to avoid global word
line glitches. Although this increases the address setup time,
it decreases the access time.
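The 8 to 256 decoding described above can be illustrated with a small behavioural model. This is a sketch only, not the gate-level circuit of Fig. 3: the function names are hypothetical, and both the split of AX3 - AX7 between the PB and PC pre-decoders and the ordering of the PBC signals are assumptions, since the text does not specify them.

```python
def predecode(addr: int, iclk: int = 1):
    """Return one-hot lists (pa, pb, pc) for an 8-bit row address AX7..AX0."""
    if not 0 <= addr < 256:
        raise ValueError("address must fit in 8 bits")
    ax_low = addr & 0b111           # AX0-AX2: 3 to 8 pre-decoder, gated by ICLK
    ax_mid = (addr >> 3) & 0b11     # ASSUMED AX3-AX4: 2 to 4 pre-decoder (PB)
    ax_high = (addr >> 5) & 0b111   # ASSUMED AX5-AX7: 3 to 8 pre-decoder (PC)
    pa = [int(bool(iclk) and i == ax_low) for i in range(8)]
    pb = [int(i == ax_mid) for i in range(4)]
    pc = [int(i == ax_high) for i in range(8)]
    return pa, pb, pc


def global_word_lines(addr: int, iclk: int = 1):
    """Return the 256 global word line levels for one address."""
    pa, pb, pc = predecode(addr, iclk)
    # PBC0..PBC31 select one WLDRV8 group each (index order is an assumption).
    pbc = [pb[k % 4] & pc[k // 4] for k in range(32)]
    # Inside a WLDRV8, a word line fires when its group and its PA line are high.
    return [pbc[group] & pa[line] for group in range(32) for line in range(8)]


if __name__ == "__main__":
    lines = global_word_lines(0x5A)
    print(lines.index(1), sum(lines))
```

Because PA carries the ICLK pulse while PB and PC are plain address decodes, the selected word line in this model is high only while ICLK is high, mirroring the pulsed word line described in the text.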
The power-gating technique [10] can reduce leakage power by
shutting off idle blocks. However, power-gating also
has some negative effects, including a combination of noise,
performance penalty, and area and power overhead. In the
Fig. 1. Memory architecture block diagram (BM=2, bit line partition=2).
(RL=Replica Local Word line Driver, LD=Local Column MUX driver,
RCM=Replica Column MUX, DCM=Dummy Column MUX)
To ensure fast and low-power operation, the internal
timing control path uses the replica technique and a self-timing
scheme to match the data path [4]. Fig. 2 presents the waveform
of the self-timing read scheme (which uses the replica structure
presented in Fig. 1). The replica path uses replica word line
cells and bit line cells to track the memory core cells. The
replica word line cells are the same as the core word line cells;
the replica bit line cells are programmed to always store a zero.
Hence, the replica path capacitance is the same as that of the
core cells, including gate/junction and wire parasitic
capacitances. The replica scheme makes the global bit line swing
about a tenth of the supply at the moment when the replica bit
line cells have just discharged the replica global bit line to
around half of the supply during a read cycle. The replica
structure presented in Fig. 1 scales with the memory configuration.
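One way to see why the core swing stays near a tenth of the supply is a first-order constant-current model, sketched below. It is an illustration under stated assumptions, not the authors' circuit: the number of replica cells discharging in parallel (`n_replica`), the cell current, and the capacitance values are all hypothetical, and the paper does not give them. With equal bit line capacitance, `n_replica = 5` reproduces the half-supply versus tenth-of-supply relation.

```python
def bit_line_swing(cell_current_a: float, n_cells: int,
                   capacitance_f: float, t_s: float) -> float:
    """Voltage drop of a bit line discharged by n_cells equal constant-current cells."""
    return n_cells * cell_current_a * t_s / capacitance_f


def sense_time(vdd: float, cell_current_a: float,
               n_replica: int, capacitance_f: float) -> float:
    """Time for the replica bit line to fall by vdd/2 (assumed SAEN trigger point)."""
    return (vdd / 2.0) * capacitance_f / (n_replica * cell_current_a)


if __name__ == "__main__":
    vdd, cap, n_replica = 1.2, 100e-15, 5   # hypothetical values
    for i_cell in (10e-6, 20e-6, 40e-6):    # slow, typical, fast cell current
        t = sense_time(vdd, i_cell, n_replica, cap)
        swing = bit_line_swing(i_cell, 1, cap, t)
        print(f"I={i_cell:.0e} A  t={t:.2e} s  core swing={swing:.3f} V")
```

In this model the core swing at the trigger point equals `vdd / (2 * n_replica)` regardless of the cell current, which is the tracking property the replica path is meant to provide across PVT and configuration changes.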
[Waveform traces include CK, GBL (GBLB), and LWL/DLWL.]
Fig. 2. Self-timing read scheme waveform.
[Figure 4 diagram; labeled elements: Block Driver, SAEN, SA, BS1, BS2, DO, Other Block Control Circuits.]
Fig. 4. Sensing enable pulse generating circuits and sense amplifier with data output block diagram (BM=2).
[Plot: normalized SAEN timing margin versus voltage (V), 0.60 to 1.50 V; curves: forced-stack and long-channel-length.]
Sensing enable (SAEN) timing margin = Tm/Tp
(a) Sensing enable (SAEN) timing margin definition.
(b) Normalized SAEN timing margin simulated in typical condition.
Fig. 6. Definition of SAEN timing margin and its normalization diagram.
The forced-stack effect [7] can be used in the sensing enable
generation circuit to solve this problem (IVP and IVN in Fig.
4). Fig. 5 presents simulation results comparing forced-stack
devices with long channel length devices for generating the
signal SAEN under the iso-area condition.
Iso-area is achieved by making the layout area after stack
forcing (IVP in Fig. 5) identical to that of the long channel
length device (INV in Fig. 5). At high voltage the two methods
give almost the same SAEN pulse width (the ratio is almost equal
to 1), but as the voltage decreases the forced-stack method gives
a wider pulse width. As a result, the SAEN timing margin (defined
in Fig. 6(a)) is enhanced at low voltage (shown in
Fig. 6(b)).
IV. EXPERIMENTAL RESULTS
A test-chip has been fabricated using UMC's 90-nm low
leakage CMOS logic technology, with 17 memories and
embedded test circuits, including a PLL, designed for high
speed and timing measurement of the SRAMs. Fig. 7 shows
the layout of the test-chip, which contains compiled SRAMs
[Plot: SAEN pulse width ratio versus voltage, 0.7 to 1.8, with curves at -40 °C, 25 °C, and 125 °C.]
Fig. 5. Signal SAEN pulse width ratio (ratio = Tsaen_forced_stack /
Tsaen_long_channel_length) waveform. Ratio numbers are obtained
from simulation under iso-area condition.
SRAM compiler, we use the power-gating method only for the
final inverter of the global word line driver and the local word
line driver. The part power-gating technique does not require a
long wakeup time or state retention circuits because each
PMOS (MP0) connects to only eight inverters (IV0 - IV7),
which keeps the parasitic capacitance of the virtual power supply
very low. During standby mode only 1 of the 32 WLDRV8s is
active; the others decrease the standby current because
their PMOS devices are shut off. Hspice simulation results
show that a shut-off WLDRV8 decreases standby current by 80%
at high temperature, at the expense of
about 5% area overhead and about 8% speed penalty.
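The figures quoted above allow a rough estimate of the driver-level standby saving. The sketch below is a back-of-the-envelope model, not a result from the paper: it assumes every WLDRV8 leaks the same amount when powered and that the 80% reduction applies equally to each gated driver. The function name and parameters are hypothetical.

```python
def standby_fraction(n_drivers: int = 32, n_active: int = 1,
                     gated_saving: float = 0.80) -> float:
    """Remaining word line driver standby current as a fraction of the ungated case.

    ASSUMES equal leakage per WLDRV8 and the same saving on every gated driver.
    """
    if not 0 <= n_active <= n_drivers:
        raise ValueError("n_active must be between 0 and n_drivers")
    gated = n_drivers - n_active
    return (n_active + gated * (1.0 - gated_saving)) / n_drivers


if __name__ == "__main__":
    frac = standby_fraction()
    print(f"remaining standby current: {frac:.3f} of ungated")
```

With 1 active and 31 gated drivers, the model leaves (1 + 31 × 0.2) / 32 of the ungated driver leakage, slightly less than the 80% per-driver saving because one driver always stays powered.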
B. Sensing Enable
The bit line decoupled latch-type sense amplifier (SA) [9] is
used in the SRAM compiler. Fig. 4 shows the SA block
diagram and the sensing enable pulse generating circuits,
whose output signal SAEN is used to switch all SAs
located in the same block. A latch circuit (keeper) is shared by
the SAs located in Block 1 and Block 2. While SAEN is active,
the SA drives the sensing result to SAOUT
and the new data is latched by the keeper. As the power
supply decreases, the drive ability of the SA weakens,
but the parasitic capacitance of the SAOUT wire remains almost
unchanged. This problem makes it necessary to keep sufficient
design margin in the design of the sensing enable.
Fig. 7. Test-chip layout.
ranging from 64b to 512Kb in a variety of aspect ratios. The
features of the test-chip can be found in Table II.
                            32768X16              8192X8
Configuration               BM        CM          BM        CM
Area (mm²)                  0.852     0.686       0.146     0.108
Area Comparison             +24.20%               +35.18%
Static Current (µA)         9.661     8.277       1.320     1.187
DC Comparison               +16.72%               +11.20%
Dynamic Current (mA/MHz)    0.029     0.041       0.011     0.019
AC Comparison               -29.27%               -42.11%
REFERENCES
[1] M. Yoshimoto, K. Anami, H. Shinohara, T. Yoshihara, H. Takagi, S.
Nagao, S. Kayano, and T. Nakano, "A divided word-line structure in the
static RAM and its application to a 64K full CMOS RAM," IEEE
Journal of Solid-State Circuits, vol. SC-18, no. 5, pp. 479-485, Oct.
1983.
[2] J. C. Tou, P. Gee, J. Duh, and R. Eesley, "A submicrometer CMOS
embedded SRAM compiler," IEEE Journal of Solid-State Circuits, vol.
27, no. 3, pp. 417-424, Mar. 1992.
[3] J. S. Caravella, "A low voltage SRAM for embedded applications,"
IEEE Journal of Solid-State Circuits, vol. 32, no. 3, pp. 428-432, Mar.
1997.
[4] B. S. Amrutur and M. A. Horowitz, "A replica technique for wordline
and sense control in low-power SRAM's," IEEE Journal of Solid-State
Circuits, vol. 33, no. 8, pp. 1208-1219, Aug. 1998.
[5] A. Karandikar and K. K. Parhi, "Low power SRAM design using
hierarchical divided bit-line approach," Proceedings of the International
Conference on Computer Design: VLSI in Computers and Processors,
pp. 82-88, Oct. 1998.
[6] M. Jagasivamani and D. S. Ha, "Development of a low-power SRAM
compiler," IEEE International Symposium on Circuits and Systems
(ISCAS), vol. 4, pp. 498-501, May 2001.
[7] S. Narendra, S. Borkar, V. De, D. Antoniadis, and A. Chandrakasan,
"Scaling of stack effect and its application for leakage reduction,"
Proceedings of the International Symposium on Low Power Electronics
and Design, pp. 195-200, Aug. 2001.
[8] B. Yang and L. Kim, "A low-power SRAM using hierarchical bit line
and local sense amplifiers," IEEE Journal of Solid-State Circuits, vol. 40,
no. 6, pp. 1366-1376, Jun. 2005.
[9] S. Singh, S. Azmi, N. Agrawal, P. Phani, and A. Rout, "Architecture and
design of a high performance SRAM for SOC design," Design
Automation Conference, pp. 447-451, 2002.
[10] H. Jiang, M. M. Sadowska, and S. R. Nassif, "Benefits and costs of
power-gating technique," Proceedings of the 2005 International
Conference on Computer Design, pp. 559-566, 2005.
[11] J. T. Kao and A. P. Chandrakasan, "Dual-threshold voltage techniques
for low-power digital circuits," IEEE Journal of Solid-State Circuits, vol.
35, no. 7, pp. 1009-1018, July 2000.
V. CONCLUSION
A highly configurable embedded low power SRAM
compiler based on an industrial 90-nm CMOS process has
been demonstrated. The compiled SRAMs greatly reduce
dynamic current by combining the DWL and DBL techniques
with the replica and self-timing schemes. Sufficient
margin simulation and verification, together with robust
circuits, further guarantee that the compiled SRAMs have wider
margins for correct functionality and accurate characterization.
The measurement results of the test-chip have confirmed the
design correctness and low power efficiency.
ACKNOWLEDGMENT
It is our pleasure to thank Teddy and James for help with
the test-chip design, W. T. and Jason for testing of the chips,
Willis, Jack, Alex and Ya-Qi for helpful discussion on the
circuits design.
TABLE II
FEATURES OF TEST-CHIP
Foundry          UMC
Process          90-nm 1P9M low leakage CMOS
SRAM macros      90-nm 1P5M
Supply voltage   1.2 V
Die size         4000 µm × 4000 µm
Package          QFP 208
Table III gives the power measurement results at the
operating voltage of 1.2 V for two SRAM macros (the BM columns
in the table) in the test-chip. The data for the CM (Column Mux.)
macros (with bit line partition architecture) come from
Faraday commercial SRAM compiler datasheets. The results
show that this design reduces dynamic current by 29% for
the 512Kb SRAM and by 42% for the 64Kb SRAM. The
static current increases slightly in this work due to the
additional circuit overhead of the DWL implementation.
However, the total average current dissipation is still reduced,
as it is dominated by dynamic current dissipation.
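The comparison rows of Table III and the claim that dynamic current dominates can be checked with a few lines of arithmetic. The sketch below uses the values from Table III; the function names, the simple model of total current as static plus frequency times dynamic current per MHz, and the 100 MHz example frequency are assumptions made for illustration.

```python
# Values transcribed from Table III (BM = this work, CM = reference compiler).
TABLE_III = {
    "32768x16": {
        "BM": {"area_mm2": 0.852, "static_ua": 9.661, "dyn_ma_per_mhz": 0.029},
        "CM": {"area_mm2": 0.686, "static_ua": 8.277, "dyn_ma_per_mhz": 0.041},
    },
    "8192x8": {
        "BM": {"area_mm2": 0.146, "static_ua": 1.320, "dyn_ma_per_mhz": 0.011},
        "CM": {"area_mm2": 0.108, "static_ua": 1.187, "dyn_ma_per_mhz": 0.019},
    },
}


def percent_change(bm: float, cm: float) -> float:
    """Relative difference of the BM value with respect to the CM value, in percent."""
    return (bm - cm) / cm * 100.0


def total_current_ma(macro: dict, f_mhz: float) -> float:
    """ASSUMED model: total current = static current + frequency * dynamic current."""
    return macro["static_ua"] / 1000.0 + macro["dyn_ma_per_mhz"] * f_mhz


def break_even_mhz(bm: dict, cm: dict) -> float:
    """Frequency above which BM draws less total current than CM in this model."""
    extra_static_ma = (bm["static_ua"] - cm["static_ua"]) / 1000.0
    saved_dynamic = cm["dyn_ma_per_mhz"] - bm["dyn_ma_per_mhz"]
    return extra_static_ma / saved_dynamic


if __name__ == "__main__":
    for name, pair in TABLE_III.items():
        bm, cm = pair["BM"], pair["CM"]
        for key in ("area_mm2", "static_ua", "dyn_ma_per_mhz"):
            print(f"{name} {key}: {percent_change(bm[key], cm[key]):+.2f}%")
        print(f"{name} break-even: {break_even_mhz(bm, cm):.4f} MHz")
        print(f"{name} total at 100 MHz: BM {total_current_ma(bm, 100):.3f} mA, "
              f"CM {total_current_ma(cm, 100):.3f} mA")
```

In this model the extra static current is offset once the clock exceeds a fraction of a megahertz for both macros, which supports the statement that total average current is dominated by the dynamic term at practical operating frequencies. Recomputed percentages may differ from the table in the last digit because of rounding.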
TABLE III
COMPARISON WITH OTHER WORKS
Silicon measurement confirmed complete functionality over
voltage (0.9 - 1.8 V) and temperature (-40 - 125 °C) ranges
for all memories. The SCAN, March C- and March C+
patterns were applied by the memory BIST (Built-in Self-Test)
embedded in the test-chip. An embedded PLL was used for
SRAM timing measurements and high speed memory BIST
testing (the maximum frequency can reach 500 MHz).