Directed Study Report

AUTOMATED STANDARD CELL LIBRARY GENERATION & STUDY OF CELL LIBRARY FUNCTIONAL CONTENT
Yunbum Jung Dept. of Electrical Engineering and Computer Science University of Michigan
jungy@engin.umich.edu
ABSTRACT
As the operating frequency of circuits increases, a fixed cell library limits the performance that a synthesized circuit can reach. One way to go beyond this limit is to use a fluid cell library. A fluid cell library provides customized drive strengths for each cell that are not available in the fixed cell library but are required for fine circuit tuning. To generate a fluid cell library as well as a fixed cell library effectively, an automated flow is applied to standard cell library generation. This paper presents the procedure for automated standard cell library generation and an overview of cell characterization. It also examines how each logic function in a cell library affects an automated circuit design. The experiment shows that the contents of the target cell library should be chosen selectively to obtain a good-quality synthesized design.
1. INTRODUCTION
Although a standard cell methodology reduces design effort in terms of time and cost, the performance of a synthesized design is poor compared to that of a custom design. Many studies have sought to improve the quality of a synthesized design [1, 2, 3, 4, 5], and it has been shown that simple modifications to the cell library significantly impact the performance of a synthesized design [1]. Especially in high performance applications, the use of a fixed cell library prevents fine device tuning for delay and power optimization [5]. As an alternative to a fixed cell library, the use of a fluid cell library suitable for circuit tuning was suggested in [3]. This kind of effort increases the need for automation in generating a cell library, since an automated flow can easily create the various cells required for circuit optimization. In generating a new cell library, accurate characterization of each cell is important since other tools that use these cells predict the behavior of circuits based on the characterized cell data.
Before mentioning the advantages of automated cell characterization, it is worth considering the disadvantages of manual cell characterization. Manual cell characterization requires a cell designer to create netlists and run simulations interactively. With this method, stimulus must be developed and applied to the cell being characterized. Once the simulation is complete, the data is extracted from each run. The cell designer then inserts the data into the gate level models and the datasheets. This manual process is error-prone and requires tremendous effort [6]. A cell library generation process includes cell design, layout generation, and physical abstraction as well as cell characterization, and several designers may share these tasks. If misunderstandings exist among designers, the manual method may therefore lose consistency in its procedures. To reduce both the effort needed to generate a new standard cell library and the number of inadvertent errors introduced when these tasks are done manually, an automated flow is applied to generate a new cell library. A well-organized, automated flow provides consistency in procedures, increases the range of simulation capabilities at the cell characterization step, and minimizes the risk of errors.
2. STANDARD CELL LIBRARY GENERATION
As shown in Fig. 1, the process of generating a standard cell library consists of four major steps. Most of the steps are carried out automatically. Netlist files in SPICE format are created at the cell design step. The layout, which is the physical implementation of the netlist, is generated in the layout step. Verification and parasitic extraction are also performed during the layout step. Stimulus generation, SPICE simulation, and data compilation are part of characterization. Physical abstraction of each cell for the place & route tool can be carried out in parallel with characterization. A physical abstract includes information about blockage layers, pin locations, and cell symmetry.
Fig. 1 Standard cell library generation flow
2.1 CELL DESIGN

The cell design phase consists of circuit design and transistor sizing. After drawing a schematic representing a logic gate, the cell designer determines the transistor sizes of each cell. Since the delay in circuits depends not only on the drive strength of each stage but also on its P/N width ratio, it is important to provide a good P/N width ratio for each cell in a standard cell library. An optimal P/N ratio for each cell is derived such that it minimizes path delay [7]. Based on the optimal P/N ratio, the netlist template for each cell is generated. This netlist template is used for creating the cell netlist with the intended size. It has been shown that providing cells with only a single drive strength degrades the speed of a synthesized design [1]. Therefore, providing each cell in a variety of drive strengths is considered a standard cell library design guideline. According to this guideline, most of the cells in a fixed standard cell library are designed in 4 drive strengths (xL, x1, x2, and x4), and the buffers and inverters have 5 additional drive strengths (x3, x8, x12, x16, and x20). To provide a wider variety of drive strengths easily, the size of each cell is parameterized. Fig. 2 illustrates how a parameterized netlist template is used. The P/N ratio of the inverter is fixed, but the widths of the transistors are scaled proportionally. If the number of transistors connected in series is large, the falling/rising time degrades because of the higher resistance. Therefore, each cell is restricted to at most 4 series transistors.
Fig. 2 Parameterized inverter template
The cells of a standard cell library are categorized into seven groups: negative unate logic cells, positive unate logic cells, arithmetic cells, sequential cells, special cells, inverted input cells, and low skew cells.
Negative unate logic cells consist of the INV, NAND, NOR, AOI, OAI, and XNOR function families. Positive unate logic cells comprise the BUF, AND, OR, AO, OA, and XOR function families. The FADD (Full Adder) and HADD (Half Adder) function families compose the arithmetic cells. The DFF and LATCH function families are the sequential cells. MUX and Tri-state belong to the special cells. The two remaining groups are inverted input cells and low skew cells. Inverted input cells such as NAND2B and NOR3BB (B means inverted input) provide inverted inputs, which effectively makes the connection between the inverter and the logic function internal to the cell. Low skew cells are used for clock distribution schemes where low skew and high speed are primary concerns. The CLKINV and CLKBUF families belong to the low skew cells.
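The parameterized netlist template of Section 2.1 can be illustrated with a minimal sketch. It assumes a two-transistor inverter whose widths are scaled by the drive strength while the P/N ratio stays fixed. The unit widths, the channel length, the cell naming, and the model names `pch` and `nch` are hypothetical placeholders, not values from the library described here.

```python
# Sketch of a parameterized inverter netlist template.
# All widths, the channel length, and the model names "pch"/"nch" are
# hypothetical placeholders, not values from the library in the paper.
INV_TEMPLATE = (
    ".subckt INVX{name} A Y VDD VSS\n"
    "MP0 Y A VDD VDD pch w={wp:.2f}u l={length:.2f}u\n"
    "MN0 Y A VSS VSS nch w={wn:.2f}u l={length:.2f}u\n"
    ".ends\n"
)


def scaled_widths(drive, wp_unit=1.0, wn_unit=0.5):
    """Scale both widths by the drive strength so the P/N ratio stays fixed."""
    return wp_unit * drive, wn_unit * drive


def make_inverter(drive, length=0.18):
    """Fill the template for one drive strength, e.g. 2 for an x2 cell."""
    wp, wn = scaled_widths(drive)
    return INV_TEMPLATE.format(
        name=format(drive, "g"), wp=wp, wn=wn, length=length
    )


if __name__ == "__main__":
    # A fluid library can request sizes that a fixed library does not offer.
    for drive in (1, 2, 4, 2.5):
        print(make_inverter(drive))
```

Because only the drive factor changes between calls, an intermediate size such as 2.5 is produced by the same template as the fixed sizes, which is the property a fluid cell library relies on.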
2.2 LAYOUT
As shown in Fig. 3, the layout phase consists of physical layout generation, layout verification, and parasitic extraction. ProTech, ProSpin, ProGen, and ProSticks from Prolific are used to generate the physical layout from the netlist in SPICE format. ProTech configures the fabrication technology specification, design styles, the cell template, and the layer information. ProSpin reads in SPICE netlists describing individual cells and produces corresponding .cel and .db files that are read by ProGen and ProSticks respectively. The .cel file specifies the generators that should be called to create the final layout data and contains cell-specific information such as transistor sizes and node names. The .db file is used to generate the symbolic layout that is used by ProSticks. ProGen reads in the .cel file and invokes the proper generators to produce a loose physical layout. ProGen then compacts this initial layout to produce a final cell that is as small as possible [8]. Fig. 4 shows an example of a layout generated by the Prolific suite.
Fig. 3 Layout step
Fig. 4 AO21X2 layout
Although ProGen tries to satisfy all design rules and layout constraints, it sometimes generates a cell with violations such as a cell height violation or design rule errors. Thus, the verification step is an essential part of layout generation to ensure a correct layout. Calibre from Mentor Graphics is used for accurate verification. Calibre performs layout vs. schematic (LVS) checking as well as design rule checking (DRC).
Once design violations are found, ProSticks can be used to fix them. ProSticks provides a graphical user interface for the symbolic layout. Fig. 5 shows an example of a symbolic layout. Simple routing rearrangement and transistor relocation on the symbolic layout can correct most of the errors in simple cells. In complex cells such as MUX and ADDER, however, a height violation is hard to fix. Special options such as poly contact merging and diffusion/metal-1 contact wire are used to obtain more routing space. Aggressive poly contact merging eliminates unnecessary metal-1 wire between poly contacts. Diffusion nodes can be connected to the power or ground rails through a diffusion and metal-1 contact wire, so that other metal-1 wires can pass over the active region.
Fig. 5 Symbolic AO21X2 layout
The last step of the layout phase is parasitic extraction. This is the process of creating an electrical model of the physical interconnections. The physical interconnect does not behave as an ideal wire. Instead, it behaves like a network of capacitances, inductances, and resistances, which can dominate circuit behavior. Power or timing analysis of a design is of little value without the parasitic network. Xcalibre from Mentor Graphics is used for parasitic extraction. The original SPICE netlists combined with the parasitic network are used for characterization simulation.
2.3 CHARACTERIZATION
Characterization enables a designer to abstract timing and power models. This abstraction shifts a design from the transistor level to the gate level, enabling synthesis, floor planning, place & route, and delay & power calculation [9].
As shown in Fig. 6, a cell is usually characterized in terms of the input slew rate tin and the output capacitive load CL, where supply voltage, temperature, and process values are given. This traditional approach is challenged in the era of deep sub-micron processes: a single capacitance is no longer enough to represent a load that acts like an RC network. However, the auto-characterization tool is implemented based on the traditional definition. Another concern of cell characterization is stimulus generation. Current standard cell libraries can handle only a single switching input; that is, they cannot deal with multiple switching inputs, which are frequently encountered in real operation. This is another shortcoming of cell characterization. Although an exhaustive enumeration of all input states satisfies the stimulus requirement, it wastes simulation time. Therefore, a minimal set of input vectors is selected to reduce simulation time. The characterized cells have several common attributes such as output transition times, propagation delays, internal switching power, leakage power, input pin capacitances, and cell area. Sequential cells additionally require characterization of relative signals. Relative signals are signals that are timing-critical to another signal's state. Relative signals include setup, recovery, hold, and removal time [6].
Fig. 6 Characterization in terms of input slew rate and load capacitance
2.3.1 Timing characterization

The timing attributes such as output transition time, propagation time, setup constraint, hold constraint, recovery constraint, and removal constraint are modeled for timing characterization. Both output transition time and propagation time are required by gate level synthesis and delay calculation tools [9]. In those tools, the output transition time (and the interconnect wire parasitic data if available) is used for estimating the input slew rates of successive cells, and the propagation time of each cell is extracted from the cell library table based on input slew rate and load. Consequently, the delay between two nodes in a design can be calculated from the propagation time and output transition time of each cell on the path between the two nodes. Since a signal arrives at an input pin with a ramp time and a sequential cell takes some time to latch the data signal correctly, the data signal that arrives at a sequential cell has to be stable before and after clocking, as defined by the setup constraint and the hold constraint respectively. For edge-triggered sequential cells, the setup and hold constraints are measured relative to the clock transition to active. If an unstable data signal arrives at an input near the clock transition to active, the edge-triggered sequential cell may evaluate a wrong output or go through a metastable state. The unintended data is held until the next clock transition to active. For level-sensitive sequential cells, on the other hand, the setup and hold constraints are measured relative to the clock transition to inactive. If an unstable data signal arrives at an input near the clock transition to inactive, the unintended data may be held while the clock is in the inactive state, as in the case of the edge-triggered sequential cell. In a design with sequential cells, both attributes constrain the delay of the combinational circuits placed between two sequential cells for a given operating frequency. Fig. 7 shows a consecutive pipeline stage formed by two edge-triggered sequential cells, comprising a long path and a short path [10].
Fig. 7 Consecutive pipeline stage formed by two edge-triggered sequential cells
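The table-based delay calculation described for propagation time and output transition time can be sketched as follows. This is an illustrative assumption about how a delay calculator might use the characterized data: it interpolates bilinearly in a table indexed by input slew and load, and passes each stage's output transition time on as the next stage's input slew. The table values, axis points, and function names are hypothetical.

```python
from bisect import bisect_right


def _segment(axis, x):
    """Index i of the axis segment [axis[i], axis[i+1]] used for x.

    Values outside the axis reuse the first or last segment, so the
    lookup extrapolates linearly there.
    """
    i = bisect_right(axis, x) - 1
    return max(0, min(i, len(axis) - 2))


def lookup(table, slews, loads, slew, load):
    """Bilinear interpolation in a table indexed as table[slew][load]."""
    i, j = _segment(slews, slew), _segment(loads, load)
    fs = (slew - slews[i]) / (slews[i + 1] - slews[i])
    fl = (load - loads[j]) / (loads[j + 1] - loads[j])
    low = table[i][j] * (1 - fl) + table[i][j + 1] * fl
    high = table[i + 1][j] * (1 - fl) + table[i + 1][j + 1] * fl
    return low * (1 - fs) + high * fs


def path_delay(stages, slews, loads, input_slew):
    """Sum propagation times along a path, passing each stage's output
    transition time on as the input slew of the next stage."""
    total, slew = 0.0, input_slew
    for delay_table, transition_table, load in stages:
        total += lookup(delay_table, slews, loads, slew, load)
        slew = lookup(transition_table, slews, loads, slew, load)
    return total


# Hypothetical 2x2 tables (values in ns), with axes in ns and pF.
SLEWS = [0.1, 0.5]
LOADS = [0.01, 0.05]
DELAY = [[0.10, 0.20], [0.14, 0.24]]
TRANSITION = [[0.08, 0.30], [0.12, 0.34]]

if __name__ == "__main__":
    stage = (DELAY, TRANSITION, 0.03)
    print(path_delay([stage, stage], SLEWS, LOADS, input_slew=0.3))
```

Interconnect parasitics are ignored in this sketch; each stage sees only a single lumped load, in line with the traditional characterization definition.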
On the long path, the maximum time available for evaluation of combinational logic in one clock period, $t_{CLmax}$, is given by
$$t_{CLmax} = t_{CYeff} - (t_{SU2} + t_{CQ1})$$
where $t_{CYeff}$ is the effective cycle time, $t_{SU2}$ is the setup constraint of the second sequential cell, and $t_{CQ1}$ is the Clock to Q propagation time of the first sequential cell. On the short path, if the next state value from the second sequential cell reaches the first sequential cell during the hold time of the first sequential cell, the next state value will corrupt the current state value of the first sequential cell. The minimum propagation time through combinational logic on the short path, $t_{CLmin}$, is expressed by
$$t_{CLmin} = t_{SK} + (t_{H1} - t_{CQ2})$$
where $t_{SK}$ is the clock skew between the clocking of both sequential cells, $t_{H1}$ is the hold constraint of the first sequential cell, and $t_{CQ2}$ is the Clock to Q propagation time of the second sequential cell.
Recovery and removal constraints describe the timing requirements on control signals, such as preset or clear, with respect to the clock signal. A sequential cell needs some time to leave the influence of the control signal after the control signal becomes inactive. This time is referred to as the recovery constraint. Therefore, the control signal should become inactive at least the recovery constraint before the clock edge to ensure that clocking is effective. The removal constraint, on the other hand, is the minimum time the control signal needs to influence the latched value. If a control signal becomes inactive before the removal constraint has elapsed, the control signal will not affect the operation of the sequential cell.
In the simulations for the timing attributes mentioned above, the minimal stimulus comprises input vectors that cause an output transition, although the necessary input vectors differ for each attribute. Input vectors can be subdivided into data signals, control signals, and clock. More details of the stimulus for each attribute are described in the following sections.
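As a small worked example of the long path and short path bounds, the sketch below evaluates tCLmax = tCYeff - (tSU2 + tCQ1) and tCLmin = tSK + (tH1 - tCQ2) and checks a combinational block against them. The function names and all numeric values are hypothetical and chosen only for illustration.

```python
def max_logic_delay(t_cycle_eff, t_setup2, t_clk_to_q1):
    """Long path bound: tCLmax = tCYeff - (tSU2 + tCQ1)."""
    return t_cycle_eff - (t_setup2 + t_clk_to_q1)


def min_logic_delay(t_skew, t_hold1, t_clk_to_q2):
    """Short path bound: tCLmin = tSK + (tH1 - tCQ2)."""
    return t_skew + (t_hold1 - t_clk_to_q2)


def stage_is_safe(longest, shortest, t_max, t_min):
    """True when the slowest path meets setup and the fastest path meets hold."""
    return longest <= t_max and shortest >= t_min


if __name__ == "__main__":
    # Hypothetical values in ns.
    t_max = max_logic_delay(t_cycle_eff=5.0, t_setup2=0.25, t_clk_to_q1=0.25)
    t_min = min_logic_delay(t_skew=0.25, t_hold1=0.5, t_clk_to_q2=0.25)
    print(t_max, t_min, stage_is_safe(4.0, 0.75, t_max, t_min))
```

A larger setup constraint or Clock to Q time shrinks the window available to the combinational logic, which is why these characterized values bound the achievable clock frequency.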
Output transition / Propagation time

As shown in Fig. 8, the output transition time is measured between two pre-determined edge threshold values of the output signal (e.g., 10% of the voltage range to 90% of the voltage range). The propagation time is measured between a pre-determined delay threshold value of the input signal and that of the output signal (e.g., 50% of the voltage range of the input signal to 50% of the voltage range of the output signal). A data transition that can cause an output transition is the required condition of the stimulus for both attributes. In sequential cells, the Clock to Q propagation time can be obtained from the clock transition. In level-sensitive sequential cells such as LATCH, the D to Q propagation time can also be obtained with the clock in the active state.
Fig. 8 Output transition and propagation time
Setup / Hold constraint

The setup constraint is generally defined as the minimum time allowed between the arrival of the data and the transition of the clock signal. If the data signal makes a transition during the setup time, an incorrect value may be latched [11, 12]. This definition needs to be modified to improve the delay performance of a synthesized design that uses sequential cells, because the minimum setup constraint tends to cause a long Clock to Q propagation time. Therefore, one more condition is added to the setup definition: the setup constraint must not degrade the Clock to Q propagation time by more than a pre-determined tolerance (e.g., 5% of the Clock to Q propagation time obtained when there is ample time between the data arrival and the clock transition). The hold constraint describes the minimum time allowed between the transition of the clock signal and the latching of the data. If the data signal makes a transition during the hold time, an incorrect value may be latched [11, 12]. Data that can change the output state is the required condition of the stimulus. For a cell with a control input, the control signal should be set to the inactive state.
Fig. 9 illustrates the setup and hold constraints of an edge-triggered sequential cell where the clock is active high. Fig. 10 illustrates the setup and hold constraints of a level-sensitive sequential cell where the clock is active high.
Fig. 9 Setup and Hold constraint of an edge-triggered sequential cell
Fig. 10 Setup and Hold constraint of a level-sensitive sequential cell
Recovery / Removal constraint

The recovery constraint describes the minimum allowable time between the control pin transition to the inactive state and the active edge of the synchronous clock signal [12]. As with the setup time, the tolerance condition is added to the definition of the recovery constraint. The removal constraint describes the minimum allowable time between the active edge of the clock pin while the asynchronous control pin is active and the inactive edge of the asynchronous control pin [12]. Data that can change the preset/clear value and a control transition to inactive are the required conditions of the stimulus. The only difference in measuring the recovery and removal constraints of a level-sensitive cell is that the clock transition to inactive is the required condition of the stimulus, whereas the clock transition to active is the required condition for an edge-triggered sequential cell.
Fig. 11 illustrates the recovery and removal constraints of an edge-triggered sequential cell where the clock is active high and the control is active low.
Fig. 11 Recovery and removal constraint of an edge-triggered sequential cell
Bisection method
The bisection method is used to characterize relative signals. Bisection is a method of optimization that employs a binary search to find the value of an input variable associated with a goal value of an output variable. The search locates the goal value of the output variable within a search range of the input variable by iteratively halving that range, which converges rapidly on the target value. The measured value of the output variable is compared with the goal value at every iteration [13].
Fig. 12 shows how to determine the setup constraint with the bisection method, where the goals are the output transition and the allowable Clock to Q propagation time. To start the binary search, a lower boundary and an upper boundary are specified. Data transition 1 at the lower boundary is early enough to cause a good output signal. Data transition 2 at the upper boundary is too late to change the output signal. This means that the candidate for the setup constraint lies between the upper and the lower boundaries. Consequently, the bisection algorithm tests a data transition at the mid-point between both boundaries. Data transition 3 at the mid-point changes the output signal but causes a long propagation time; that is, data transition 3 does not meet the goals. The bisection algorithm therefore sets the mid-point as the new upper boundary. Given the new range, the bisection algorithm tests a data transition at the new mid-point. If the output value satisfies the goals, the new mid-point is set as the new lower boundary. Otherwise, the mid-point is set as the new upper boundary. The bisection algorithm then tests a data transition at the mid-point of the new range again, and it iterates in this way until the binary search reaches a process-termination criterion. Data transition 4 is the latest data transition that satisfies the goals. Therefore, the setup constraint, $t_{SU}$, is given by
$$t_{SU} = t_2 - t_1$$
Fig. 12 Setup constraint search using bisection
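The bisection search for the setup constraint can be sketched in a few lines. This is a minimal illustration, not the characterization tool's implementation: the SPICE run is replaced by a hypothetical analytic model `toy_clk_to_q`, and the search variable is the setup time itself (the time between data arrival and the clock edge), so the "passing" bound corresponds to an early data transition and the "failing" bound to a late one. All constants are assumptions.

```python
def toy_clk_to_q(setup):
    """Stand-in for a SPICE run: Clock to Q time (ns) for a setup time (ns).

    Returns None when the data arrives too late and the output never
    switches. The shape and every constant are hypothetical.
    """
    if setup <= 0.02:
        return None
    return 0.30 * (1.0 + 0.002 / (setup - 0.02))


def meets_goals(setup, nominal, tolerance, simulate):
    """Goals: the output switches, and Clock to Q degrades by at most
    `tolerance` relative to the nominal value."""
    clk_to_q = simulate(setup)
    return clk_to_q is not None and clk_to_q <= nominal * (1.0 + tolerance)


def find_setup(simulate, passing=1.0, failing=0.0,
               tolerance=0.05, resolution=1e-4):
    """Binary search for the smallest setup time that still meets the goals."""
    nominal = simulate(passing)  # reference Clock to Q with ample setup time
    if nominal is None or meets_goals(failing, nominal, tolerance, simulate):
        raise ValueError("bounds must bracket the setup constraint")
    while passing - failing > resolution:
        mid = 0.5 * (passing + failing)
        if meets_goals(mid, nominal, tolerance, simulate):
            passing = mid   # goals met: a later data arrival may still work
        else:
            failing = mid   # goals missed: data must arrive earlier
    return passing


if __name__ == "__main__":
    print(find_setup(toy_clk_to_q))
```

Each iteration costs one simulation and halves the interval, so the number of simulations grows only logarithmically with the required resolution.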
2.3.2 Power characterization

Power dissipation can be handled in terms of static power and dynamic power, as shown in Fig. 13. Static power is the power that is dissipated when the cell is stable, that is, when there is no signal transition on any input or output of the cell. Static power is dissipated in a number of ways. The largest consumption of static power results from source to drain subthreshold leakage. This leakage is caused by a reduced threshold voltage that prevents the gate from turning off completely. Static power dissipation also occurs when current leaks between the diffusion layers and the substrate. For this reason, static power is often called leakage power [12].
Fig. 13 Power dissipation: (a) static power, (b) dynamic power (rising), (c) dynamic power (falling). Labeled quantities: Ileakage, ISC, Iinter-node, ICL, Cinter-node, CL.
Dynamic power is the power dissipated when a circuit is active. Dynamic power is divided into switching power and internal power. Switching power results from the charging/discharging of the load capacitance. Switching power is calculated by a gate level power analysis tool once the interconnect parasitics are known [9]. Therefore, switching power is excluded from the power characterization of cells. While input or output signals switch, power is also dissipated by internal capacitive charging/discharging and by short circuit current. Since this kind of power is dissipated inside the cell during signal switching, it is called internal power.
Power annotated library

PowerArc from Synopsys is used to generate a power-annotated library from a library containing no power data. This tool automatically calculates the stimulus for power characterization and runs the power simulation. PowerArc does not distinguish rising and falling output transitions in calculating internal power [14]. In both output transition cases, the internal power dissipation is given by
$$P = \int_{t_1}^{t_2} I(t)\, V_{dd}\, dt - \frac{1}{2} C_L V_{dd}^2$$
where $I(t)$ is the current at the power node, $V_{dd}$ is the supply voltage, and $C_L$ is the load capacitance. Because the same load term is subtracted for both transitions, negative power values can appear in the power-annotated library generated by PowerArc. Internal power is overestimated at a rising output transition and underestimated at a falling output transition. Nevertheless, these internal power values are acceptable if internal power dissipation is considered over a given period.
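The internal power expression, the integral of I(t) Vdd over the transition minus one half of CL Vdd squared, can be evaluated numerically from a sampled supply current. The sketch below uses trapezoidal integration on hypothetical triangular current pulses; it is not PowerArc's procedure, and the sample values are assumptions chosen to show how a falling output transition can yield a negative value.

```python
def internal_power_term(times, currents, vdd, c_load):
    """Evaluate integral(I(t) * Vdd dt) - 0.5 * C_L * Vdd**2 for one transition.

    The integral uses the trapezoidal rule on the sampled supply current.
    With SI inputs (s, A, V, F) the result is in joules.
    """
    charge = sum(
        0.5 * (i0 + i1) * (t1 - t0)
        for t0, t1, i0, i1 in zip(times, times[1:], currents, currents[1:])
    )
    return charge * vdd - 0.5 * c_load * vdd ** 2


# Hypothetical samples: triangular supply-current pulses over 1 ns.
TIMES = [0.0, 0.5e-9, 1.0e-9]
RISING_CURRENT = [0.0, 60e-6, 0.0]   # supply charges the load and the cell
FALLING_CURRENT = [0.0, 4e-6, 0.0]   # only a small short-circuit current
VDD = 1.8
C_LOAD = 10e-15

if __name__ == "__main__":
    print(internal_power_term(TIMES, RISING_CURRENT, VDD, C_LOAD))
    print(internal_power_term(TIMES, FALLING_CURRENT, VDD, C_LOAD))
```

In the falling case the supply delivers almost no charge, yet the same load term is subtracted, so the result is negative; summed over a rising and a falling transition the two errors offset each other.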
2.3.3 Input capacitance and cell area

Input capacitance is used as part of the output load of the preceding cell when switching power is calculated by a gate level power analysis tool or when path delay is calculated by a timing analysis tool. The input capacitance value is extracted from the parasitic file of each cell that is produced during the layout phase. The cell area attribute is used to estimate the total cell area. The cell area value is calculated from the layout during the layout phase and stored in temporary files. In the characterization phase, that value is inserted into the cell area table of the cell library.
2.4 PHYSICAL ABSTRACTION AND REMAINING TASKS

The individual cells of the cell library are described in GDSII layout format. For place & route tools to refer to cells in a cell library, physical abstracts of the cell layouts are needed. The physical abstracts contain information about blockage layers, pin locations, and cell symmetry. Envisia Abstract Generator from Cadence is used to create physical abstract views for all cells. The physical abstracts are exported in library exchange format (*.lef) and used in Cadence place & route tools. Once the characterization step is completed, the *.lib file in ASCII text format is created. This file is compiled into a *.db file in Synopsys database format using Library Compiler from Synopsys. The *.lib file is also compiled into a *.tlf file in timing library format using syn2tlf from Cadence. The *.tlf file is used for timing driven placement and routing in the Cadence place & route tool.
3. RELATED STANDARD CELL LIBRARY STUDY

In the standard cell design style, the target cell libraries as well as the design flow affect the quality of a final circuit. Typically, synthesis techniques optimize a circuit in two phases, a logic minimization phase and a library-mapping phase [15, 16, 17]. During the library-mapping phase, the synthesis tool chooses the structures and the sizes of the gates from the target cell libraries. It is apparent that the limited sizes of each gate in the target cell libraries prevent good solutions. As an effort to provide finer granularity of the gate size, transistor level resynthesis [2] and a fluid cell library [3] were suggested, and those methods achieved performance improvement.
Another possible approach to improving the quality of the final circuit is to choose the logic functions in the target cell library wisely. In order to examine how each logic function group mentioned in section 2.1 affects the quality of the circuit and how the synthesis tool picks cells to satisfy constraints, a set of benchmark circuits is synthesized, placed & routed, and resynthesized using the libraries shown in Table 1. These libraries are formed by selectively choosing logic function groups from the Artisan standard cell library in TSMC 0.18-micron technology.
Table 1 Overview of cell libraries
Function groups (columns): Tri-state, Sequential, Negative unate, Positive unate, Inverted input, Arithmetic, Mux, Low skew
Lib 1: x x x
Lib 2: x x x x
Lib 3: x x x x
Lib 4: x x x x x x
Lib 5: x x x x x x x
Lib 6: x x x x
Lib 7: x x x x x
Lib 8: x x x x x x x x
3.1 Design Flow
As shown in Fig. 15, initial synthesis and place & route are carried out to get the initial layout. A suitable wire load model is then extracted from the initial layout, and the cell library is upgraded with the extracted wire load model. With the upgraded cell library, synthesis and place & route are performed and the layout is generated again. From this layout, parasitics can be extracted. Those parasitics are used as input for re-optimization of the cell sizes. During the re-optimization, cells with new sizes replace the cells chosen at the synthesis step. To provide input for timing and power analysis, the resynthesized circuit is placed & routed again and parasitics are extracted from the final layout. Design Compiler from Synopsys is used for synthesis. Silicon Ensemble from Cadence is used for place & route and clock tree generation. HyperExtract from Cadence is used for wire load model extraction and parasitic extraction. Library Compiler from Synopsys is used for upgrading the cell library with the extracted wire load model. PrimeTime from Synopsys is used for timing analysis. NanoSim from Synopsys is used for power analysis.
Fig. 15 Standard cell design flow
depends on the operation speed of a circuit and what kind of circuit is designed.
3.3.1 DSP circuits

Given the target clock range (4~9ns) that is used in the examination, the target clock vs. delay curves of VP2 benchmark shows the low plateau region in Fig. 16. The area of circuits still increases while there is no performance increase in this low plateau region. Consequently, too small target clock degrades the quality of design. From the low plateau region, the lowest delay can be obtained. Therefore, the power (or area) vs. delay plots of VP2 benchmark will show the characteristics at the low or middle delay regions. On the other hand, the target clock vs. delay curves of the CMUDSP benchmark gradually decreases without a low plateau region in Fig. 17. It means that there exists the room to reduce delay of CMUDSP benchmark. Therefore, the power (or area) vs. delay plots of CMUDSP will show the characteristics at the middle or high delay regions
10
3.2 Benchmark circuits

Delay [ns]
Library 1 9 Library 2 Library 3 Library 4 8 Library 5 Library 6 7 Library 7 Library 8
Four benchmark circuits were used for the experiments. In order to observe how each library affects datapath-dominated circuits and controllerdominated circuits, two of them (VP2 and CMUDSP) are chosen from Digital Signal Processor (DSP) and the others (GPIO and CAN) are chosen from Controller. Arithmetic functions are heavily used in DSP while it is less frequently used in Controller. Using the standard cell design flow described in section 3.1, each benchmark circuit is designed with various target clocks and different libraries.
6 3 4 5 6 7 8 9 10 Target clock [ns]
Fig. 16 Target clock vs. delay for VP2

10
9 Library 1 8 Library 2 Library 3 7 Library 4 Library 5 6 Library 6 Library 7 5 Library 8
3.3. Results
Power (or area) vs. delay plots are an effective way to compare cell libraries since the efficiency of achieving a particular delay is important [4]. In order to make power (or area) vs. delay plots, a set of benchmark circuits are designed using different libraries within some target clock range. Through timing analysis and power analysis, power (or area) vs. delay plots are obtained. From this study, several interesting phenomena are found. The largest library, lib 8, does not always produce the best result. Instead, the choice of the best library
Delay [ns]
4 3 4 5 6 7 8 9 10 Target clock [ns]
Fig. 17 Target clock vs. delay for CMUDSP As shown in Fig. 18 and 19, the library, lib 4, including complex cells such as arithmetic and mux cells shows good area efficiency at the high delay region while adding inverted input cells to them (lib 5 and lib 8) somewhat degrades area
efficiency at the high delay region. But, as the delay decreases, lib 4 is prone to increase the area of design quickly. Consequently, lib 4 results in worse area efficiency than lib 5 or lib 8 at the middle or low delay regions. This means that for good area efficiency at high speed, inverted input cells are needed.
700 600 500 400 300 200 100 0 AND ADD XOR MX 8ns 9ns
260000 Library 1 Library 2 220000 Cell area Library 3 Library 4 200000 Library 5 180000 Library 6 Library 7 160000 Library 8
240000
Fig. 20 Complex cell count of VP2 where lib 4 is used
140000 6 7 8 Delay [ns] 9 10
Fig. 18 Delay vs. area curves for VP2
500000 490000 480000 Library 2 470000 Library 3 Cell area 460000 Library 4 450000 Library 5 440000 Library 6 430000 Library 7 420000 410000 400000 4 5 6 7 Delay [ns] 8 9 10 Library 8 Library 1
1400 1200 1000 800 600 400 200 0
8ns 9ns
Fig. 19 Delay vs. area curves for CMUDSP As delay becomes close to the lowest delay, all libraries show the similar area efficiency in Fig. 18. This can be explained by the decomposition of complex cells. As expected, the synthesis tool basically increases the ratio of cells with higher drive strength in the circuit as target clock frequency increases. When this simple increase reaches the limit, complex cells such as positive unate and arithmetic cells are additionally decomposed into relatively simple cells. These decompositions come with the cost of the sudden increase of total cell count as well as the increase of total cell area. Therefore, the area benefits of complex cells are reduced. These decompositions of complex cells are observed in the VP2 benchmark. As the target clock changes from 9ns to 8ns, the total cell count increases from 4208 to 7148. Fig 20 shows the complex cell count and Fig 21 shows the relatively simple cell count at both target clocks (8ns and 9ns).
In the low delay region, lib 1 produces the most power-efficient circuits, as shown in Fig. 22, while all libraries tend to build circuits of similar area efficiency. In the high delay region, on the other hand, power efficiency is similar across all libraries, as shown in Fig. 23. From this result, it is apparent that the power density of complex cells such as arithmetic and mux cells is higher than that of negative cells. Therefore, power efficiency should guide library selection for high-speed designs, while area efficiency should guide it for low-speed designs.
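The inference about power density can be made explicit. This is a minimal sketch, assuming that power density means average power divided by total cell area (the report does not define it) and taking lib 1 as the library built from negative cells, as the paragraph implies. The symbols $P_1, A_1$ (power and area of a circuit synthesized with lib 1) and $P_k, A_k$ (the same for a library $k$ that contains complex cells) are introduced here for illustration.

```latex
\[
A_1 \approx A_k \quad\text{and}\quad P_1 < P_k
\;\Longrightarrow\;
\frac{P_1}{A_1} < \frac{P_k}{A_k}
\]
```

This is a circuit-level comparison that holds to the extent that the areas are equal. Attributing the difference to the cell types themselves further assumes that the circuits differ mainly in their mix of complex and negative cells.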
Fig. 21 Simple cell count of VP2 where lib 4 is used (cell types: INV, NOR, NAND, AOI, OAI, MXI, XNOR, OR; target clocks: 8ns and 9ns)

Fig. 22 Delay vs. power curves for VP2 (x-axis: delay [ns]; y-axis: average power [uW]; one curve per library, Library 1 to Library 8)
Fig. 23 Delay vs. power curves for CMUDSP (x-axis: delay [ns]; y-axis: average power [uW]; one curve per library, Library 1 to Library 8)

Another interesting observation is that the ratio of cells with the lowest drive strength (xL) increases as the operating frequency of the circuit increases, as shown in Fig. 24. This result suggests that the lowest drive strength cells are needed to form longer buffer trees, even though they have higher capacitance, due to a larger active area, than cells with unit drive strength (x1).

Fig. 24 % of cells with xL size for CMUDSP (x-axis: delay [ns]; y-axis: cells with xL size [%]; one series per library, Lib 1 to Lib 8)

3.3.2 Controller circuits

Fig. 25 Target clock vs. delay curves of CAN (x-axis: target clock [ns]; y-axis: delay [ns]; one curve per library, Library 1 to Library 8)

Given the target clock range (0.5~6ns) used in the experiments, the target clock vs. delay curves of the CAN benchmark show a high plateau region as well as a low plateau region, as shown in Fig. 25. Unlike DSP circuits, the available delay range is narrow. The target clock vs. delay curves of the GPIO benchmark also show a narrow delay range. Therefore, it is hard to distinguish delay regions in controller circuits. As the clock becomes faster, the ratios of cells with higher drive strength and of cells with the lowest drive strength also increase. The decomposition of positive cells is observed as well, but it does not increase the total cell count and area as much as the decomposition of arithmetic cells does. Within the narrow delay range, the power (or area) vs. delay curves are irregular and scattered. Therefore, it is hard to determine which library produces the best quality circuits.

4. CONCLUSIONS AND FUTURE WORK

The automatic cell library generation flow was developed to generate a new cell library easily and without the inadvertent errors introduced when a cell library is generated manually. The parameterized cell methodology in the automatic flow makes it possible to generate a fluid cell library. Since the fluid cell library consists of custom cells with optimal sizes that can be generated once actual wire load parasitics are known after placement, it is expected to attain power and performance close to those of custom designs.

This work can be extended to generate a cell library for an advanced process. Although high performance integrated circuits often use an advanced process technology, designs based on standard cells use processes that lag by several generations because no cell library exists for the advanced process. A cell library in a Silicon on Insulator (SOI) process will be generated to support designs using the SOI process. It will then be examined how the SOI cell library improves the power and performance of circuits.

The study of cell library functional content shows that the largest library does not always produce the best result. For good overall quality in a datapath-dominated circuit, the cell library should be chosen by considering area efficiency for low-speed circuits and power efficiency for high-speed circuits. For controller-dominated circuits, it is hard to determine which library produces good quality circuits.
REFERENCES
[1] Ken Scott and Kurt Keutzer, "Improving Cell Library for Synthesis," Proc. of Custom Integrated Circuit Conference (CICC), pp. 128-131, 1994
[2] S. Gavrilov, A. Glebov, S. Pullela, S. C. Moore, A. Dharchoudhury, R. Panda, G. Vijayan, and D. T. Blaauw, "Library-Less Synthesis for Static CMOS Combinational Logic Circuits," Proc. IEEE Int. Conf. on Computer-Aided Design (ICCAD), pp. 658-662, 1997
[3] Gregory A. Northrop and Pong-Fei Lu, "A Semi-Custom Design Flow in High-Performance Microprocessor Design," Proc. of Design Automation Conference (DAC), pp. 426-431, 2001
[4] Miodrag Vujkovic and Carl Sechen, "Optimized Power-Delay Curve Generation for Standard Cell ICs," Proc. IEEE Int. Conf. on Computer-Aided Design (ICCAD), pp. 387-394, 2002
[5] K. Keutzer and E. Girczyc, "Panel: Cell libraries build vs buy; static vs. dynamic," Proc. of Design Automation Conference (DAC), pp. 341-342, 1999
[6] Teri Hike, McFaul and Karl Perrey, "Characterizing a Cell Library using ICCS," Proc. of ASIC Seminar and Exhibit, pp. p12/5.1-p12/5.4, 1990
[7] David S. Kung and Ruchir Puri, "Optimal P/N Width Ratio Selection for Standard Cell Libraries," Proc. IEEE Int. Conf. on Computer-Aided Design (ICCAD), pp. 178-184, 1999
[8] Prolific User Guide
[9] Binay Ackalloor and Dinesh Caitonde, "An overview of Library Characterization in Semi-Custom Design," Proc. of Custom Integrated Circuit Conference (CICC), pp. 305-312, 1998
[10] Anantha Chandrakasan, William J. Bowhill, and Frank Fox, Ed., "Design of High-performance Microprocessor Circuits," IEEE Press, New York, pp. 215-218, 2001
[11] Neil H. E. Weste and Kamran Eshraghian, Ed., "Principles of CMOS VLSI design," Addison-Wesley, pp. 317-325, 1994
[12] Design Compiler User Manual
[13] Star-Hspice Manual
[14] PowerArc User Guide
[15] Eric Lehman, Yosinori Watanabe, Joel Grodstein, and Heather Harkness, "Logic Decomposition during Technology Mapping," Proc. IEEE Int. Conf. on Computer-Aided Design (ICCAD), pp. 242-245, 1995
[16] Chi-Ying Tsui, Massoud Pedram, and Alvin M. Despain, "Technology Decomposition and Mapping Targeting Low Power Dissipation," Proc. of Design Automation Conference (DAC), pp. 68-73, 1993
[17] Randal E. Bryant, "Graph-Based Algorithms for Boolean Function Manipulation," IEEE Transactions on Computers, pp. 677-699, 1985