Yunbum Jung Dept. of Electrical Engineering and Computer Science University of Michigan
jungy@engin.umich.edu
ABSTRACT
As the operating frequency of circuits increases, the use of a fixed library faces a limit in generating high performance circuits. One of the ways to go beyond the limit of a fixed cell library is the use of a fluid cell library. The fluid cell library provides a customized drive strength of each cell that is not in the fixed cell library but is required for a fine circuit tuning. For the effective generation of a fluid cell library as well as a fixed cell library, an automated flow is applied to generate a standard cell library. This paper presents the procedure for automated standard cell library generation and an overview of cell characterization. It also examines how each logic function in a cell library affects an automated circuit design. This experiment shows that the target cell library should be selectively chosen for the good quality of a synthesized design.
1. INTRODUCTION
Although a standard cell methodology reduces the design effort in terms of time and cost, the performance of a synthesized design is very poor compared to that of a custom design. Many studies to improve the quality of a synthesized design have been performed [1, 2, 3, 4, 5], and it has been shown that simple modifications to the cell library significantly impact the performance of a synthesized design [1]. Especially in high performance applications, the use of a fixed cell library prevents fine device tuning for delay and power optimization [5]. As an alternative to a fixed cell library, the use of a fluid cell library suitable for circuit tuning was suggested in [3]. This kind of effort increases the need for automation in generating a cell library, since an automated flow for generating new cells easily creates the various cells required for circuit optimization. In generating a new cell library, accurate characterization of each cell is important since other tools that use these cells predict the behavior of circuits based on the characterized cell data.
Before mentioning the advantages of automated cell characterization, it seems proper to consider the disadvantages of manual cell characterization. Manual cell characterization requires a cell designer to create netlists and interactively run simulations. With this method, stimulus must be developed and applied to the cell being characterized. Once the simulation is complete, the data is extracted from each run. Then the cell designer inserts the data into the gate level models and the datasheets. This manual cell characterization is error-prone and requires tremendous effort [6].
A cell library generation process includes cell design, layout generation, and physical abstraction as well as cell characterization. Several designers may share these tasks. Therefore, if misunderstandings exist among designers, the manual method may lose consistency in the procedures. In order to reduce both the effort to generate a new standard cell library and the number of inadvertent errors introduced when these tasks are done manually, an automated flow is applied to generate a new cell library. A well-organized and automated flow provides consistency in procedures, increases the range of simulation capabilities at the cell characterization step, and minimizes the risk of errors.
2. LIBRARY
As shown in Fig. 1, the process of generating a standard cell library consists of four major steps. Most of the steps are carried out automatically. Netlist files in SPICE format are created at the cell design step. The layout, which is the physical implementation of the netlist, is generated in the layout step. Verification and parasitic extraction are also performed during the layout step. Stimulus generation, SPICE simulations, and data compilation are part of characterization. Physical abstraction of each cell for the place & route tool can be carried out in parallel with characterization. A physical abstract includes information about blockage layers, pin locations, and cell symmetry.
the falling/rising time due to the higher resistance. Therefore, each cell is restricted such that it does not have more than 4 series transistors.
Fig. 2 Parameterized Inverter template

The cells of a standard cell library are categorized into seven groups as follows.
Negative unate logic cells
Positive unate logic cells
Arithmetic cells
Sequential cells
Special cells
Inverted input cells
Low skew cells
Negative unate logic cells consist of the INV, NAND, NOR, AOI, OAI, and XNOR function families. Positive unate logic cells comprise the BUF, AND, OR, AO, OA, and XOR function families. The FADD (Full Adder) and HADD (Half Adder) function families compose the arithmetic cells. The DFF and LATCH function families are included in the sequential cells. MUX and Tri-state belong to the special cells. There are two more interesting groups: inverted input cells and low skew cells. Inverted input cells such as NAND2B and NOR3BB (B means inverted input) provide inverted inputs, effectively making the connection between the inverter and the logic function internal to the cell. Low skew cells are used for clock distribution schemes where low skew and high speed are primary concerns. The CLKINV and CLKBUF families belong to the low skew cells.
2.2 LAYOUT
Fig. 3 Layout step

As shown in Fig. 3, the layout phase consists of physical layout generation, layout verification, and parasitic extraction. ProTech, ProSpin, ProGen, and ProSticks from Prolific are used to generate the physical layout from the netlist in SPICE format. ProTech configures the fabrication technology specification, design styles, the cell template, and the layer information. ProSpin reads in SPICE netlists describing individual cells and produces corresponding .cel and .db files that are read by ProGen and ProSticks, respectively. The .cel file specifies the generators that should be called to create the final layout data and contains cell-specific information such as transistor sizes and node names. The .db file is used to generate the symbolic layout that is used by ProSticks. ProGen reads in the .cel file and invokes the proper generators to produce a loose physical layout. ProGen then compacts this initial layout to produce a final cell that is as small as possible [8]. Fig. 4 shows an example of a layout generated by the Prolific suite.

Although ProGen tries to satisfy all design rules and layout constraints, it sometimes generates a cell with violations such as a cell height violation or design rule errors. Thus, the verification step is an essential part of layout generation to ensure a correct layout. Calibre from Mentor Graphics is used for accurate verification. Calibre performs layout vs. schematic (LVS) checking as well as design rule checking (DRC).

Once design violations are found, ProSticks can be used to fix them. ProSticks provides a graphical user interface for the symbolic layout. Fig. 5 shows an example of a symbolic layout. Simple routing rearrangement and transistor relocation on the symbolic layout can correct most errors in simple cells. In complex cells such as MUX and ADDER, however, a height violation is hard to fix. Special options such as poly contact merging and diffusion/metal-1 contact wires are used to obtain more routing space. Aggressive poly contact merging eliminates unnecessary metal-1 wires between poly contacts. Diffusion nodes can be connected to the power or ground rails with a diffusion contact and a metal-1 wire, which allows other metal-1 wires to pass over the active region.

Fig. 5 Symbolic AO21X2 layout

The last step of the layout phase is parasitic extraction. This is the process of creating an electrical model of the physical interconnections. The physical interconnect does not behave as an ideal wire. Instead, it behaves like a network of capacitances, inductances, and resistances, which can dominate circuit behavior. Power or timing analysis of a design is of little use without the parasitic network. Xcalibre from Mentor Graphics is used for parasitic extraction. The original SPICE netlists combined with the parasitic network are used for characterization simulation.
2.3 CHARACTERIZATION
Characterization enables a designer to abstract timing and power models. This abstraction shifts a design from the transistor level to the gate level, enabling synthesis, floor planning, place & route, and delay & power calculation [9].
recovery constraint, and removal constraint are modeled for timing characterization. Both output transition time and propagation time are required by gate-level synthesis and delay calculation tools [9]. In those tools, the output transition time (together with the interconnect parasitic data, if available) is used to estimate the input slew rates of successive cells, and the propagation time of each cell is extracted from the cell library table based on input slew rate and load. Consequently, the delay between two nodes in a design can be calculated from the propagation time and output transition time of each cell on the path between the two nodes.

Since a signal arrives at an input pin with a ramp time and a sequential cell takes some time to latch the data signal correctly, the data signal arriving at a sequential cell has to be stable before and after clocking, as defined by the setup constraint and the hold constraint, respectively. For edge-triggered sequential cells, setup and hold constraints are measured relative to the clock transition to active. If an unstable data signal arrives at an input near the clock transition to active, the edge-triggered sequential cell may evaluate a wrong output or go through a metastable state. The unintended data is held until the next clock transition to active. For level-sensitive sequential cells, on the other hand, setup and hold constraints are measured relative to the clock transition to inactive. If an unstable data signal arrives at an input near the clock transition to inactive, the unintended data may be held while the clock is in the inactive state, as in the case of the edge-triggered sequential cell. In a design with sequential cells, both constraints limit the delay of the combinational circuits placed between two sequential cells for a given operating frequency. Fig. 7 shows consecutive pipeline stages formed by two edge-triggered sequential cells, comprising a long path and a short path [10].
Fig. 6 Characterization in terms of input slew rate and load capacitance

As shown in Fig. 6, a cell is usually characterized in terms of the input slew rate tin and the output capacitive load CL for given supply voltage, temperature, and process values. This traditional approach is challenged in deep sub-micron processes, where a single capacitance is no longer enough to represent a load that acts like an RC network. Nevertheless, the auto-characterization tool is implemented based on the traditional definition.

Another concern of cell characterization is stimulus generation. Current standard cell libraries can handle only a single switching input; that is, they cannot deal with the multiple switching inputs that are frequently encountered in real operation. This is another shortcoming of cell characterization. Although an exhaustive enumeration of all input states satisfies the stimulus requirement, it wastes simulation time. Therefore, a minimal set of input vectors is selected to reduce simulation time.

The characterized cells have several common attributes such as output transition times, propagation delays, internal switching power, leakage power, input pin capacitances, and cell area. Sequential cells additionally require characterization of relative signals. Relative signals are signals that are timing-critical with respect to another signal's state. Relative signal constraints include setup, recovery, hold, and removal times [6].
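A characterization over input slew rate and load capacitance is typically stored as a two-dimensional table, and a delay calculator interpolates between the characterized points. The following is a minimal sketch of such a lookup using bilinear interpolation. The function names, the table layout, and the numeric values are illustrative assumptions, not taken from any particular library format or tool.

```python
from bisect import bisect_right


def _bracket(axis, x):
    """Index i of the segment [axis[i], axis[i+1]] used for x (clamped to the table)."""
    i = bisect_right(axis, x) - 1
    return max(0, min(i, len(axis) - 2))


def lookup(slews, loads, table, slew, load):
    """Bilinear interpolation of table[i][j], indexed by (slews[i], loads[j]).

    Points outside the characterized range are linearly extrapolated
    from the nearest table segment.
    """
    i = _bracket(slews, slew)
    j = _bracket(loads, load)
    fs = (slew - slews[i]) / (slews[i + 1] - slews[i])
    fc = (load - loads[j]) / (loads[j + 1] - loads[j])
    low = table[i][j] * (1 - fc) + table[i][j + 1] * fc
    high = table[i + 1][j] * (1 - fc) + table[i + 1][j + 1] * fc
    return low * (1 - fs) + high * fs


# Hypothetical 2x2 propagation-time table (arbitrary units).
slews = [0.0, 1.0]
loads = [0.0, 2.0]
delay_table = [[1.0, 3.0],
               [2.0, 4.0]]

mid_delay = lookup(slews, loads, delay_table, slew=0.5, load=1.0)
corner_delay = lookup(slews, loads, delay_table, slew=1.0, load=2.0)
```

The same lookup can serve for output transition time tables, which is how the slew at the next cell's input would be estimated in this sketch.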
On the long path, the maximum time available for evaluation of combinational logic in one clock period, tCLmax, is given by

$$t_{CLmax} = t_{CYeff} - (t_{SU2} + t_{CQ1})$$

where tCYeff is the effective cycle time, tSU2 is the setup constraint of the second sequential cell, and tCQ1 is the Clock to Q propagation time of the first sequential cell. On the short path, if the next state value from the second sequential cell reaches the first sequential cell during the hold time of the first sequential cell, the next state value will corrupt the current state value of the first sequential cell. The minimum propagation time, tCLmin, through combinational logic on the short path is expressed by

$$t_{CLmin} = t_{SK} + (t_{H1} - t_{CQ2})$$

where tSK is the clock skew between the clocks of the two sequential cells, tH1 is the hold constraint of the first sequential cell, and tCQ2 is the Clock to Q propagation time of the second sequential cell.

Recovery and removal constraints describe the timing requirements on control signals, such as preset or clear, with respect to the clock signal. A sequential cell needs some time to leave the influence of the control signal after the control signal becomes inactive. This time is referred to as the recovery constraint. Therefore, the control signal should become inactive at least the recovery constraint before clocking in order to ensure that the clocking is effective. On the other hand, the removal constraint is the minimum time for the control signal to influence the latched value. If a control signal becomes inactive before the removal constraint, the control signal will not affect the operation of the sequential cell.

In the simulations for the timing attributes mentioned above, the minimal stimulus comprises input vectors that cause an output transition, although the necessary input vectors differ for each attribute. Input vectors can be subdivided into data signals, control signals, and clock. More details of the stimulus for each attribute are described in the following sections.
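As a small worked example of the long-path and short-path bounds, the sketch below reads the two relations as tCLmax = tCYeff - (tSU2 + tCQ1) and tCLmin = tSK + (tH1 - tCQ2). The function names and all numeric values (in ns) are hypothetical and chosen only for illustration.

```python
def max_logic_delay(t_cy_eff, t_su2, t_cq1):
    """Long path: largest combinational delay that still meets setup."""
    return t_cy_eff - (t_su2 + t_cq1)


def min_logic_delay(t_sk, t_h1, t_cq2):
    """Short path: smallest combinational delay that still meets hold."""
    return t_sk + (t_h1 - t_cq2)


# Hypothetical values in ns.
t_cl_max = max_logic_delay(t_cy_eff=5.0, t_su2=0.25, t_cq1=0.5)
t_cl_min = min_logic_delay(t_sk=0.5, t_h1=0.25, t_cq2=0.5)

print(f"combinational delay must lie between {t_cl_min} ns and {t_cl_max} ns")
```

With these numbers the combinational logic between the two sequential cells has a window from 0.25 ns to 4.25 ns. A larger setup constraint or Clock to Q time shrinks the window from above, while a larger skew or hold constraint shrinks it from below.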
voltage range to 90% of the voltage range). The propagation time is measured between predetermined delay threshold value of input signal and that of output signal (e.g., 50% of the voltage range of input signal to 50% of the voltage range of output signal). A data transition that can cause an output transition is the required condition of the stimulus for both attributes. In sequential cells, the Clock to Q propagation time can be obtained from the clock transition. In the level-sensitive clocking sequential cells such as LATCH, the D to Q propagation time can also be obtained with the clock in the active state.
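The threshold-based measurements described above can be sketched on sampled waveforms as follows. This is only an illustration: the function names and the sample waveform are hypothetical, and the 10% lower transition threshold is an assumed common choice rather than a value stated here. Crossing times are found by linear interpolation between samples.

```python
def crossing_time(times, values, level, rising):
    """First time the sampled waveform crosses `level` in the given direction."""
    for i in range(1, len(times)):
        v0, v1 = values[i - 1], values[i]
        crossed = (v0 < level <= v1) if rising else (v0 > level >= v1)
        if crossed:
            frac = (level - v0) / (v1 - v0)
            return times[i - 1] + frac * (times[i] - times[i - 1])
    raise ValueError("waveform never crosses the requested level")


def propagation_time(times, vin, vout, vdd, in_rising, out_rising, threshold=0.5):
    """Delay from the input threshold crossing to the output threshold crossing."""
    t_in = crossing_time(times, vin, threshold * vdd, in_rising)
    t_out = crossing_time(times, vout, threshold * vdd, out_rising)
    return t_out - t_in


def transition_time(times, v, vdd, rising, lo=0.1, hi=0.9):
    """Time spent between the lo and hi thresholds (lo=10% is an assumption)."""
    t_lo = crossing_time(times, v, lo * vdd, rising)
    t_hi = crossing_time(times, v, hi * vdd, rising)
    return (t_hi - t_lo) if rising else (t_lo - t_hi)


# Hypothetical inverter waveforms: input rises, output falls (Vdd = 1.0).
t = [0.0, 1.0, 2.0, 3.0, 4.0]
vin = [0.0, 1.0, 1.0, 1.0, 1.0]
vout = [1.0, 1.0, 0.5, 0.0, 0.0]

t_pd = propagation_time(t, vin, vout, vdd=1.0, in_rising=True, out_rising=False)
t_fall = transition_time(t, vout, vdd=1.0, rising=False)
```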
Fig. 9 illustrates the setup and hold constraints of an edge-triggered sequential cell where clock is high active. Fig. 10 illustrates the setup and hold constraints of a level-sensitive sequential cell where clock is high active.
Fig. 11 illustrates recovery and removal constraint of an edge-triggered sequential cell where clock is high active and control is low active.
Bisection method
In order to measure relative signal characterization, the bisection method is used. Bisection is a method of optimization that employs a binary search to find the value of an input variable associated with a goal value of an output variable. This method uses a binary search to locate the output variable goal value within a search range of the input variable by iteratively halving that range to converge rapidly on the target value. The measured value of the output variable is compared with the goal value every iteration [13].
Fig. 12 Setup constraint search using bisection

Fig. 12 shows how to determine the setup constraint with the bisection method, where the goals are the output transition and the allowable Clock to Q propagation time. To start the binary search, a lower boundary and an upper boundary are specified. Data transition 1 at the lower boundary is early enough to cause a good output signal. Data transition 2 at the upper boundary is too late to change the output signal. This means that the candidate for the setup constraint lies between the lower and upper boundaries. Consequently, the bisection algorithm tests a data transition at the mid-point between the two boundaries. Data transition 3 at the mid-point changes the output signal but causes a long propagation time; that is, data transition 3 does not meet the goals. The bisection algorithm therefore sets the mid-point as the new upper boundary. Given the new range, the bisection algorithm tests a data transition at the new mid-point. If the output satisfies the goals, the new mid-point is set as the new lower boundary. Otherwise, the mid-point is set as the new upper boundary. The bisection algorithm then tests a data transition at the mid-point of the new range again, and iterates in this way until the binary search reaches a termination criterion. Data transition 4 is the latest data transition that satisfies the goals. Therefore, the setup constraint, tSU, is given by

$$t_{SU} = t_2 - t_1$$
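The search described above can be sketched in a few lines. In this sketch the SPICE run is replaced by a hypothetical pass/fail function, and the setup constraint is taken as the clock edge time minus the latest passing data transition time, which is an assumption about how the two time points are defined. All names and values are illustrative.

```python
def bisect_latest_pass(passes, lower, upper, tol=1e-3):
    """Latest data transition time (within tol) that still meets the goals.

    passes(t) must be True at `lower` (early enough) and False at `upper`
    (too late); the range is halved until it is narrower than tol.
    """
    if not passes(lower) or passes(upper):
        raise ValueError("boundaries must bracket the pass/fail transition")
    while upper - lower > tol:
        mid = 0.5 * (lower + upper)
        if passes(mid):
            lower = mid   # goals met: mid-point becomes the new lower boundary
        else:
            upper = mid   # goals missed: mid-point becomes the new upper boundary
    return lower


# Hypothetical stand-in for a simulation: the cell latches correctly only if
# data arrives at least TRUE_SETUP before the clock edge at T_CLOCK (ns).
T_CLOCK = 10.0
TRUE_SETUP = 0.3


def toy_passes(t_data):
    return T_CLOCK - t_data >= TRUE_SETUP


latest_ok = bisect_latest_pass(toy_passes, lower=8.0, upper=10.0, tol=1e-4)
t_su = T_CLOCK - latest_ok
```

Because the returned time always lies on the passing side, the resulting constraint is slightly pessimistic, by at most the tolerance. Each iteration costs one simulation, so a 2 ns range with a 0.1 ps tolerance needs about 15 runs.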
Dynamic power is the power dissipated when a circuit is active. Dynamic power is divided into switching power and internal power. Switching power results from charging/discharging of load capacitance. Switching power is calculated by a gate level power analysis tool where the interconnect parasitic is known [9]. Therefore, switching power is excluded from power characterization of cells. While input or output signals switch, power is also dissipated by internal capacitive charging/discharging and short circuit dissipation. Since this kind of power is dissipated in the cell during signal switching, it is called internal power.
where I(t) is current at power node, Vdd is source voltage, and CL is load capacitance. Consequently, it is no wonder that negative power values are found in the power-annotated library that is generated by PowerArc. Internal power is overestimated at rising output transition and underestimated at falling output transition. Nevertheless, these internal power values are acceptable if internal power dissipation is considered for a given period.
[Figure: current components of a cell: Ileakage, ISC, Iinter-node (with Cinter-node), and ICL (with CL)]
Another possible approach to improving the quality of the final circuit is to choose the logic functions in the target cell library wisely. In order to examine how each logic function group mentioned in section 2.1 affects the quality of the circuit and how the synthesis tool selects cells to satisfy constraints, a set of benchmark circuits is synthesized, placed & routed, and resynthesized using the libraries shown in Table 1. These libraries are formed by selectively choosing logic function groups from the Artisan standard cell library in TSMC 0.18-micron technology.
Table 1
Logic function groups: Tri-state, Sequential, Negative unate, Positive unate, Inverted input, Arithmetic, Mux, Low skew
Lib 1: x x x
Lib 2: x x x x
Lib 3: x x x x
Lib 4: x x x x x x
Lib 5: x x x x x x x
Lib 6: x x x x
Lib 7: x x x x x
Lib 8: x x x x x x x x
As shown in Fig. 15, initial synthesis and place & route are carried out to obtain an initial layout. A wire load model is then extracted from the initial layout, and the cell library is updated with the extracted wire load model. With the updated cell library, synthesis and place & route are performed and the layout is generated again. Parasitics are extracted from this layout and used as input for re-optimization with respect to cell size. During the re-optimization, resized cells replace the cells chosen at the synthesis step. To provide input for timing and power analysis, the resynthesized circuit is placed & routed again and parasitics are extracted from the final layout. Design Compiler from Synopsys is used for synthesis. Silicon Ensemble from Cadence is used for place & route and clock tree generation. HyperExtract from Cadence is used for wire load model extraction and parasitic extraction. Library Compiler from Synopsys is used for updating the cell library with the extracted wire load model. PrimeTime from Synopsys is used for timing analysis. NanoSim from Synopsys is used for power analysis.
Four benchmark circuits were used for the experiments. In order to observe how each library affects datapath-dominated circuits and controller-dominated circuits, two of them (VP2 and CMUDSP) are Digital Signal Processor (DSP) designs and the other two (GPIO and CAN) are controller designs. Arithmetic functions are heavily used in DSP designs, while they are used less frequently in controllers. Using the standard cell design flow described in section 3.1, each benchmark circuit is designed with various target clocks and different libraries.
3.3. Results
Power (or area) vs. delay plots are an effective way to compare cell libraries, since the efficiency of achieving a particular delay is important [4]. In order to make power (or area) vs. delay plots, a set of benchmark circuits is designed using different libraries within a target clock range. Through timing analysis and power analysis, power (or area) vs. delay plots are obtained. From this study, several interesting phenomena are found. The largest library, lib 8, does not always produce the best result. Instead, the choice of the best library depends on the operation speed of a circuit and what kind of circuit is designed.
Fig. 17 Target clock vs. delay for CMUDSP

As shown in Figs. 18 and 19, lib 4, which includes complex cells such as arithmetic and mux cells, shows good area efficiency in the high delay region, while adding inverted input cells (lib 5 and lib 8) somewhat degrades area efficiency in that region. However, as the delay decreases, the area of designs built with lib 4 increases quickly. Consequently, lib 4 results in worse area efficiency than lib 5 or lib 8 in the middle and low delay regions. This means that inverted input cells are needed for good area efficiency at high speed.
Fig. 19 Delay vs. area curves for CMUDSP (cell area vs. delay [ns], Libraries 1 to 8)

As the delay approaches the lowest achievable delay, all libraries show similar area efficiency in Fig. 18. This can be explained by the decomposition of complex cells. As expected, the synthesis tool basically increases the ratio of cells with higher drive strength in the circuit as the target clock frequency increases. When this simple increase reaches its limit, complex cells such as positive unate and arithmetic cells are additionally decomposed into relatively simple cells. These decompositions come at the cost of a sudden increase in total cell count as well as an increase in total cell area. Therefore, the area benefits of complex cells are reduced. These decompositions of complex cells are observed in the VP2 benchmark. As the target clock changes from 9ns to 8ns, the total cell count increases from 4208 to 7148. Fig. 20 shows the complex cell count and Fig. 21 shows the relatively simple cell count at both target clocks (8ns and 9ns).
In the low delay region, lib 1 produces the most power-efficient circuits, as shown in Fig. 22, while all libraries tend to build circuits with similar area efficiency. In the high delay region, on the other hand, the power efficiency is similar across all libraries, as shown in Fig. 23. This result indicates that the power density of complex cells such as arithmetic and mux cells is higher than that of negative unate cells. Therefore, power should be the main consideration for high-speed designs, while area efficiency should be the main consideration for low-speed designs.
Fig. 23 Delay vs. power curves for CMUDSP (average power [uW] vs. delay [ns], Libraries 1 to 8)

One more interesting observation is that the ratio of cells with the lowest drive strength (xL) increases as the operating frequency of the circuit increases, as shown in Fig. 24. This result suggests that the lowest drive strength cells are needed to form longer buffer trees, even though they have higher capacitance, due to a larger active area, than cells with unit drive strength (x1).

Fig. 25 Target clock vs. delay curves of CAN

Given the target clock range (0.5~6ns) used in the experiments, the target clock vs. delay curves of the CAN benchmark show a high plateau region as well as a low plateau region, as shown in Fig. 25. Unlike in DSP circuits, the available delay range is narrow. The target clock vs. delay curves of the GPIO benchmark also show a narrow delay range. Therefore, it is hard to distinguish delay regions in controller circuits. The ratio of cells with higher drive strength and of cells with the lowest drive strength also increases as the clock becomes faster. Decomposition of positive unate cells is observed as well. However, this decomposition does not increase the total cell count and area as much as the decomposition of arithmetic cells does. Within the narrow delay range, the power (or area) vs. delay curves are irregular and scattered. Therefore, it is hard to determine which library produces the best quality circuits.
REFERENCES
[1] Ken Scott and Kurt Keutzer, "Improving Cell Library for Synthesis", Proc. of Custom Integrated Circuit Conference (CICC), pp. 128-131, 1994
[2] S. Gavrilov, A. Glebov, S. Pullela, S. C. Moore, A. Dharchoudhury, R. Panda, G. Vijayan, and D. T. Blaauw, "Library-Less Synthesis for Static CMOS Combinational Logic Circuits", Proc. IEEE Int. Conf. on Computer-Aided Design (ICCAD), pp. 658-662, 1997
[3] Gregory A. Northrop and Pong-Fei Lu, "A Semi-Custom Design Flow in High-Performance Microprocessor Design", Proc. of Design Automation Conference (DAC), pp. 426-431, 2001
[4] Miodrag Vujkovic and Carl Sechen, "Optimized Power-Delay Curve Generation for Standard Cell ICs", Proc. IEEE Int. Conf. on Computer-Aided Design (ICCAD), pp. 387-394, 2002
[5] K. Keutzer and E. Girczyc, "Panel: Cell libraries build vs buy; static vs. dynamic", Proc. of Design Automation Conference (DAC), pp. 341-342, 1999
[6] Teri Hike, McFaul and Karl Perrey, "Characterizing a Cell Library using ICCS", Proc. of ASIC Seminar and Exhibit, pp. p12/5.1-p12/5.4, 1990
[7] David S. Kung and Ruchir Puri, "Optimal P/N Width Ratio Selection for Standard Cell Libraries", Proc. IEEE Int. Conf. on Computer-Aided Design (ICCAD), pp. 178-184, 1999
[8] Prolific User Guide
[9] Binay Ackalloor and Dinesh Gaitonde, "An overview of Library Characterization in Semi-Custom Design", Proc. of Custom Integrated Circuit Conference (CICC), pp. 305-312, 1998
[10] Anantha Chandrakasan, William J. Bowhill, and Frank Fox, Eds., "Design of High-performance Microprocessor Circuits", IEEE Press, New York, pp. 215-218, 2001
[11] Neil H. E. Weste and Kamran Eshraghian, "Principles of CMOS VLSI Design", Addison-Wesley, pp. 317-325, 1994
[12] Design Compiler User Manual
[13] Star-Hspice Manual
[14] PowerArc User Guide
[15] Eric Lehman, Yosinori Watanabe, Joel Grodstein, and Heather Harkness, "Logic Decomposition during Technology Mapping", Proc. IEEE Int. Conf. on Computer-Aided Design (ICCAD), pp. 242-245, 1995
[16] Chi-Ying Tsui, Massoud Pedram, and Alvin M. Despain, "Technology Decomposition and Mapping Targeting Low Power Dissipation", Proc. of Design Automation Conference (DAC), pp. 68-73, 1993
[17] Randal E. Bryant, "Graph-Based Algorithms for Boolean Function Manipulation", IEEE Transactions on Computers, pp. 677-699, 1985