School of Engineering
VLSI Lab
Data Driven
Clock Gating
Academic Advisor: Prof. Shmuel Wimer
Instructor: Mr. Moshe Doron
Industry correspondent: Mr. Roey Mioni
Dov Gropper
Dvir Shasha
Final Fourth Year Project
Computer Engineering
Table of Contents
Main Project Goals
Motivation
Theory
Design Flow
  Design:
  Simulation environment:
  Iterative Perfect Matching Algorithm (IPM):
  Clock gating Implementation:
Hardware and Design Components
Problems and Solutions
Direct Memory Access Controller
  Behavior
  System level
  The block diagram of the DMA controller's state machine:
  The Design:
  Top design, with Verification diagram:
Results
  The SpyGlass Results:
  Result review:
Conclusions
References and Sources
Appendices
Motivation
The increasing demand for low power mobile computing and consumer electronics
products has refocused VLSI design in the last two decades on lowering power and
increasing energy efficiency. Power reduction is treated at all design levels of VLSI chips,
from the architecture through block and logic levels, down to gate-level circuit and
physical implementation. One of the major dynamic power consumers is the system clock
signal, typically responsible for up to 50% of the total dynamic power consumption. Clock
network design is a delicate procedure, and is therefore done in a very conservative
manner under worst-case assumptions. It incorporates many diverse aspects, such as the
selection of sequential elements, control of the clock skew, and the choice of topology
and physical implementation of the clock distribution network.
Theory
Clock gating
Several techniques to reduce the dynamic power have been developed, of which clock
gating is predominant. Ordinarily, when a logic unit is clocked, its underlying sequential
elements receive the clock signal regardless of whether or not they will toggle in the next
cycle.
Clock enabling signals are usually introduced by designers during the system and clock
design phases, where the inter-dependencies of the various functions are well
understood. In contrast, it is very difficult to define such signals at the gate level,
especially in control logic, since the inter-dependencies among the states of various
flip-flops depend on automatically synthesized logic. There is a big gap between block
disabling that is driven from the HDL definitions, and what can be achieved with data
knowledge regarding the flip-flops' activities and how they are correlated with each other.
The research presents an approach to maximize clock disabling at the gate level, where
the clock signal driving a flip-flop is disabled (gated) when the flip-flop's state is not subject
to a change in the next clock cycle.
Clock gating does not come for free. Extra logic and interconnects are required to
generate the clock enabling signals, and the resulting area and power overhead must be
considered. In the extreme case, each clock input of a flip-flop can be disabled
individually, yielding maximum clock separation. This, however, results in high overhead.
Thus, the clock disabling circuit is shared by a group of several flip-flops in an attempt to
reduce the overhead.
On the other hand, such grouping may lower the disabling effectiveness, since the clock
will be disabled only when the inputs to all the flip-flops in a group do not change. It is
therefore beneficial to group flip-flops whose switching activities are highly correlated and
to derive a joint enabling signal for them.
This requires gathering statistical information about the flip-flops' activity using simulations and
statistical analysis.
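The grouping trade-off can be illustrated with a small Python sketch. It is not part of the project's tooling: the function name and the per-cycle toggle vectors are made up for illustration. It assumes that a shared gater can block the clock only in cycles where no flip-flop in the group toggles, as described above.

```python
def gated_fraction(toggle_vectors):
    """Fraction of cycles in which a shared gater can block the clock.

    toggle_vectors: equal-length lists of 0/1, one list per flip-flop,
    where 1 means the flip-flop changes state in that cycle.
    The group clock can be disabled only in cycles where
    no flip-flop in the group toggles.
    """
    cycles = len(toggle_vectors[0])
    idle = sum(1 for c in range(cycles)
               if not any(v[c] for v in toggle_vectors))
    return idle / cycles


# Hypothetical activity of three flip-flops over 8 cycles.
ff_a = [1, 0, 0, 0, 1, 0, 0, 0]
ff_b = [1, 0, 0, 0, 1, 0, 0, 0]   # toggles together with ff_a
ff_c = [0, 1, 0, 1, 0, 0, 1, 0]   # toggles at different times

correlated = gated_fraction([ff_a, ff_b])    # 6 idle cycles out of 8
uncorrelated = gated_fraction([ff_a, ff_c])  # 3 idle cycles out of 8
```

In this toy data, pairing the two correlated flip-flops keeps the clock gated for 75% of the cycles, while pairing uncorrelated ones drops this to 37.5%, even though each flip-flop on its own is idle most of the time.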
Another issue that influences the effectiveness of this suggested technique is the fan-out
of the gater. The theory presents a formula for calculating the optimal fan-out of the gater,
referred to as k:
The graph above shows the normalized power net savings per flip-flop obtained by
adaptive gating at first level of clock tree in the equation above. The saving is compared
to the non-gated situation. The optimal fan-out is marked for each toggling probability:
Using the statistical information gathered and the optimal fan-out, we could attain groups
of matching flip-flops for the clock gating.
Design Flow
Design:
The design flow begins with a design in RTL. It is important to begin with a design that
has been proven to work properly. The design must not include any IPs (intellectual
property) or RTL sources that are not visible to the user, and therefore cannot be edited.
At this point, the design flow supports implementation for a single clock domain.
Moreover, the sequential and combinational logic must be separated in the RTL in order
for the scripts to run properly.
Simulation environment:
In this stage, simulations on the RTL are performed and statistical information is gathered
for analysis. The simulation must represent a typical use of the design, so that the
statistical information is realistic. Currently, one simulation per design is supported.
The simulation runs with Cadence's SimVision.
The Simulation environment steps:
- Add tracing code to the design. This is done by running the program ftrc.exe.
  Before running this program, the user must modify the file inputs.rti. In the file, the user
  sets the following attributes:
  - The name of the design files list, including its extension (*.vc).
  - Whether the output design is written as one file or as multiple files.
- Add tracing code to the test bench manually. The code must be added before the
  DUT instantiation.
- At this point, the simulation can run.
The simulation outputs two files:
- Activities.rpt - the report file contains the active flip-flops per time, in
  milliseconds.
- FF_lists.rpt - the report file contains a list of the flip-flops in the design.
Cadence's SimVision:
SimVision is the waveform viewer in the Cadence EDA suite. We mainly used it for
behavioral design verification before FPGA implementation. Because of the RTL changes
we made, it was necessary to verify our design before applying the Data Driven
Clock-Gating design flow. After applying the flow, we used it to make sure the design still
had exactly the same behavior. We used a test bench with two OK signals that indicated
proper behavior of the design. It was crucial to use the same test bench on both
designs, so that we could be certain that the two designs really had the same
behavior.
The problem was solved by moving the design flow to RTL, and allowing the Altera
Quartus to synthesize the design without these limitations.
In addition, this change also shortened the runtime of the entire design flow.
The Logical separation Problem
In the design flow, the clock gating implementation step requires that the design's
sequential and combinational logic be separated in the RTL.
Because of that, we separated the sequential and combinational logic of the DMA controller.
It is not ideal to change the original design in order to run the flow.
However, the tools are still under development and will be more versatile in the future.
An explanation of these types of logic:
Combinational logic - circuits that implement Boolean functions. The outputs of these circuits are
functions of the current inputs only. An example of combinational logic:
Sequential logic - Like a combinational logic circuit, a sequential logic circuit has inputs
and outputs. However, the output depends on the state of an FSM as well as on the inputs.
Furthermore, the circuit is driven by a clock.
An example of sequential logic:
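The distinction can also be sketched as a small behavioral model in Python. This is only an illustration and not the project's RTL: the majority function and the toggle flip-flop are examples chosen here, and each call to `clock` stands in for one rising clock edge.

```python
def majority(a, b, c):
    """Combinational: the output is a function of the current inputs only."""
    return (a & b) | (a & c) | (b & c)


class ToggleFlipFlop:
    """Sequential: the output depends on stored state, updated per clock."""

    def __init__(self):
        self.q = 0  # stored state, cleared at reset

    def clock(self, t):
        """Model one rising clock edge; toggle the state when t is 1."""
        if t:
            self.q ^= 1
        return self.q


ff = ToggleFlipFlop()
outputs = [ff.clock(t) for t in (1, 0, 1, 1)]
```

Calling `majority` twice with the same inputs always gives the same result, whereas the flip-flop returns different outputs for the same input `t = 1` depending on its stored state.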
Behavior
The DMAC is an integral part of the vendor-specific Graphics-On-Key (GOK) USB2.0
Device. The Device is dedicated to the USB Communication Channel. It has the potential of
being integrated into the Protocol Engine (PE) Device. The function of the DMAC, within the
GOK Device, is to transfer data between the USB2.0 Protocol Engine Receive/Transmit
(RX/TX) Packet Buffers and the Device Animation Graphics Engine (AGE) Function
Endpoints, in response to PE service requests. The DMAC is the only Bus Master in the
system. It is pre-configured to perform the required data transfers to and from the AGE
Application Function Core. The DMAC is capable of word gather-scatter and supports
a system data bus width of up to 48 bits (6 bytes) and an address bus of up to 24 bits
(16 Mbytes address range).
Flyby and gather-scatter data transfer modes are supported, but memory-to-memory
transfers are not.
System level
The USB 2.0 Device DMAC is pre-programmed (ROM), to perform the required data
transfers to and from the AGE Application Function Core. The DMAC Configuration
Memory contains the necessary information to access any Endpoint Buffer (Memory
or Register Files), in the AGE Core.
Notice that the flow splits left and right for the two directions: Rx path, and Tx path. Inside
each direction there are more splits, for different data sizes.
The Design:
This is a block diagram of the DMAC design. It is constructed from a Data Interface Unit
(DIU), a Finite State Machine unit and a Configuration ROM. On the left is the interface with
the Protocol Engine. On the right is the interface with the System Bus and the Function.
The next stage was to create an environment that would allow us to visually verify the
design on an FPGA board.
This block diagram represents the design that was implemented on the FPGA board.
In addition to the DMA controller, we used a stimuli ROM, addressed by an 8-bit counter, to
emulate data arriving from the Protocol Engine or from the AGE.
To confirm the correctness of the data transfer, we used a 77-bit comparator and a monitor
ROM. The comparator compared the transferred data to the expected result stored in the
monitor ROM and indicated, using two LEDs, whether the data was transferred correctly and
whether the DMA control signals were in the correct state.
The components:
- 8-bit counter: a regular 8-bit counter. On each clock cycle the count is increased by 1. The
  output returns to 0x00 after reaching 0xFF or on reset.
- Stimuli ROM: a Read Only Memory component that contains the data that is
  driven onto the inputs of the DMAC. It is made of 57-bit words. It receives the
  address from the 8-bit counter as an input.
- Monitor ROM: a Read Only Memory component that contains the data that should
  be the output of the DMAC according to the input address. It receives the address
  from the 8-bit counter as an input.
- 77-bit Comparator: a unit that compares the expected data (from the monitor ROM)
  to the collected data (from the DMAC). It splits the comparison into two: data, and
  control signals. If the expected and collected values are identical, both LEDs should be
  on.
And so, if both LEDs are on while the design is running, the design is working properly. It is important
to note that during reset, only the data OK LED is on.
After debugging the test bench, we achieved two working designs: with and without
Data Driven Clock Gating.
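The verification scheme described above can be modeled in a few lines of Python. This is a behavioral sketch only: `run_check`, `toy_dut` and the ROM contents are hypothetical, the DUT is treated as having no latency, and the actual split of the 77 compared bits between data and control is not modeled.

```python
def run_check(dut, stimuli_rom, monitor_rom):
    """Step an 8-bit counter through both ROMs and compare DUT outputs.

    dut: function mapping a stimulus word to a (data, control) pair.
    stimuli_rom / monitor_rom: 256-entry lists indexed by the counter.
    Returns one (data_ok, control_ok) pair per clock, like the two LEDs.
    """
    leds = []
    count = 0
    for _ in range(256):
        data, control = dut(stimuli_rom[count])
        exp_data, exp_control = monitor_rom[count]
        leds.append((data == exp_data, control == exp_control))
        count = (count + 1) & 0xFF  # wraps from 0xFF back to 0x00
    return leds


# Hypothetical stand-in for the DMAC: passes data through, control = parity.
def toy_dut(word):
    return word, bin(word).count("1") & 1


stimuli = list(range(256))
monitor = [toy_dut(w) for w in stimuli]
leds = run_check(toy_dut, stimuli, monitor)

# Corrupt one expected data word: only the data LED drops for that address.
monitor[5] = (0xFFFF, monitor[5][1])
leds_bad = run_check(toy_dut, stimuli, monitor)
```

Because the same stimuli and monitor contents are reused for both runs, the same harness can check the design with and without gating, which mirrors the requirement that both designs be verified with an identical test bench.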
Results
In parallel to our work this year, our flow was run on designs at CEVA.
The VLSI department at CEVA already used clock gating in their design flow.
Their gaters are based on control signals. That means that if the entire clock domain is
not functioning at a given time, the clock signal is blocked and is not forwarded to the
specific clock domain.
The clock gaters we suggest in the design flow are based on data and statistical
information.
The data driven clock gaters were added to the design in addition to the control driven
clock gaters. This limited the achievable power reduction, because the
design was already power-optimized.
To demonstrate the potential of the design flow, an activity test was performed on the DUT. In this test,
flip-flops that did not need a clock signal were sampled:
The table above shows that almost 98% of the flip-flops are active during only 0-5% of the entire
test. This means that there is potential for saving power by implementing the technique on
the design. However, that is not enough to ensure that saving is possible. It is also
necessary to show that many flip-flops have high correlation between their clock-toggle
vectors, in order to gate them together. The following graph shows just that:
The X-axis is the correlation percentage. The Y-axis is the number of flip-flops with the
corresponding correlation percentage. As can be seen, only a very small percentage of
flip-flops have low correlation, and a very large percentage of flip-flops have high
correlation.
We can therefore predict a high power saving potential.
After implementing the entire design flow on three different designs and measuring power with
the simulation program SpyGlass, the results obtained at CEVA were:
The tables below show the detailed results obtained with SpyGlass on Design C:
Golden design
                      Leakage   Internal  Switching  Total
Total Power:          337uW     12.7mW    40.0mW     53.0mW
Combinational Power:  224uW     2.65mW    22.3mW     25.2mW
Sequential Power:     95.2uW    8.96mW    1.05mW     10.1mW
(unlabeled row):      0W        0W        10.3uW     10.3uW
Memory Power:         0W        0W        0W         0W
IO PAD Power:         0W        0W        0W         0W
Clock Power:          17.8uW    1.10mW    16.6mW     17.7mW
Above is the power measurement report that was derived from analysis of the golden
design. This means that no data-driven clock gating was performed on the design.
The next table shows the main power consumption data according to a given k. This
means that the data-driven clock gating process ran, and a separate design was created
for each gater fan-in size.
         Leakage  Internal  Switching  Total
golden   337uW    12.7mW    40.0mW     53.0mW
k=4      415uW    10.6mW    49.6mW     60.6mW
k=8      398uW    9.04mW    42.8mW     52.3mW
k=16     388uW    8.83mW    40.2mW     49.5mW
k=32     387uW    9.52mW    40.3mW     50.2mW
k=64     385uW    10.3mW    41.8mW     52.5mW
k=128    383uW    11.0mW    41.2mW     52.6mW
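As a worked check of the table above, the following Python sketch computes the relative change in total power for each k against the golden design. The numbers are copied from the table; the names `saving_percent`, `savings` and `best_k` are introduced here for illustration only.

```python
# Total power in mW, taken from the table above.
GOLDEN_MW = 53.0
TOTAL_MW = {4: 60.6, 8: 52.3, 16: 49.5, 32: 50.2, 64: 52.5, 128: 52.6}


def saving_percent(total_mw, golden_mw=GOLDEN_MW):
    """Relative reduction in total power versus the golden design."""
    return 100.0 * (golden_mw - total_mw) / golden_mw


savings = {k: round(saving_percent(p), 1) for k, p in TOTAL_MW.items()}
best_k = max(savings, key=savings.get)

for k, s in savings.items():
    print(f"k={k:<3d} saving = {s:+.1f}%")
```

With these figures, k=16 gives the largest reduction in total power (about 6.6% relative to the golden design), while k=4 increases total power by about 14.3%.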
The following table shows the power consumption data for the k=16 fan-in design, with some
additional variations that were made outside of the design flow.
                      Leakage   Internal  Switching  Total
Total Power:          378uW     8.88mW    37.1mW     46.3mW
Combinational Power:  255uW     2.82mW    23.2mW     26.3mW
Sequential Power:     100uW     5.14mW    1.62mW     6.86mW
(unlabeled row):      0W        0W        10.8uW     10.8uW
Memory Power:         0W        0W        0W         0W
IO PAD Power:         0W        0W        0W         0W
Clock Power:          22.9uW    917uW     12.2mW     13.2mW
This design was 22% more efficient than the original golden design above.
Result review:
- With k=16, the power saving is maximal. When the
  fan-in is too small, as in k=4, the power increases.
- The combinational power increases with Data Driven Clock Gating as a result of
  the extra logic components, the gaters. However, the sequential power and the clock
  power decrease more significantly because of the clock disabling technique.
- Although the design already had control driven clock gating, the activity test shows
  that there is still room to save power, because the activity of 98% of the flip-flops
  was low and the correlation between most of them was high.
Conclusions
The results shown in the previous chapter show that
Professor Shmuel Wimer's research on Data-Driven Clock Gating yields a practical and
efficient power reduction tool. The design flow that was developed during this project
turned the research into a practical tool that can transform a given RTL design into a more
energy-efficient one.
Design flow review:
- The ability to work in RTL mode saved a lot of runtime in the design flow and made
  it more effective. This change becomes more relevant, and even crucial,
  when implementing the design flow on a large design, because of the exponential
  growth of runtime in every stage of the design flow.
- We added overhead to the design in the form of logic components, the gaters.
  The main power saving element was the ability to group a number of flip-flops together,
  using statistical knowledge as a tool. Both of these aspects appeared in the
  results, as an increase and a decrease of power, respectively, in the final design.
- Even when a design has clock gaters driven by control, Data-Driven Clock
  Gating proves effective. The fact that most of the flip-flops were not active for most
  of the run time, and the high correlation between most of them, made it possible to
  decrease power despite the control driven gaters.
- There is still room for improvement in the versatility and user friendliness of the
  scripts and the design flow. The limitations of the scripts create a need to
  change the design. This happens because the scripts cannot handle a design that
  has combinational and sequential logic mixed. In addition, the scripts will not
  work on a design that has a synchronous reset. The code addition to the test bench
  necessary for the tracing stage should be done as part of the flow (by one of
  the programs) and not manually. It would be ideal to create a main program with a
  graphical user interface (GUI) that combines the entire design flow. That way, the flow
  would be easier to run and more user friendly.
- There is still a need to obtain results in ASIC to confirm the efficiency of
  implementing Data-Driven Clock Gating.
- A good simulation that mimics a real application use of the design
  has significant influence on the effectiveness of the design flow. This is because
  the technique is based on statistics and correlation: the more realistic
  the simulation, the more accurate the statistical results.
- Our attempts to measure the power consumption on the FPGA boards were not
  successful. The reason was that the boards have a very high static power
  consumption level, due to all their BRAMs and LUTs. Even after replicating the
  design 100 times and measuring the power consumption with the ISE ChipScope
  using the built-in 0.005 ohm series resistor, the power difference was not apparent.
  That is probably the reason FPGA boards are used in the industry to
  check the design integrity of low power devices, while the actual devices are
  manufactured as ASICs.
Appendix A
DMAC Spec:
[Figure: block diagram showing the Transceiver Chip (PHY), UTMI, Protocol Engine, DMAC, AGE and System Bus]
Fig. 1 - GOK USB2.0 Device System Block Diagram
The USB 2.0 Device DMAC is pre-programmed (ROM), to perform the required data
transfers to and from the AGE Application Function Core. The DMAC Configuration Memory
contains the necessary information to access any Endpoint Buffer (Memory or Register Files),
in the AGE Core.
The PE issues a Transaction Request command signal and a Packet Transfer Request signal to
the DMAC, for a specific AGE Endpoint. The DMAC responds with an Acknowledge signal
to the PE and starts data transfer transactions between the PE Packet Buffers and the EP Buffers
(Registers or Memory) over the system bus, by issuing the Endpoint Buffer Address and the Read and
Write control signals, while monitoring the AGE Wait signal (for slow Memories). Data transfers
are performed either as single bus cycle 16bit word transfers (flyby mode) or in multiple
bus cycles (gather-scatter mode), to match different source and destination bus widths. In both
single packet and multiple packet data transfers, a specific EP Input Transaction
(from EP to Host) is terminated by the DMAC monitoring the End-Of-Transaction signal, issued by
the Function (last EP Buffer address reached). In the case of an Output Transaction (from Host to
EP), if the last packet size is smaller than the pre-defined EP MaxPacketSize, or the packet has
data size = 0 (zero), the PE de-asserts its Transfer Request signal. In the case of multiple packet
data transfers, only the Packet Transfer Request signal is de-asserted, and the DMAC carries
on with the next packet data transfer as soon as the Packet Transfer Request signal is asserted again.
When both Transfer Request and Packet Request are de-asserted, the DMAC returns to its idle
state and is ready to perform the next transfer request.
The DMAC accesses the PE's RX/TX Buffers (FIFOs) as I/O Devices, using dedicated PE
read/write signals. Data is transferred over the system data bus as 16bit words.
USB2.0-aware DMAC
The DMAC's three main modules are the Control Core (FSM), the Configuration ROM and the Data
Interface Unit (DIU).
2.1. DMAC Top Level Introduction
[Figure: block diagram showing the Protocol Engine, Control Core (FSM), DIU, Configuration ROM and System Bus]
Fig. 2 - DMAC Block Diagram
2.2. DMAC Modules
The DMAC is partitioned into modules as shown in Fig. 2 Block Diagram and described
below.
2.2.1. Configuration ROM
The Configuration ROM contains the essential information necessary to access any
pre-defined Application Function Endpoint Buffer (Memory or Register Files). The
Configuration information enables the DMAC to properly carry out the data
transactions requested by the PE. At transaction request time the PE issues the
Endpoint's transfer direction (IN-OUT) and Endpoint number (1-15), so the specific
Endpoint Buffer can be selected; the EP Buffer data width (DW), however, must reside within
the Configuration ROM.
2.2.2. Control Core
The Control Core is the main Finite State Machine (FSM), handling all Device
operations.
At system boot time, the DMAC enters its Idle State, ready to carry out data transfers.
It operates under PE control.
The Control Core translates PE requests into data transfer actions, according to the
information stored in the Configuration ROM. The PE initiates a data transfer operation by
asserting the Transfer Request signal and supplying the Endpoint info (4bit EP number + 1bit in/out).
PE requests are transferred to the Control Core. The Control Core employs the
pre-programmed EP Buffer Data Width (DW) information to perform either a
flyby transaction or a gather-scatter transaction. With each bus data transfer, the value
in the current address counter is driven onto the address bus, and the current address
counter is automatically incremented. At transaction completion (single or multiple
[Figure: data register layout. Recoverable labels: registers R3, R2 and R1, with bit boundaries 32/31, 24/23 and 16/15]
Gather-Scatter operation:
- IN (from EP to PE TX FIFO)
32bit: Read 32bit word into R1-R2-R3. Write 2 16bit words from R1 & R2+R3.
24bit: Read 2 24bit words into R1+R2 & R3+R4. Write 3 16bit words from R1,
R2+R3 & R4.
48bit: Read 48bit word into R1-R2-R3-R4. Write 3 16bit words from R1, R2+R3 &
R4.
- OUT (from PE RX FIFO to EP)
32bit: Read 2 16bit words into R1 & R2+R3. Write 32bit word from R1-R2-R3.
24bit: Read 3 16bit words into R1-R2-R3-R4. Write 2 24bit words from R1+R2 &
R3+R4.
48bit: Read 3 16bit words into R1-R2-R3-R4. Write 48bit word from R1-R2-R3-R4.
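The gather-scatter operations listed above amount to repacking a fixed number of bits between words of different widths. The Python sketch below models this under two assumptions that the specification excerpt does not state explicitly: the first word transferred occupies the least significant bits, and the register widths (R1 and R4 16 bits, R2 and R3 8 bits) are inferred from the word counts given above. All function names are hypothetical.

```python
def gather(words, width):
    """Pack words of `width` bits into one integer, first word lowest."""
    value = 0
    for i, w in enumerate(words):
        assert 0 <= w < (1 << width), "word does not fit the stated width"
        value |= w << (i * width)
    return value


def scatter(value, width, count):
    """Split an integer into `count` words of `width` bits, lowest first."""
    mask = (1 << width) - 1
    return [(value >> (i * width)) & mask for i in range(count)]


def ep_to_pe(ep_words, ep_width):
    """IN direction: EP words of 24/32/48 bits become 16-bit FIFO words."""
    total_bits = len(ep_words) * ep_width
    assert total_bits % 16 == 0, "need a whole number of 16-bit words"
    return scatter(gather(ep_words, ep_width), 16, total_bits // 16)


def pe_to_ep(fifo_words, ep_width):
    """OUT direction: 16-bit FIFO words become EP words of ep_width bits."""
    total_bits = len(fifo_words) * 16
    assert total_bits % ep_width == 0, "need a whole number of EP words"
    return scatter(gather(fifo_words, 16), ep_width, total_bits // ep_width)


# 24-bit endpoint: two 24-bit reads become three 16-bit writes.
fifo = ep_to_pe([0xABCDEF, 0x123456], 24)
```

Under these assumptions the 24-bit case reads two EP words and writes three 16-bit FIFO words, and `pe_to_ep(fifo, 24)` restores the original pair, matching the IN and OUT descriptions for the 24bit mode.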
2.3. Interfaces
Signal Name  Signal Type     Description
dbus[47:0]   Bi-directional  Data Bus. These pins serve as input and output System data bus
                             (for local µC, PE Packet Buffers and Application Buffers).
abus[23:0]   Bi-directional  Address Bus. Serves as System Address Bus for the DMAC.
                             16 LSBs are used by the µC to access the Control Registers.
nrd          Bi-directional  System Read signal issued by Bus Masters (DMAC or µC).
nwr          Bi-directional
npbrd        Out
npbwr        Out
nwait        In-Active low
ndack        In-Active low
ntreq        In-Active low
npreq        In-Active low
neot         In-Active low
epn[3:0]     In-Active Hi
ep_dir       In-Active Hi
clk          Input
nrst         In-Active low
Vcc          Input
Vss          Input
Note: DMAC uses Endpoint number (epn[3:0]) and Transaction direction (ep_dir), as
internal ROM address, to perform the expected data transfer to/from the specific End Point
at the Function Core. They serve as chip selects for the Buffers within the Function Cores.
DMAC also issues nrd/nwr, npbrd/npbwr signals and current EP Buffer address, to handle
data transfer. Control signals npbrd or npbwr are used by the PE to drive RX FIFO output
data onto the system data bus or to latch the data from the system bus to the TX FIFO,
depending on transfer direction.
2.4. Programming Model
2.4.1. Configuration ROM
The ROM holds the configuration data. A single function within a single
configuration, having up to 15 OUT and 15 IN Endpoints, is supported.
A set of 15 OUT & 15 IN Endpoints information is pre-programmed in the ROM.
Information for each Endpoint includes:
- EP Data Width (DW in 16bit words) - 2bits/EP [00-16bit, 01-32bit, 10-2x24bit, 11-48bit].
Default - 00.
Data per EP is 2 bits. A 16bit ROM Word holds the DW info of 8 Endpoints.
The information for 15 EPs/Dir/Function is stored in 2 16bit ROM Words.
The total Configuration ROM size is 2 x 2 = 4 16bit ROM Words.
Individual ROM Words are accessed via an internal 2-bit address bus.
2.4.2. ROM Words Data Formats
Fn_EPn_I/O Registers: EP Data Size (Data Width units)
ROM Word for EPs 1-8:
MSB                                                             LSB
D15 D14 | D13 D12 | D11 D10 | D9 D8 | D7 D6 | D5 D4 | D3 D2 | D1 D0
  EP8   |   EP7   |   EP6   |  EP5  |  EP4  |  EP3  |  EP2  |  EP1
ROM Word for EPs 9-15 (bits D15-D0, fields listed from the MSB side to the LSB side):
EP15 | EP14 | EP13 | EP12 | EP11 | EP10 | EP9
Register Name   Register Description
F1_EP1_8_O      Function #1, Output EPs 1-8 DW
F1_EP9_15_O     Function #1, Output EPs 9-15 DW
F1_EP1_8_I      Function #1, Input EPs 1-8 DW
F1_EP9_15_I     Function #1, Input EPs 9-15 DW
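The Configuration ROM lookup can be sketched in Python as follows. This is an illustrative model, not the DMAC's implementation. Two details are assumptions made here because the excerpt does not state them: the four ROM words are taken to be stored in the order of the register table, and the word for EPs 9-15 is taken to be packed like the first word, with EP9 in D1:D0. The function names and the sample ROM contents are hypothetical.

```python
# DW field encoding from the Configuration ROM description.
DW_ENCODING = {0b00: "16bit", 0b01: "32bit", 0b10: "2x24bit", 0b11: "48bit"}


def rom_address(epn, is_in):
    """2-bit ROM word address for an endpoint.

    ASSUMPTION: words are stored in the order of the register table
    (OUT 1-8, OUT 9-15, IN 1-8, IN 9-15).
    """
    if not 1 <= epn <= 15:
        raise ValueError("endpoint number must be 1..15")
    return (2 if is_in else 0) + (1 if epn > 8 else 0)


def dw_field(rom_word, epn):
    """Extract the 2-bit DW field of endpoint `epn` from its ROM word.

    EP1 occupies D1:D0 and EP8 occupies D15:D14 of the first word.
    ASSUMPTION: the second word is packed the same way, EP9 at D1:D0.
    """
    shift = 2 * ((epn - 1) % 8)
    return (rom_word >> shift) & 0b11


def endpoint_width(rom, epn, is_in):
    """Look up the data width string for an endpoint in a 4-word ROM."""
    return DW_ENCODING[dw_field(rom[rom_address(epn, is_in)], epn)]


# Hypothetical ROM contents: OUT EP1 = 32bit, OUT EP8 = 48bit,
# IN EP9 = 2x24bit, everything else the default 16bit.
rom = [0xC001, 0x0000, 0x0000, 0x0002]
```

For example, `endpoint_width(rom, 8, False)` selects word 0 and reads bits D15:D14, which hold `11` in this sample ROM.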
3. Implementation
The DMAC is designed as a Front-End for near future ASIC implementation. It is designed
using Verilog HDL and simulated/logically verified for correct operation, using Cadence
Incisive Simulator.
Intermediate Hardware Implementation, for proof of concept and correct functionality, is
performed using FPGA Device, located on Altera DE2 Development Board, under Quartus II
Development Environment. The Incisive logically verified Verilog code is used for
implementation.
Quartus II MegaFunction Wizard is not used.
There is an option to incorporate the Protocol-Aware DMAC into the USB2.0 Protocol Engine.
[Figure: system block diagram. Recoverable blocks: XTAL Oscillator, LDO, USB Connector, USB2.0 PHY (UTMI), USB2.0 Protocol Engine, USB2.0-Aware DMAC, Function Core 0, Data Bus, Address Bus. Recoverable signals: clk, nrst, ndack, ntreq, npreq, ep_dir, epn[3:0], npbwr, npbrd, nwr, nrd, neot, nwait, abus[23:0], dbus[47:0], +Vcc, -Vss]
these transfers are very efficient; however, memory-to-memory transfers are not
possible in this mode.
4.1.2. Gather-Scatter DMA Transfer
This type of transfer is useful for interfacing devices with different data bus sizes. The
DMA employs multiple-cycle, multiple-address data transfers, called Gather-Scatter
transfers.
The data being transferred is first read from the I/O device or memory into
temporary DMA internal data registers. The data is then written to the memory or I/O
device in the next cycles.
This device has only a single address counter and hence supports only memory-to-I/O
transfers.