Cisco Green Research Symposium

5 March 2008

FPGA-based ASIC
Design and Verification

Dejan Markovic
Electrical Engineering Department
University of California, Los Angeles
The Issues I am Going to Address

• Power efficiency = energy efficiency ~ C·V²

• Design complexity
• Design re-entry
– Algorithm (Matlab or C)
– Fixed-point description
– RTL (behavioral, structural)
– Test vectors for logic analysis

• In this talk, I will demonstrate
– Power efficiency of 2.1 GOPS/mW (90 nm CMOS)
– 70 GOPS in 3.5 mm²
– FPGA-based design and verification

Optimization Approach
• Power efficiency: circuit-level (C, V)
• Performance and area: architectural techniques
• Unified Simulink description

Automated environment for optimal hardware design and verification

[Figure: three panels labeled optimization (macro arch., micro arch., circuit), hardware design (Simulink block diagram), and I/O verification (FPGA connected to the ASIC through GPIO)]

Circuit-Level Optimization Framework
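• Sensitivity-based optimization
– Balance sensitivity to all variables
– Variables: gate size, VDD, VTH

• Reference design
– Dmin sizing @ Vddmax, Vthref

Goal: find optimal E-D tradeoff for a datapath

[Figure: energy versus delay. One plot marks the design point f(A0,B0) at delay D0 together with the curves f(A,B0) and f(A0,B) and their sensitivities SA and SB. A second plot shows energy-delay curves for topology A and topology B under constraints.]
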
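The idea of balancing sensitivities can be illustrated with a toy model. The sketch below is an assumption made for illustration, not the circuit model from the talk: each tuning variable x_i costs energy a_i·x_i and contributes delay b_i/x_i. Minimizing energy at a fixed delay then forces the sensitivity S_i = -(∂E/∂x_i)/(∂D/∂x_i) to be equal for all variables. The function names are hypothetical.

```python
import math

def balanced_design(a, b, d_target):
    """Minimize E = sum(a_i * x_i) subject to D = sum(b_i / x_i) = d_target.

    Toy model (assumed for illustration): energy grows linearly with each
    variable and delay falls as 1/x_i. The Lagrange condition makes
    a_i * x_i**2 / b_i identical for every variable.
    """
    root_s = sum(math.sqrt(ai * bi) for ai, bi in zip(a, b)) / d_target
    return [root_s * math.sqrt(bi / ai) for ai, bi in zip(a, b)]

def energy(a, x):
    return sum(ai * xi for ai, xi in zip(a, x))

def delay(b, x):
    return sum(bi / xi for bi, xi in zip(b, x))

def sensitivities(a, b, x):
    """S_i = -(dE/dx_i) / (dD/dx_i) = a_i * x_i**2 / b_i for this toy model."""
    return [ai * xi * xi / bi for ai, bi, xi in zip(a, b, x)]

a, b = [1.0, 4.0], [1.0, 1.0]
x_opt = balanced_design(a, b, d_target=1.0)
print(x_opt, sensitivities(a, b, x_opt))
```

Any other point that meets the same delay target, for example x = [4, 4/3] here, has unequal sensitivities and higher energy.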
Circuit-Level Results: Tree Adder

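[Figure: energy map for the tree adder, E/Eref versus D/Dref, both axes from 0 to 1.5. Design points are annotated with sensitivities to sizing (SW), supply voltage (SVdd), and threshold voltage (SVth). The labels that survive extraction are SW→∞, SVth=0.2, SW=22, SVdd=1.5, SVth=22, SVdd=16, and SW=1, SVth=1, SVdd=1, plus a 65% annotation and an "Optimization" label.]

[D. Markovic, V. Stojanovic, B. Nikolic, M.A. Horowitz, R.W. Brodersen, JSSC Aug’04]

E-D space is the key for architecture optimization
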
Scaling Impacts Architecture

[Figure: Eop/Eref @ Vddmax (0.01 to 10) versus Top/Tref @ Vddmax (0.1 to 10) on logarithmic axes. Curves compare processes (L-Vt, S-Vt, H-Vt) and architectures (P2: parallel 2, T2: time-mux 2), with a direction labeled "Scaling".]

Simulink to Silicon Mapping
MDL to RTL conversion, automated P&R flow

[Figure: Simulink model (fixed-point library) → MDL → custom tool 1 → ASIC backend, annotated with Speed, Power, Area]

[R. Davis et al., JSSC Mar’02]

Including FPGA Emulation
XSG hardware library, RTL translation scripts

[Figure: Simulink model (hardware library) → RTL → custom tool 2 → FPGA backend and ASIC backend, annotated with Speed, Power, Area]

ASIC and FPGA are I/O equivalent

[K. Kuusilinna et al., book chap. in SoC Revolution, KAP 2003]

Closing the Loop: I/O Verification
I/O hardware library, automated FPGA flow

[Figure: Simulink model (I/O library, hardware library) → RTL → custom tool 3 and custom tool 2 → FPGA backend and ASIC backend, annotated with Speed, Power, Area]

FPGA implements ASIC logic analysis

[D. Markovic, C. Chang, B. Richards, H. So, B. Nikolic, R.W. Brodersen, CICC’07]

Design Approach

• Unified Simulink design environment
– Enter design once!
– Algorithm verification
– Macro-architecture
– FPGA-based ASIC debug

• Hardware-equivalent Simulink blocks
– Add, mult, shift, mux…
● Word-size, latency

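A hardware-equivalent block differs from a floating-point operator in two ways named on this slide: its result is held in a finite word, and it appears some number of clock cycles later. A minimal Python sketch of such an adder follows. The class name, the (word_bits, frac_bits) parameterization, and the saturating overflow behavior are assumptions made for illustration, not the behavior of any particular Simulink library block.

```python
from collections import deque

def quantize(value, word_bits, frac_bits):
    """Round to a signed fixed-point grid and saturate to the word size."""
    scale = 1 << frac_bits
    lo = -(1 << (word_bits - 1))
    hi = (1 << (word_bits - 1)) - 1
    code = max(lo, min(hi, round(value * scale)))
    return code / scale

class AddBlock:
    """Adder with a fixed output word size and a pipeline latency in cycles."""

    def __init__(self, word_bits, frac_bits, latency):
        self.word_bits = word_bits
        self.frac_bits = frac_bits
        # Pipeline registers start at zero, as after a reset.
        self.pipe = deque([0.0] * latency)

    def step(self, a, b):
        """Advance one clock cycle and return the value leaving the pipeline."""
        self.pipe.append(quantize(a + b, self.word_bits, self.frac_bits))
        return self.pipe.popleft()

add = AddBlock(word_bits=8, frac_bits=5, latency=2)
outputs = [add.step(0.25, 0.5), add.step(1.0, 1.0), add.step(0.0, 0.0)]
print(outputs)
```

The first sum leaves the block two cycles after it enters, which is the kind of cycle-level behavior that lets one model serve both simulation and hardware.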
Block Characterization

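[Figure: block characterization flow. Library blocks/macros (e.g. mult, add) are synthesized @ VDDref and characterized for Speed, Power, Area. Cycle time comes from FO4 inverter simulation. Pipeline logic scaling uses VDD scaling and gate sizing. The plot has axes labeled Latency and Energy, with TClk @ VDDref and TClk @ VDDopt marked.]

Goal: balanced logic depth and E/D sensitivity
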
Methodology for Architecture Selection
• Energy-Area-Delay space for architecture comparison
– Time-mux, parallelism, pipelining, VDD scaling, sizing…

[Figure: energy plotted against area (block level) and against delay (datapath). In each plot a path leads from the initial design to the optimal design; the moves are labeled parallel, pipeline, gate sizing, VDD scaling, time-mux, interleave (intl), and fold.]

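Why parallelism and VDD scaling appear together in this space can be shown with a small numerical sketch. It assumes the alpha-power-law delay model, delay ∝ VDD/(VDD − VTH)^α, with illustrative values VTH = 0.3 and α = 1.5 on a supply normalized to 1; none of these numbers come from the talk. An N-way parallel datapath may run each unit N times slower at the same throughput, so the supply can drop until gate delay has grown by N, and energy per operation scales as VDD². Overheads of parallelism (area, routing, multiplexing) are ignored.

```python
def gate_delay(vdd, vth=0.3, alpha=1.5):
    """Alpha-power-law gate delay, up to a constant factor (assumed model)."""
    return vdd / (vdd - vth) ** alpha

def vdd_for_slowdown(slowdown, vdd_ref=1.0, vth=0.3, alpha=1.5):
    """Supply at which gates are `slowdown` (>= 1) times slower than at vdd_ref."""
    target = slowdown * gate_delay(vdd_ref, vth, alpha)
    lo, hi = vth + 1e-6, vdd_ref      # delay decreases monotonically on this range
    for _ in range(60):               # bisection
        mid = 0.5 * (lo + hi)
        if gate_delay(mid, vth, alpha) > target:
            lo = mid                  # too slow at mid: the supply must be higher
        else:
            hi = mid
    return hi

def energy_ratio_with_parallelism(ways, vdd_ref=1.0, vth=0.3, alpha=1.5):
    """Energy per operation relative to the reference, using E ~ C * VDD**2."""
    vdd = vdd_for_slowdown(ways, vdd_ref, vth, alpha)
    return (vdd / vdd_ref) ** 2

print(vdd_for_slowdown(2), energy_ratio_with_parallelism(2))
```

Under these assumptions, two-way parallelism cuts energy per operation by more than half at the cost of roughly double the area, which is the kind of trade the energy-area-delay space makes visible.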
Example: 4x4 SVD Algorithm
• This complexity is hard to optimize in RTL
– 270 adders, 370 multipliers, 8 sqrt, 8 div
– Recursive LMS-based algorithm (nested feedback loops)

$$w_i(k) = w_i(k-1) + \mu_i \left[ y_i(k)\, y_i^\dagger(k)\, w_i(k-1) - \sigma_i^2(k-1)\, w_i(k-1) \right]$$

$$\sigma_i^2(k) = w_i^\dagger(k)\, w_i(k)$$

$$u_i(k) = w_i(k) / \sigma_i(k), \qquad i = 1, 2, 3, 4$$

$$y_{i+1}(k) = y_i(k) - \frac{w_i^\dagger(k)\, y_i(k)\, w_i(k)}{\sigma_i^2(k)}$$

[Figure: four cascaded UΣ LMS blocks, one per antenna (Antenna 1 to Antenna 4). y1(k) enters the first block, and a deflation stage between consecutive blocks produces the input to the next one.]

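The update and deflation equations on this slide can be written out directly. The sketch below is a floating-point, pure-Python transcription for a single index i, assuming vectors are lists of complex numbers. The function names, step size, initial guess, and the two-antenna test channel are hypothetical choices for illustration; the chip itself is a fixed-point 4x4 design.

```python
import math

def lms_update(w, y, mu):
    """w(k) = w(k-1) + mu * [y * (y^H w(k-1)) - sigma^2(k-1) * w(k-1)]."""
    sigma2_prev = sum(abs(wj) ** 2 for wj in w)              # w^H w at k-1
    y_h_w = sum(yj.conjugate() * wj for yj, wj in zip(y, w))
    return [wj + mu * (yj * y_h_w - sigma2_prev * wj) for wj, yj in zip(w, y)]

def normalize(w):
    """Return (u, sigma^2) with sigma^2 = w^H w and u = w / sigma."""
    sigma2 = sum(abs(wj) ** 2 for wj in w)
    sigma = math.sqrt(sigma2)
    return [wj / sigma for wj in w], sigma2

def deflate(w, y):
    """y_next = y - (w^H y) * w / sigma^2: remove the tracked mode from y."""
    sigma2 = sum(abs(wj) ** 2 for wj in w)
    w_h_y = sum(wj.conjugate() * yj for wj, yj in zip(w, y))
    return [yj - w_h_y * wj / sigma2 for wj, yj in zip(w, y)]

def track_dominant_mode(samples, mu=0.01, sweeps=500):
    """Run the update repeatedly over a list of received vectors."""
    w = [1 + 0j, 0.5 + 0j]                                   # arbitrary initial guess
    for _ in range(sweeps):
        for y in samples:
            w = lms_update(w, y, mu)
    return w

# Illustrative channel diag(2, 1) driven by +/-1 symbols on two antennas.
samples = [[complex(2 * x1), complex(x2)] for x1 in (1, -1) for x2 in (1, -1)]
w = track_dominant_mode(samples)
u, sigma2 = normalize(w)
print(sigma2, u, deflate(w, samples[0]))
```

For this channel the received covariance is diag(4, 1), so the tracked σ² settles near 4, u aligns with the first antenna, and deflation leaves mostly the second component for the next stage.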
Energy/Area Optimization
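• Starting point: fixed architecture

[Figure: energy plotted against area and against delay, sharing the energy axis. The 16b design is the starting point; interleaving is annotated 13.8x and folding 2.6x.]
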
Energy/Area Optimization
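• Step 1: Word-length optimization

[Figure: energy plotted against area and against delay, starting from the 16b design (interleaving 13.8x, folding 2.6x). The word-size step is annotated 30% on the area side and 30% on the delay side.]
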
Energy/Area Optimization
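• Step 2: Gate size & VDD optimization

[Figure: energy plotted against area and against delay, starting from the 16b design (interleaving 13.8x, folding 2.6x). The word-size step is annotated 30% and 30%, leading to the initial synthesis point; the sizing step is annotated 20% on the area side and 40% on the delay side.]
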
Energy/Area Optimization
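• Step 2: Gate size & VDD optimization

[Figure: energy plotted against area and against delay, starting from the 16b design (interleaving 13.8x, folding 2.6x). Word-size: 30% and 30%. Initial synthesis, then sizing: 20% and 40%. VDD scaling is annotated 7x and leads to the final design at the optimal VDD and W.]
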
Hardware Results
• Result of Energy-Area-Performance Optimization (90 nm ST Micro)

• 2.1 GOPS/mW
– 70 GOPS @ 100 MHz
– Power = 34 mW
• 20 GOPS/mm²
– 70 GOPS in 3.5 mm²

[Figure: comparison with ISSCC chips, area efficiency (GOPS/mm², 0.01 to 100) versus energy efficiency (GOPS/mW, 0.01 to 10) on logarithmic axes. The SVD chip is plotted against designs labeled by year and paper number: 2004 18-5, 1998 18-6, 1998 7-6, 1998 18-3, 2000 4-2, 1999 15-5, 2000 14-8, 2000 14-5.]

[D. Markovic, B. Nikolic, R.W. Brodersen, JSSC Apr’07]

Functional test was performed with FPGA

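The headline efficiency figures follow from the three measured quantities on this slide. A short arithmetic check in Python (the variable names are illustrative):

```python
throughput_gops = 70.0   # 70 GOPS at a 100 MHz clock
power_mw = 34.0          # measured power
area_mm2 = 3.5           # core area
clock_mhz = 100.0

energy_efficiency = throughput_gops / power_mw                 # GOPS/mW
area_efficiency = throughput_gops / area_mm2                   # GOPS/mm^2
ops_per_cycle = throughput_gops * 1e9 / (clock_mhz * 1e6)      # operations per clock

print(round(energy_efficiency, 1), area_efficiency, ops_per_cycle)
```

The operations-per-cycle value shows how much parallel hardware is active each clock, which is why the throughput is reachable at only 100 MHz.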
FPGA Based ASIC Verification

[Figure: block equation combining the ASIC, the I/O blocks, and the testbench (TB) into a single test model]

• Goal: use Simulink testbench (TB) for ASIC verification
– Develop custom interface blocks (I/O)
– Place I/O and ASIC RTL into TB model

Simulink implicitly provides the testbench

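In a flow like this, the comparison step reduces to checking that the output stream captured from hardware matches the model's output stream once the pipeline delay is accounted for. A minimal Python sketch of that check follows; the function name and the latency-offset convention are assumptions made for illustration, not part of the tool flow described in the talk.

```python
def first_mismatch(expected, captured, latency=0):
    """Compare model outputs with captured hardware outputs.

    `captured[latency + n]` is checked against `expected[n]`. Returns the
    index n of the first disagreement, or None when every sample matches.
    """
    aligned = captured[latency:]
    for n, (exp, got) in enumerate(zip(expected, aligned)):
        if exp != got:
            return n
    if len(aligned) < len(expected):
        return len(aligned)          # capture ended before the model did
    return None

model_out = [1, 2, 3]
print(first_mismatch(model_out, [0, 0, 1, 2, 3], latency=2))   # matching streams
print(first_mismatch(model_out, [1, 2, 4]))                    # differs at index 2
```

Exact comparison is appropriate only when the model is bit-accurate to the hardware, which is the premise of using hardware-equivalent blocks.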
Simulink I/O Test Model for the SVD

Emulation-based ASIC I/O test


Experimental Setup

[Figure: ASIC board connected to the FPGA board through GPIO]

Real-time at-speed ASIC verification

Measured Functionality
4x4 MIMO channel tracking
[Figure: eigenvalues σ1², σ2², σ3², σ4² (axis from 0 to 12) versus "Number of Symbols [k]" (0 to 32), comparing theoretical and hardware curves]

Up to 10 b/s/Hz with adaptive PSK

From Simulink to Optimized Hardware
Direct-mapped DFG (Simulink) → Scheduler (C++ / MOSEK) → Architecture Solutions (Simulink/SynDSP) → Hardware (FPGA/ASIC)

[Figure: automated architecture generation flow. An initial DFG built from adders, multipliers, and delay elements (D, 2D) is mapped to the resulting Simulink/SynDSP architectures: direct-mapping (reference), Architecture 1 (folding N = 2), and Architecture 2 (folding N = 4).]

ILP Scheduling & Bellman-Ford Retiming: optimal + reduced CPU time

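The slide names Bellman-Ford retiming without giving the formulation. One textbook way to pose retiming is as a system of difference constraints: moving registers changes the delay count on edge u→v from d to d + r(v) − r(u), and requiring at least m registers on that edge gives r(u) − r(v) ≤ d − m. Bellman-Ford relaxation either finds such an r or detects a negative cycle, meaning no solution exists. The Python sketch below illustrates that general technique and is not the flow's actual implementation; the function names and edge tuple layout are hypothetical.

```python
def retime(nodes, edges):
    """Solve r[u] - r[v] <= delays - min_delays for every edge (u, v, delays, min_delays).

    Returns a dict of retiming values, or None if the constraints are infeasible.
    """
    r = {n: 0 for n in nodes}        # virtual source reaches every node at distance 0
    for _ in range(len(nodes)):
        changed = False
        for u, v, delays, min_delays in edges:
            bound = delays - min_delays
            if r[v] + bound < r[u]:  # constraint violated: relax it
                r[u] = r[v] + bound
                changed = True
        if not changed:
            return r
    return None                      # still relaxing after |V| passes: negative cycle

def retimed_delays(edges, r):
    """Delay count on each edge after applying the retiming values."""
    return [(u, v, d + r[v] - r[u]) for u, v, d, _ in edges]

# Loop a -> b -> c -> a holding two registers; ask for one register on b -> c.
loop = [("a", "b", 0, 0), ("b", "c", 0, 1), ("c", "a", 2, 0)]
r = retime(["a", "b", "c"], loop)
print(retimed_delays(loop, r))
```

The total number of registers around the loop is unchanged, so a request that needs more registers than the loop holds is reported as infeasible.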
Energy-Area-Performance Map
• Each point on the surface is an optimal architecture automatically generated in Simulink after modified ILP scheduling and retiming

[Figure: surface of valid architectures over normalized Area, Performance, and Energy axes (each from 0.2 to 1), bounded by constraints. The direct-mapping (reference) design is marked, and directions on the surface are labeled retiming, parallelism, time-mux, and VDD scaling.]

• System designer can choose from feasible optimal solutions

• It is not just about functionality, but how good a solution is, and how many alternatives exist

Conclusions
• Simulink provides the level of abstraction needed for complete ASIC development
– Hardware emulation of algorithms
– Technology-driven architecture selection
– FPGA-based ASIC verification
● Logic analysis can be fully ported onto FPGA

• Energy-area-delay space is a compact way of comparing multiple architectural realizations
– ILP-based formulation automates architecture design

• Complex algorithms in 90 nm can achieve
– 2.1 GOPS/mW, 20 GOPS/mm²

References
• ASIC design and verification
– D. Markovic, V. Stojanovic, B. Nikolic, M.A. Horowitz, and R.W. Brodersen, "Methods for True Energy-Performance Optimization," IEEE J. Solid-State Circuits, vol. 39, no. 8, pp. 1282-1293, Aug. 2004.
– D. Markovic, R.W. Brodersen, and B. Nikolic, "A 70GOPS 34mW Multi-Carrier MIMO Chip in 3.5mm²," in Proc. IEEE Int'l Symp. on VLSI Circuits (VLSI'06), June 2006, pp. 196-197.
– D. Markovic, B. Nikolic, and R.W. Brodersen, "Power and Area Minimization for Multidimensional Signal Processing," IEEE J. Solid-State Circuits, vol. 42, no. 4, pp. 922-934, Apr. 2007.
– D. Markovic, C. Chang, B. Richards, H. So, B. Nikolic, and R.W. Brodersen, "ASIC Design and Verification in an FPGA Environment," in Proc. IEEE Custom Integrated Circuits Conf. (CICC'07), Sept. 2007, pp. 737-740.
• More publications available online
– www.ee.ucla.edu/~dejan

Acknowledgments
• Funding support
– C2S2 Focus Center Research Program, contract 2003-CT-888

• Infrastructure support
– ST Microelectronics, Xilinx (hardware)
– Synplicity, Synopsys, Cadence (software)
