Subhasish Mitra
Collaborator: H.-S. Philip Wong
Device performance
Power management
Limited tricks
Complexity, design bugs
Target: 1,000X performance
New innovations required
Solution: Nanosystems
Transform new nanotech into new systems
New devices
enable new applications
New fabrication
New architectures
imperfections?
New sensors
large-scale fabrication?
variability?
~ 10X benefit
Full-chip case studies
[ARM, IBM, IMEC, Stanford]
Sub-litho
[Figure: Energy (fJ/transition) versus Performance (10^15 transitions/sec.), with power-density lines at 100 W/cm² and 20 W/cm². Nodes: 8 .. 5 (CNFET). Annotations: x1/3, x3, Preferred. Source: D. Frank and W. Haensch, IBM]
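The x1/3 and x3 annotations can be related to the "~ 10X benefit" figure with a short calculation. The sketch below assumes that x1/3 applies to energy per transition, that x3 applies to performance, and that the benefit is measured as energy-delay product (EDP); the slide does not name the metric, so all three are assumptions, and the function name is hypothetical.

```python
def edp_benefit(energy_scale: float, performance_scale: float) -> float:
    """Return the improvement factor in energy-delay product (EDP).

    energy_scale: new energy per transition divided by old (assumed 1/3).
    performance_scale: new transitions per second divided by old (assumed 3).
    Delay scales as 1 / performance, so EDP scales as energy_scale / performance_scale.
    """
    edp_scale = energy_scale / performance_scale
    return 1.0 / edp_scale


benefit = edp_benefit(energy_scale=1 / 3, performance_scale=3)
print(f"EDP benefit: {benefit:.1f}x")
```

Under these assumptions the two factors multiply to roughly 9X, which is in line with a "~ 10X" claim.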
CNFET Inverter
INPUT
P+ Doped
N+ Doped
Metallic CNTs
Imperfection-immune paradigm
[Zhang IEEE TCAD 12]
Highly mis-positioned
10 µm
20 µm
Stanford Nanofabrication Facility
[Patil VLSI Tech. 08, IEEE TNANO 09]
CNT transfer
Before transfer: CNTs on quartz (900 °C)
After transfer: CNTs on SiO2/Si (120 °C)
Scale bars: 2 µm
[Patil VLSI Tech. 08, IEEE TNANO 09]
1. Grow CNTs
2. Extended gate, contacts
3. Etch gate & CNTs
4. Dope P & N regions
Etched region essential
Graph algorithms
[Figure: gate layout with inputs A and B, output Out, and Gnd; callout "CRUCIAL"]
Universally effective
Relaxed node
m-CNTs Erased
Scaled circuits
Record selectivity
99.99% m-CNTs erased, 1% s-CNTs erased
[Shulaker IEDM 15]
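To give a feel for what this selectivity means, the sketch below applies the two erase rates to a starting population. The starting mix (one million CNTs, one third of them metallic) is a hypothetical input chosen for illustration, not a number from the slide, and the function name is made up for this example.

```python
def surviving_cnts(n_metallic: float, n_semiconducting: float,
                   m_erased: float = 0.9999, s_erased: float = 0.01):
    """Return (metallic, semiconducting) CNT counts left after removal.

    m_erased: fraction of m-CNTs erased (99.99% on the slide).
    s_erased: fraction of s-CNTs erased (1% on the slide).
    """
    m_left = n_metallic * (1.0 - m_erased)
    s_left = n_semiconducting * (1.0 - s_erased)
    return m_left, s_left


# Hypothetical starting population: one million CNTs, one third metallic.
total = 1_000_000
m_left, s_left = surviving_cnts(total / 3, 2 * total / 3)
metallic_share = m_left / (m_left + s_left)
print(f"m-CNTs left: {m_left:.0f}, s-CNTs left: {s_left:.0f}")
print(f"metallic share after removal: {metallic_share:.2e}")
```

With these inputs only a few tens of m-CNTs remain while about 99% of the s-CNTs survive.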
Most Importantly
VLSI processing
No per-unit customization
VLSI design
Immune CNT library
[Shulaker ISSCC 13, IEEE JSSC 14] Collaborator: Prof. G. Gielen, KU Leuven
CNT Computer
Prof. M. Shulaker, MIT
Data Fetch
ALU
Write-back
System Demos
Reproducible Results
80 ALUs
200 D-Latches
~ 1,600 CNFETs
~ 1,800 CNFETs
Waveforms overlaid
High-Performance CNFETs
Doping
Current Drive
Dielectric interactions
Contact Resistance
Scaling
High-density CNTs
ION (µA/µm)
Controlled variations
CNFET (Stanford lab)
Si FET (foundries)
[Shulaker IEDM 14]
Special layouts
CNT spacing
Sizing
Optimized Guidelines
Energy
[Zhang IEDM 11, Hills IEEE TCAD 15]
Noise margin
Delay
Abundant-Data Applications
Huge memory wall: processors, accelerators
Energy Measurements
[Figure: energy breakdown for two workloads, one labeled Genomics classification. Compute: 5% and 18%; Memory: 95% and 82%]
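These compute/memory shares bound how far compute-only improvements can go. The sketch below is an Amdahl-style calculation, assuming the 5% and 18% values are compute shares of total energy and that memory energy stays unchanged; the function name is hypothetical.

```python
def best_case_energy_benefit(compute_share: float,
                             compute_gain: float = float("inf")) -> float:
    """Total energy benefit when only the compute share improves.

    compute_share: fraction of total energy spent in compute (0..1).
    compute_gain: factor by which compute energy is reduced
                  (infinite means compute energy becomes free).
    Memory energy is assumed unchanged.
    """
    memory_share = 1.0 - compute_share
    return 1.0 / (memory_share + compute_share / compute_gain)


for compute_share in (0.05, 0.18):
    limit = best_case_energy_benefit(compute_share)
    print(f"compute share {compute_share:.0%}: at most {limit:.2f}x total energy benefit")
```

Even with free computation, the total benefit stays close to 1X in both cases, which is one way to read the "memory wall" claim.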
Nano-Engineered Computing Systems Technology
N3XT Nanosystems
Computation immersed in memory
Increased functionality
Memory
Fine-grained, ultra-dense 3D
Computing logic
1D CNFET, 2D FET
Compute, RAM access
No TSV
thermal
STT MRAM
Quick access
Ultra-dense, fine-grained vias
1D CNFET, 2D FET
Compute, RAM access
thermal
Silicon compatible
1D CNFET, 2D FET
Compute, Power, Clock
thermal
Embedded cooling
30 µm thick
MoS2
<1 nm
3D Integration
Nano-scale inter-layer vias (ILVs)
Realizing Monolithic 3D
Process temp. < 250 °C
3-Layer integration
[Figure: VOUT (V) versus VIN (V)]
+
Emerging logic
[Figure: memory cell stack: Top Electrode, Metal Oxide, Bottom Electrode]
Emerging memory
Monolithic 3D integration
3D NanoSystem
Wafer-scale design + fabrication
[Unpublished]
End-to-end
Isolated improvements inadequate
Existing efforts
Chip stacking
N3XT
1D / 2D FETs, RRAM, MRAM
New 3D fabrication
Nanoscale cooling
Memories
New apps
Architecture & software
Yield, reliability
Abundant data apps
DSL compiler
Co-optimized s/w + h/w
Learning: key architectural concept
Cross-Layer Resilience
Runtime optimization
Yield, reliability
N3XT Framework
Heterogeneous nanotechnologies
Physical design
Yield, reliability
2D: 64 GB off-chip DRAM, DDR3 interface, SRAM cache, 64 processor cores
Single-chip N3XT: 64 GB on-chip 3D RRAM, simple interface, STTRAM cache, 64 processor cores
[Figure: bar chart with Energy and Execution Time series for PageRank, SSSP, Connected Components, BFS, Logistic Regression, and Linear Regression. Bar labels: 851x, 656x, 400x, 510x, 700x, 970x]
851X benefits (PageRank app.)
[Figure: two bar charts comparing 2D and N3XT on axes from 0% to 100%. One chart is titled Energy: 37X. The N3XT bars are labeled 2.7% and 4.3%. A second version of the charts adds a zoomed 0% to 3% axis and a legend: Processor active, Processor stall, Memory access]
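The 851X, 37X, 2.7%, and 4.3% figures for PageRank fit together if the overall benefit is read as the product of an energy benefit and an execution-time benefit. That reading is an assumption (the slide does not name the combined metric), and the function name below is hypothetical.

```python
def implied_time_benefit(total_benefit: float, energy_benefit: float) -> float:
    """Execution-time benefit implied by total = energy benefit * time benefit."""
    return total_benefit / energy_benefit


time_benefit = implied_time_benefit(851, 37)
print(f"implied execution-time benefit: {time_benefit:.0f}x")
print(f"N3XT energy as a share of 2D: {100 / 37:.1f}%")
print(f"N3XT time as a share of 2D: {100 / time_benefit:.1f}%")
```

Under this reading the implied execution-time benefit is 23X, and the two shares come out at about 2.7% and 4.3%, which match the labels on the N3XT bars.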
More Opportunities
Specialization
Neuro-inspired
Technology innovations
The Day After Tomorrow: N3XT
Tomorrow: CGRAs
Today: Commodity hardware
B. DySER's Architecture and Execution Model
To address the challenges of SIMD compilation, we leverage the DySER architecture as our in-core accelerator. In this subsection we briefly describe DySER; further details are in Govindaraju et al. [10], [9].
Architecture. DySER is an array of configurable functional units connected by a circuit-switched network of simple switches. A functional unit can be configured to receive its inputs from any of its neighboring switches. When all its inputs arrive, it performs the operation and delivers the output to a neighboring switch. Switches can be configured to route their inputs to any of their outputs, forming a circuit-switched network. With this configurable network of functional units, a specialized hardware datapath can be created for a sequence of computation. It supports pipelining and dataflow execution with simple credit-based flow control. The switches at the edge of the array are connected to FIFOs, which are exposed to the processor core as DySER's input/output ports. DySER is tightly integrated with a general-purpose processor pipeline and acts as a long-latency functional unit that has a direct datapath from the register file and from memory. The processor can send/receive data or load/store data to/from DySER directly through ISA extensions.
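The firing rule described above (a functional unit operates once all of its inputs have arrived, with credit-based flow control between stages) can be illustrated with a small software model. This is a minimal sketch, not DySER's implementation: the `Port`, `FunctionalUnit`, and `run` names, the FIFO capacity of 2, and the two-unit datapath computing (a * b) + c are all hypothetical, and the switch network is not modeled.

```python
import operator
from collections import deque


class Port:
    """Bounded FIFO; a sender needs a credit (free slot) to send."""

    def __init__(self, capacity=2):
        self.queue = deque()
        self.credits = capacity  # free slots still available to the sender

    def can_send(self):
        return self.credits > 0

    def send(self, value):
        if not self.can_send():
            raise RuntimeError("no credit available")
        self.credits -= 1
        self.queue.append(value)

    def has_data(self):
        return bool(self.queue)

    def receive(self):
        value = self.queue.popleft()
        self.credits += 1  # consuming a value returns one credit
        return value


class FunctionalUnit:
    """Fires only when every input has data and the output has a credit."""

    def __init__(self, op, inputs, output):
        self.op, self.inputs, self.output = op, inputs, output

    def try_fire(self):
        if all(p.has_data() for p in self.inputs) and self.output.can_send():
            args = [p.receive() for p in self.inputs]
            self.output.send(self.op(*args))
            return True
        return False


def run(a_vals, b_vals, c_vals):
    """Stream operands through a hypothetical datapath computing (a * b) + c."""
    a, b, c, t, r = (Port() for _ in range(5))
    units = [FunctionalUnit(operator.mul, [a, b], t),
             FunctionalUnit(operator.add, [t, c], r)]
    pending = {a: deque(a_vals), b: deque(b_vals), c: deque(c_vals)}
    results = []
    progress = True
    while progress:
        progress = False
        for port, values in pending.items():  # inject inputs while credits allow
            if values and port.can_send():
                port.send(values.popleft())
                progress = True
        for unit in units:  # each unit fires if its operands are ready
            progress |= unit.try_fire()
        while r.has_data():  # drain the output FIFO
            results.append(r.receive())
            progress = True
    return results


print(run([2, 3], [4, 5], [1, 1]))
```

In this model the edge ports `a`, `b`, `c`, and `r` play the role of the FIFOs exposed to the processor core, and the adder simply waits until the multiplier's result is available.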
Execution Model. Figure 2 shows DySER's execution model. Before a program uses DySER, it configures DySER by pro-
Conclusion
Nanosystems today
N3XT 1,000X era