
Carbon Nanotube Computing

Subhasish Mitra
Collaborator: H.-S. Philip Wong

Department of EE & Department of CS


Stanford University

US National Academy of Sciences (2011)

Improve Computing Performance

[Figure: two levers, device performance and system integration]
Option 1: Better Transistors

- Few experimental demos
- Better transistors alone do not make a system
[Figure: device performance vs. system integration]

Option 2: Design Tricks

- Multi-cores, power management
- Limited tricks remain
- Complexity breeds design bugs
[Figure: device performance vs. system integration]

Improve Computing Performance

- Multi-cores, power management
- Target: 1,000X performance
- New innovations required

Solution: Nanosystems

- Transform new nanotech into new systems
- New devices, new fabrication, new sensors enable new applications
- New architectures
- Open questions: imperfections? large-scale fabrication? variability?

Carbon Nanotube FET (CNFET)

- CNT diameter d = 1.2 nm: sub-lithographic
- Energy-delay product: ~10X benefit
- Full-chip case studies [ARM, IBM, IMEC, Stanford]
[Micrograph: CNFET gate; 2 µm scale bar]

Example: IBM POWER 7

[Figure: energy (fJ/transition) vs. performance (10^15 transitions/sec) for a modeled IBM POWER 7; D. Frank and W. Haensch, IBM]
- Si nodes: 22, 14, 11, 8, 5 (FinFET); CNFET nodes: 8, 5
- In the preferred 20-100 W/cm2 region, CNFETs give roughly 1/3 the energy or 3X the performance
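The ~10X energy-delay-product figure can be sanity-checked with simple arithmetic. A minimal sketch, using the rough 1/3-energy and 3X-performance numbers from the POWER 7 comparison as illustrative inputs (not the study's detailed models):

```python
# Energy-delay product (EDP) gain compounds an energy gain and a delay gain
# multiplicatively (delay improves as 1/performance).
def edp_benefit(energy_factor, delay_factor):
    """Overall EDP improvement from separate energy and delay improvements."""
    return energy_factor * delay_factor

# ~1/3 the energy and ~3X the performance compound to roughly 10X EDP.
print(edp_benefit(3.0, 3.0))  # 9.0, i.e. ~10X
```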

Many FAQs Answered

1. Where do the benefits come from?
2. What about CNFET contact resistance?
3. Do wires limit performance? Why research FETs?
4. Aren't FETs good enough already?
5. What about CNT variations?

CNFET Inverter

[Figure: inverter layout with INPUT, P+ doped and N+ doped regions]

Big Promise, Major Obstacles

- Mis-positioned CNTs
- Metallic CNTs
- Process advances alone are inadequate
- Imperfection-immune design paradigm [Zhang IEEE TCAD 12]

CNT Growth circa 2005

- Highly mis-positioned
[Micrograph: 10 µm scale bar]

First Wafer-Scale Aligned CNT Growth

- Quartz wafer with catalyst → aligned CNT growth → quartz wafer with CNTs
- 99.5% aligned CNTs
- Stanford Nanofabrication Facility
[Micrograph: 20 µm scale bar]
[Patil VLSI Tech. 08, IEEE TNANO 09]

Wafer-Scale CNT Transfer

- Decouples high-temperature CNT growth (900 °C, on quartz) from low-temperature circuit fabrication (120 °C, on SiO2/Si) via CNT transfer
[Micrographs: CNTs before transfer (quartz) and after transfer (SiO2/Si); 2 µm scale bars]
[Patil VLSI Tech. 08, IEEE TNANO 09]

Mis-Positioned CNT-Immune NAND

1. Grow CNTs
2. Pattern extended gate and contacts (crucial)
3. Etch gate region & CNTs (etched region essential)
4. Dope P & N regions
- Extends to arbitrary logic functions via graph algorithms
[Figure: NAND layout with Vdd, inputs A and B, Out, Gnd]
[Patil IEEE TCAD 09]

VLSI Metallic CNT Removal

- Universally effective at arbitrary technology nodes, 10 nm & beyond: both relaxed and scaled circuits
- Record selectivity: 99.99% of m-CNTs erased, only 1% of s-CNTs erased
[Shulaker IEDM 15]
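To see why this selectivity matters, a hedged back-of-the-envelope: the 1/3 as-grown metallic fraction below is a typical literature value assumed for illustration; only the 99.99% and 1% removal numbers come from the result above.

```python
def remaining_metallic_fraction(grown_metallic=1 / 3,
                                m_removed=0.9999, s_removed=0.01):
    """Fraction of surviving CNTs that are metallic after selective removal."""
    m_left = grown_metallic * (1 - m_removed)          # metallic CNTs kept
    s_left = (1 - grown_metallic) * (1 - s_removed)    # semiconducting CNTs kept
    return m_left / (m_left + s_left)

# Metallic CNTs drop from ~1 in 3 as grown to roughly 1 in 20,000.
print(remaining_metallic_fraction())
```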

Most Importantly

- VLSI processing: no per-unit customization
- VLSI design: imperfection-immune CNT library

First Sub-system: ISSCC Demo

- ISSCC Jack Raper Outstanding Technology Directions Paper
- Sacha: CNT-controlled hand-shaking robot, driven by a wafer with CNFET circuits
[Shulaker ISSCC 13, IEEE JSSC 14] Collaborator: Prof. G. Gielen, KU Leuven

CNT Computer

- Turing-complete processor built entirely from CNFETs
- Instruction fetch, data fetch, ALU, write-back
- Prof. M. Shulaker, MIT
[Shulaker Nature 13]

System Demos


Reproducible Results

- 80 ALUs (~1,600 CNFETs)
- 200 D-latches (~1,800 CNFETs)
- Measured waveforms overlaid

High-Performance CNFETs

- Current drive set by: doping, dielectric interactions, contact resistance, scaling

High-Performance CNFETs

- High-density CNTs (> 100 CNTs/µm): a major challenge
- New result: > 100 CNTs/µm with controlled variations; record ION density
[Figure: ION (µA/µm), CNFET (Stanford lab) vs. Si FET (foundries)]
[Shulaker IEDM 14]

CNT Variations NOT a Showstopper

- Co-optimize processing & design: m-CNT removal, special layouts, CNT spacing, sizing
- Optimized guidelines trade off energy, delay, and noise margin
- Circuit energy penalty and circuit delay penalty: < 5%
[Zhang IEDM 11, Hills IEEE TCAD 15]
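As a hedged illustration of one variation source that the CNT-spacing and sizing knobs address: the number of CNTs under a gate can be modeled as Poisson-distributed. The 125 CNTs/µm density and the gate widths below are assumptions for illustration, not the papers' parameters.

```python
import math

def p_open(cnts_per_um, gate_width_um):
    """Poisson probability that a gate of this width covers zero CNTs."""
    return math.exp(-cnts_per_um * gate_width_um)

# Upsizing a transistor raises its expected CNT count and exponentially
# suppresses the chance of a non-functional (zero-CNT) device -- one reason
# sizing appears among the optimization knobs above.
for w_um in (0.032, 0.064, 0.128):
    print(w_um, p_open(125, w_um))
```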

10X EDP, BUT

How can we do better?

Abundant-Data Applications

- Huge memory wall for processors and accelerators
- Energy measurements (Intel performance counter monitors; 2 CPUs, 8 cores/CPU, 128 GB DRAM):
  - Genomics classification: 5% compute, 95% memory
  - Natural language processing: 18% compute, 82% memory
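These breakdowns explain why compute-only improvements hit a wall. A small Amdahl's-law-style sketch, applied here to energy (the function and speedup values are illustrative, not from the measurements):

```python
def best_overall_gain(memory_fraction, compute_improvement=float("inf")):
    """Overall energy gain when only the compute share improves."""
    compute_fraction = 1.0 - memory_fraction
    return 1.0 / (memory_fraction + compute_fraction / compute_improvement)

# With 95% of energy spent in memory (the genomics case), even infinitely
# better compute yields at most ~1.05X overall -- the memory wall in one line.
print(best_overall_gain(0.95))
```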

N3XT: Nano-Engineered Computing Systems Technology

[Aly IEEE Computer 15]

N3XT Nanosystems

- Computation immersed in memory
- Fine-grained, ultra-dense 3D: memory interleaved with computing logic
- Increased functionality
- Impossible with today's technologies

N3XT Computation Immersed in Memory

Layer stack, linked by ultra-dense, fine-grained vias (no TSVs) with thermal paths between layers:
- 3D resistive RAM: massive storage
- 1D CNFET / 2D FET logic: compute, RAM access
- STT MRAM: quick access
- 1D CNFET / 2D FET logic: compute, RAM access
- 1D CNFET / 2D FET logic: compute, power, clock
- Silicon-compatible substrate

Many Nano-scale Innovations

- Memory & logic devices: 3D resistive RAM (RRAM); 2D FETs from large-area monolayer MoS2 (< 1 nm thick)
- Embedded cooling (30 µm thick): phase change suppresses hotspots; vertical metal nanowire arrays

3D Integration

- Massive ILV density >> TSV density
- Chip stacking relies on through-silicon vias (TSVs); dense (e.g., monolithic) integration uses nano-scale inter-layer vias (ILVs)
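The density gap can be made concrete with a pitch argument: via density scales as 1/pitch². The pitches below are ballpark assumptions for illustration (a few µm for TSVs, ~100 nm for nano-scale ILVs), not values from the slides.

```python
def vias_per_mm2(pitch_um):
    """Vias per mm^2 for a square grid at the given pitch (in um)."""
    return (1000.0 / pitch_um) ** 2

tsv_density = vias_per_mm2(5.0)   # TSV, chip stacking
ilv_density = vias_per_mm2(0.1)   # nano-scale ILV, monolithic 3D
print(ilv_density / tsv_density)  # ~2,500X denser at these assumed pitches
```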

Realizing Monolithic 3D

- Requires low-temperature fabrication: < 400 °C

First CNT Monolithic 3D ICs

- Nano-scale vias, no TSVs
- Process temperature < 250 °C
- Inter-layer digital circuits; 3-layer integration
[Figure: measured VOUT vs. VIN transfer curve]
[Wei IEDM 13, Shulaker VLSI Tech 14]

Device + Architecture Benefits

- Naturally enabled: emerging logic + emerging memory + monolithic 3D integration
[Figure: RRAM stack with top electrode, metal oxide, bottom electrode]

3D NanoSystem

- Wafer-scale design + fabrication
[Unpublished]

Unique N3XT Technology

- End-to-end: isolated improvements are inadequate
- Existing efforts: chip stacking, memories, new apps
- N3XT: 1D/2D FETs, RRAM, MRAM; new 3D fabrication; nanoscale cooling; architecture & software; yield, reliability; abundant-data apps

Complement with Software Solutions

- DSL compiler (DSL = domain-specific language); co-optimized software + hardware
- Learning: key architectural concept
- Cross-layer resilience, runtime optimization: yield, reliability

N3XT Framework

- Heterogeneous nanotechnologies
- Architecture design space
- Physical design
- Integrated thermal analysis
- Yield, reliability

Sweet Spot: Abundant-Data Apps.

IBM graph analytics, 2D baseline vs. single-chip N3XT:
- 2D: 64 processor cores, SRAM cache, DDR3 interface, 64 GB off-chip DRAM
- N3XT: 64 processor cores, STT-MRAM cache, simple interface, 64 GB on-chip 3D RRAM

Sweet Spot: Abundant-Data Apps.

- IBM graph analytics: ~1,000X benefits, software programmable
[Chart: energy x execution-time benefits. PageRank: 851x; SSSP: 656x; Connected Components: 400x; BFS: 510x; Logistic Regression: 700x; Linear Regression: 970x]
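The per-app numbers are energy times execution-time products. A one-line check against the PageRank breakdown reported on the following slide (37X energy, 23X execution time):

```python
# The headline per-app benefit is an EDP-style product of the energy gain
# and the execution-time gain; for PageRank: 37X energy x 23X time.
energy_gain, time_gain = 37, 23
print(energy_gain * time_gain)  # 851
```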

Sweet Spot: Abundant-Data Apps.

- IBM graph analytics, PageRank app: 851X benefits
- Energy: 37X reduction (N3XT at 2.7% of the 2D baseline)
- Execution time: 23X reduction (N3XT at 4.3% of the 2D baseline)
[Chart: normalized energy and execution time, 2D vs. N3XT]

Sweet Spot: Abundant-Data Apps.

- IBM graph analytics, PageRank app: 851X benefits (energy: 37X; execution time: 23X)
- Breakdown into processor-active, processor-stall, and memory-access components: in N3XT, stalls and memory accesses shrink to a few percent of the 2D baseline
[Chart: normalized energy and execution-time breakdown, 2D vs. N3XT]

More Opportunities

- Specialization
- Neuro-inspired computing
- Technology innovations

Students, Sponsors & Collaborators


[Figure: roadmap from commodity hardware (today) toward N3XT (the day after tomorrow)]

Conclusion

- Nanosystems today: compute + memory + sensing, densely interwoven
- Game ON, to the N3XT 1,000X era
