
Carbon Nanotube Computing

Subhasish Mitra
Collaborator: H.-S. Philip Wong

Department of EE & Department of CS


Stanford University

US National Academy of Sciences (2011)

Improve Computing Performance

[Figure: two levers, device performance and system integration]
Option 1: Better Transistors

- Few experimental demos
- Better transistors alone do not make a system
[Figure: device performance vs. system integration]

Option 2: Design Tricks

- Multi-cores, power management
- Limited tricks remain
- Complexity breeds design bugs
[Figure: device performance vs. system integration]

Improve Computing Performance

- Multi-cores, power management
- Target: 1,000X performance
- New innovations required

Solution: Nanosystems

- Transform new nanotech into new systems
- New devices, new fabrication, new sensors enable new applications
- New architectures
- Open questions: imperfections? large-scale fabrication? variability?

Carbon Nanotube FET (CNFET)

- CNT diameter d = 1.2 nm: sub-lithographic
- Energy-delay product: ~10X benefit
- Full-chip case studies [ARM, IBM, IMEC, Stanford]
[Micrograph: CNFET gate; 2 µm scale bar]

Example: IBM POWER 7

[Figure: energy (fJ/transition) vs. performance (10^15 transitions/sec) for a modeled IBM POWER 7; D. Frank and W. Haensch, IBM]
- Si nodes: 22, 14, 11, 8, 5 (FinFET); CNFET nodes: 8, 5
- In the preferred 20-100 W/cm2 region, CNFETs give roughly 1/3 the energy or 3X the performance
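The ~10X energy-delay-product figure can be sanity-checked with simple arithmetic. A minimal sketch, using the rough 1/3-energy and 3X-performance numbers from the POWER 7 comparison as illustrative inputs (not the study's detailed models):

```python
# Energy-delay product (EDP) gain compounds an energy gain and a delay gain
# multiplicatively (delay improves as 1/performance).
def edp_benefit(energy_factor, delay_factor):
    """Overall EDP improvement from separate energy and delay improvements."""
    return energy_factor * delay_factor

# ~1/3 the energy and ~3X the performance compound to roughly 10X EDP.
print(edp_benefit(3.0, 3.0))  # 9.0, i.e. ~10X
```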

Many FAQs Answered

1. Where do the benefits come from?
2. What about CNFET contact resistance?
3. Do wires limit performance? Why research FETs?
4. Aren't FETs good enough already?
5. What about CNT variations?

CNFET Inverter

[Figure: inverter layout with INPUT, P+ doped and N+ doped regions]

Big Promise, Major Obstacles

- Mis-positioned CNTs
- Metallic CNTs
- Process advances alone are inadequate
- Imperfection-immune design paradigm [Zhang IEEE TCAD 12]

CNT Growth circa 2005

- Highly mis-positioned
[Micrograph: 10 µm scale bar]

First Wafer-Scale Aligned CNT Growth

- Quartz wafer with catalyst → aligned CNT growth → quartz wafer with CNTs
- 99.5% aligned CNTs
- Stanford Nanofabrication Facility
[Micrograph: 20 µm scale bar]
[Patil VLSI Tech. 08, IEEE TNANO 09]

Wafer-Scale CNT Transfer

- Decouples high-temperature CNT growth (900 °C, on quartz) from low-temperature circuit fabrication (120 °C, on SiO2/Si) via CNT transfer
[Micrographs: CNTs before transfer (quartz) and after transfer (SiO2/Si); 2 µm scale bars]
[Patil VLSI Tech. 08, IEEE TNANO 09]

Mis-Positioned CNT-Immune NAND

1. Grow CNTs
2. Pattern extended gate and contacts (crucial)
3. Etch gate region & CNTs (etched region essential)
4. Dope P & N regions
- Extends to arbitrary logic functions via graph algorithms
[Figure: NAND layout with Vdd, inputs A and B, Out, Gnd]
[Patil IEEE TCAD 09]

VLSI Metallic CNT Removal

- Universally effective at arbitrary technology nodes, 10 nm & beyond: both relaxed and scaled circuits
- Record selectivity: 99.99% of m-CNTs erased, only 1% of s-CNTs erased
[Shulaker IEDM 15]
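To see why this selectivity matters, a hedged back-of-the-envelope: the 1/3 as-grown metallic fraction below is a typical literature value assumed for illustration; only the 99.99% and 1% removal numbers come from the result above.

```python
def remaining_metallic_fraction(grown_metallic=1 / 3,
                                m_removed=0.9999, s_removed=0.01):
    """Fraction of surviving CNTs that are metallic after selective removal."""
    m_left = grown_metallic * (1 - m_removed)          # metallic CNTs kept
    s_left = (1 - grown_metallic) * (1 - s_removed)    # semiconducting CNTs kept
    return m_left / (m_left + s_left)

# Metallic CNTs drop from ~1 in 3 as grown to roughly 1 in 20,000.
print(remaining_metallic_fraction())
```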

Most Importantly

- VLSI processing: no per-unit customization
- VLSI design: imperfection-immune CNT library

First Sub-system: ISSCC Demo

- ISSCC Jack Raper Outstanding Technology Directions Paper
- Sacha: CNT-controlled hand-shaking robot, driven by a wafer with CNFET circuits
[Shulaker ISSCC 13, IEEE JSSC 14] Collaborator: Prof. G. Gielen, KU Leuven

CNT Computer

- Turing-complete processor built entirely from CNFETs
- Instruction fetch, data fetch, ALU, write-back
- Prof. M. Shulaker, MIT
[Shulaker Nature 13]

System Demos


Reproducible Results

- 80 ALUs (~1,600 CNFETs)
- 200 D-latches (~1,800 CNFETs)
- Measured waveforms overlaid

High-Performance CNFETs

- Current drive set by: doping, dielectric interactions, contact resistance, scaling

High-Performance CNFETs

- High-density CNTs (> 100 CNTs/µm): a major challenge
- New result: > 100 CNTs/µm with controlled variations; record ION density
[Figure: ION (µA/µm), CNFET (Stanford lab) vs. Si FET (foundries)]
[Shulaker IEDM 14]

CNT Variations NOT a Showstopper

- Co-optimize processing & design: m-CNT removal, special layouts, CNT spacing, sizing
- Optimized guidelines trade off energy, delay, and noise margin
- Circuit energy penalty and circuit delay penalty: < 5%
[Zhang IEDM 11, Hills IEEE TCAD 15]
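As a hedged illustration of one variation source that the CNT-spacing and sizing knobs address: the number of CNTs under a gate can be modeled as Poisson-distributed. The 125 CNTs/µm density and the gate widths below are assumptions for illustration, not the papers' parameters.

```python
import math

def p_open(cnts_per_um, gate_width_um):
    """Poisson probability that a gate of this width covers zero CNTs."""
    return math.exp(-cnts_per_um * gate_width_um)

# Upsizing a transistor raises its expected CNT count and exponentially
# suppresses the chance of a non-functional (zero-CNT) device -- one reason
# sizing appears among the optimization knobs above.
for w_um in (0.032, 0.064, 0.128):
    print(w_um, p_open(125, w_um))
```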

10X EDP, BUT

How can we do better?

Abundant-Data Applications

- Huge memory wall for processors and accelerators
- Energy measurements (Intel performance counter monitors; 2 CPUs, 8 cores/CPU, 128 GB DRAM):
  - Genomics classification: 5% compute, 95% memory
  - Natural language processing: 18% compute, 82% memory
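These breakdowns explain why compute-only improvements hit a wall. A small Amdahl's-law-style sketch, applied here to energy (the function and speedup values are illustrative, not from the measurements):

```python
def best_overall_gain(memory_fraction, compute_improvement=float("inf")):
    """Overall energy gain when only the compute share improves."""
    compute_fraction = 1.0 - memory_fraction
    return 1.0 / (memory_fraction + compute_fraction / compute_improvement)

# With 95% of energy spent in memory (the genomics case), even infinitely
# better compute yields at most ~1.05X overall -- the memory wall in one line.
print(best_overall_gain(0.95))
```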

N3XT: Nano-Engineered Computing Systems Technology

[Aly IEEE Computer 15]

N3XT Nanosystems

- Computation immersed in memory
- Fine-grained, ultra-dense 3D: memory interleaved with computing logic
- Increased functionality
- Impossible with today's technologies

N3XT Computation Immersed in Memory

Layer stack, linked by ultra-dense, fine-grained vias (no TSVs) with thermal paths between layers:
- 3D resistive RAM: massive storage
- 1D CNFET / 2D FET logic: compute, RAM access
- STT MRAM: quick access
- 1D CNFET / 2D FET logic: compute, RAM access
- 1D CNFET / 2D FET logic: compute, power, clock
- Silicon-compatible substrate

Many Nano-scale Innovations

- Memory & logic devices: 3D resistive RAM (RRAM); 2D FETs from large-area monolayer MoS2 (< 1 nm thick)
- Embedded cooling (30 µm thick): phase change suppresses hotspots; vertical metal nanowire arrays

3D Integration

- Massive ILV density >> TSV density
- Chip stacking relies on through-silicon vias (TSVs); dense (e.g., monolithic) integration uses nano-scale inter-layer vias (ILVs)
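The density gap can be made concrete with a pitch argument: via density scales as 1/pitch². The pitches below are ballpark assumptions for illustration (a few µm for TSVs, ~100 nm for nano-scale ILVs), not values from the slides.

```python
def vias_per_mm2(pitch_um):
    """Vias per mm^2 for a square grid at the given pitch (in um)."""
    return (1000.0 / pitch_um) ** 2

tsv_density = vias_per_mm2(5.0)   # TSV, chip stacking
ilv_density = vias_per_mm2(0.1)   # nano-scale ILV, monolithic 3D
print(ilv_density / tsv_density)  # ~2,500X denser at these assumed pitches
```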

Realizing Monolithic 3D

- Requires low-temperature fabrication: < 400 °C

First CNT Monolithic 3D ICs

- Nano-scale vias, no TSVs
- Process temperature < 250 °C
- Inter-layer digital circuits; 3-layer integration
[Figure: measured VOUT vs. VIN transfer curve]
[Wei IEDM 13, Shulaker VLSI Tech 14]

Device + Architecture Benefits

- Naturally enabled: emerging logic + emerging memory + monolithic 3D integration
[Figure: RRAM stack with top electrode, metal oxide, bottom electrode]

3D NanoSystem

- Wafer-scale design + fabrication
[Unpublished]

Unique N3XT Technology

- End-to-end: isolated improvements are inadequate
- Existing efforts: chip stacking, memories, new apps
- N3XT: 1D/2D FETs, RRAM, MRAM; new 3D fabrication; nanoscale cooling; architecture & software; yield, reliability; abundant-data apps

Complement with Software Solutions

- DSL compiler (DSL = domain-specific language); co-optimized software + hardware
- Learning: key architectural concept
- Cross-layer resilience, runtime optimization: yield, reliability

N3XT Framework

- Heterogeneous nanotechnologies
- Architecture design space
- Physical design
- Integrated thermal analysis
- Yield, reliability

Sweet Spot: Abundant-Data Apps.

IBM graph analytics, 2D baseline vs. single-chip N3XT:
- 2D: 64 processor cores, SRAM cache, DDR3 interface, 64 GB off-chip DRAM
- N3XT: 64 processor cores, STT-MRAM cache, simple interface, 64 GB on-chip 3D RRAM

Sweet Spot: Abundant-Data Apps.

- IBM graph analytics: ~1,000X benefits, software programmable
[Chart: energy x execution-time benefits. PageRank: 851x; SSSP: 656x; Connected Components: 400x; BFS: 510x; Logistic Regression: 700x; Linear Regression: 970x]
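The per-app numbers are energy times execution-time products. A one-line check against the PageRank breakdown reported on the following slide (37X energy, 23X execution time):

```python
# The headline per-app benefit is an EDP-style product of the energy gain
# and the execution-time gain; for PageRank: 37X energy x 23X time.
energy_gain, time_gain = 37, 23
print(energy_gain * time_gain)  # 851
```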

Sweet Spot: Abundant-Data Apps.

- IBM graph analytics, PageRank app: 851X benefits
- Energy: 37X reduction (N3XT at 2.7% of the 2D baseline)
- Execution time: 23X reduction (N3XT at 4.3% of the 2D baseline)
[Chart: normalized energy and execution time, 2D vs. N3XT]

Sweet Spot: Abundant-Data Apps.

- IBM graph analytics, PageRank app: 851X benefits (energy: 37X; execution time: 23X)
- Breakdown into processor-active, processor-stall, and memory-access components: in N3XT, stalls and memory accesses shrink to a few percent of the 2D baseline
[Chart: normalized energy and execution-time breakdown, 2D vs. N3XT]

More Opportunities

- Specialization
- Neuro-inspired computing
- Technology innovations

Students, Sponsors & Collaborators


[Figure: roadmap from commodity hardware (today) toward N3XT (the day after tomorrow)]

Conclusion

- Nanosystems today: compute + memory + sensing, densely interwoven
- Game ON, to the N3XT 1,000X era
