WITH TORCH + AUTOGRAD
BAY AREA DEEP LEARNING SCHOOL

ALEX WILTSCHKO
RESEARCH ENGINEER
TWITTER
ADVANCED TECHNOLOGY GROUP
@AWILTSCH
#DLSCHOOL

MATERIAL DEVELOPED WITH
SOUMITH CHINTALA
HUGO LAROCHELLE
RYAN ADAMS

MATERIAL AVAILABLE AT
GITHUB.COM/ALEXBW/BAYAREA-DL-SUMMERSCHOOL
WHAT IS TORCH?

An array-programming library for Lua; it looks a lot like NumPy and MATLAB.
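A quick taste of that array style, as a sketch in Lua:

    x = torch.rand(4, 6)               -- 4x6 matrix of uniform random numbers
    y = x:sum(2)                       -- row sums, a 4x1 tensor
    z = x * torch.ones(6)              -- matrix-vector product, a length-4 vector
    print(x:mean(), x:max())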
150+ Tensor functions
Linear algebra
Convolutions
BLAS
Tensor manipulation
Fully documented: https://github.com/torch/torch7/tree/master/doc
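A few of those functions in action (illustrative sketch):

    a = torch.randn(3, 3)
    u, s, v = torch.svd(a)             -- linear algebra (LAPACK under the hood)
    b = torch.mm(a, a:t())             -- BLAS matrix-matrix multiply
    c = a:view(9):narrow(1, 1, 4)      -- tensor manipulation: reshape, then slice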
Inline help in the interpreter, and good docs online: http://torch.ch/docs/
WHAT IS LUA?

An embeddable scripting language, used in:
World of Warcraft
Adobe Lightroom
Redis
nginx
TORCH: ARITHMETIC

Lots of functions that can operate on tensors: all the basics, slicing, BLAS, LAPACK, Cephes, random number generation.
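For instance (a small sketch):

    x = torch.range(1, 10)
    y = x + 10                         -- element-wise arithmetic
    z = x[{{2, 5}}]                    -- slicing: elements 2 through 5
    m = torch.mv(torch.randn(3, 10), x)   -- BLAS matrix-vector product
    r = torch.rand(5)                  -- random number generation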
Boolean functions
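These return ByteTensor masks of 0s and 1s, e.g.:

    a = torch.randn(5)
    mask = a:gt(0)                     -- 1 where a > 0, else 0
    positives = a[mask]                -- masked selection
    print(mask:sum())                  -- how many positives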
Special functions: http://deepmind.github.io/torch-cephes/
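A sketch of usage, assuming the torch-cephes rock is installed; the function names mirror the underlying Cephes C library:

    cephes = require 'cephes'
    print(cephes.gamma(4))             -- gamma function: Γ(4) = 6
    print(cephes.ndtr(0))              -- standard normal CDF at 0: 0.5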
COMMUNITY

Code for cutting-edge models shows up for Torch very quickly.
Industry: TensorFlow, Caffe
Stability
Scale & speed
Data integration
Relatively fixed

Research: Torch, Neon, Theano
Flexible
Fast iteration
Debuggable
Relatively bare-bones
CORE PHILOSOPHY
Interactive computing
Imperative programming
Minimal abstraction
No compilation time
Thinking linearly
Maximal flexibility
TENSORS AND STORAGES

Row-major in memory.
size: 4 x 6
stride: 6 x 1
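Checking this from the REPL:

    t = torch.range(1, 24):view(4, 6)  -- 4x6, row-major
    print(t:size())                    -- 4 6
    print(t:stride())                  -- 6 1: stepping one row skips 6 storage slots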
Tensors are 1-indexed.

A slice into the same storage:
size: 6
stride: 1
offset: 13
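That is exactly what selecting the third row of the 4x6 tensor above produces:

    row = t:select(1, 3)
    print(row:size())                  -- 6
    print(row:stride())                -- 1
    print(row:storageOffset())         -- 13: row 3 starts at storage element 13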
Underlying storage is shared between a tensor and its views.
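So writes through a view are visible through the original:

    row:fill(0)                        -- zero the view from the previous example
    print(t[3]:sum())                  -- 0: t sees the change
    print(torch.pointer(t:storage()) == torch.pointer(row:storage()))  -- true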
require 'cutorch'
torch.CudaTensor = torch.FloatTensor on the GPU
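A sketch of moving work to the GPU (requires a CUDA device):

    require 'cutorch'
    a = torch.FloatTensor(1000, 1000):uniform()
    b = a:cuda()                       -- copy to the GPU as a CudaTensor
    c = b * b                          -- matrix multiply runs on the GPU
    d = c:float()                      -- copy the result back to the CPU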
TRAINING CYCLE

Moving parts:
threads (data loading)
nn (forward / backward)
optim (parameter updates)
THE NN PACKAGE

Compose networks like Lego blocks.
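For example, a small classifier assembled block by block (sizes are illustrative):

    require 'nn'
    model = nn.Sequential()
    model:add(nn.Linear(784, 128))
    model:add(nn.ReLU())
    model:add(nn.Linear(128, 10))
    model:add(nn.LogSoftMax())
    criterion = nn.ClassNLLCriterion()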
CUDA backend via the cunn package.
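Moving the same model to CUDA is one call per object (assumes CUDA hardware):

    require 'cunn'
    model = model:cuda()
    criterion = criterion:cuda()
    out = model:forward(torch.CudaTensor(1, 784):uniform())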
TORCH-AUTOGRAD, BY TWITTER
THE OPTIM PACKAGE

A purely functional view of the world.

Collecting the parameters of your neural net: substitute each module's weights and biases with views into one large tensor, so the weights and biases point to parts of this tensor.
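A sketch of one SGD step; model and criterion are from the nn example above, and input/target stand in for a minibatch you supply:

    require 'optim'
    params, gradParams = model:getParameters()   -- views into one flat tensor each
    config = {learningRate = 0.01}

    function feval(p)                  -- optim's functional interface:
      gradParams:zero()                -- given params, return loss and gradient
      local out = model:forward(input)
      local loss = criterion:forward(out, target)
      model:backward(input, criterion:backward(out, target))
      return loss, gradParams
    end

    optim.sgd(feval, params, config)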
TORCH-AUTOGRAD

Industrial-strength, extremely flexible automatic differentiation, for all your crazy ideas.
LINEAR ALGEBRA: COMMON SUBROUTINES

BLAS (est. 1957)
LINPACK (est. 1979, now on GitHub!)
LAPACK (est. 1984)
Left to right (forward mode):

a = 3                                 dada = 1
b = 2                                 dbda = 0
c = 1                                 dcda = 0
d = a * math.sin(b) = 2.728           ddda = math.sin(b) = 0.909
return 2.728                          return 0.909
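A minimal forward-mode sketch in plain Lua; dual, mul and sin are made-up helper names, not an autograd API:

    local function dual(v, dv) return {v = v, dv = dv} end

    local function mul(x, y)           -- product rule
      return dual(x.v * y.v, x.dv * y.v + x.v * y.dv)
    end

    local function sin(x)              -- chain rule through math.sin
      return dual(math.sin(x.v), math.cos(x.v) * x.dv)
    end

    local a = dual(3, 1)               -- seed: dada = 1
    local b = dual(2, 0)               -- dbda = 0
    local d = mul(a, sin(b))
    print(d.v, d.dv)                   -- 2.728  0.909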
Right to left (reverse mode):

First run forward, keeping the trace:
a = 3
b = 2
c = 1
d = a * math.sin(b) = 2.728
return 2.728

Then walk the trace backward:
dddd = 1
ddda = dddd * math.sin(b) = 0.909
return 0.909, 2.728
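The same computation as a tiny hand-written reverse pass (illustrative only):

    local a, b = 3, 2
    local sb = math.sin(b)             -- forward pass, keeping intermediates
    local d = a * sb

    local dd = 1                       -- seed: dddd = 1
    local da = dd * sb                 -- adjoint of a through the multiply
    local db = dd * a * math.cos(b)    -- adjoint of b through sin and the multiply
    print(d, da, db)                   -- 2.728  0.909  -1.248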
A TRAINABLE NEURAL NETWORK IN TORCH-AUTOGRAD

Any numeric function can go here; the prediction and loss functions are split only for clarity. grad() is the API, and the parameters are updated directly from the returned gradients.
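A sketch of what that slide's code looks like; the layer sizes, learning rate, and data here are placeholders:

    local grad = require 'autograd'

    local function predict(params, x)  -- any numeric function can go here
      return torch.tanh(x * params.W + params.b)
    end

    local function loss(params, x, y)  -- split from predict only for clarity
      local yhat = predict(params, x)
      return torch.sum(torch.pow(y - yhat, 2))
    end

    local dloss = grad(loss)           -- this is the API

    local params = {W = torch.randn(100, 10), b = torch.randn(1, 10)}
    local x, y = torch.randn(1, 100), torch.randn(1, 10)

    local grads = dloss(params, x, y)  -- this is how the parameters are updated:
    params.W:add(-0.01, grads.W)
    params.b:add(-0.01, grads.b)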
[Computation graph for a linear model's loss: input, W and b feed mult and add; sub (against target), sq, and sum produce the scalar loss.]
WHAT'S ACTUALLY HAPPENING?

When it comes time to evaluate partial derivatives, we just have to look up the partial derivatives from a table. We can then calculate the derivative of the loss w.r.t. the inputs via the chain rule!
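Conceptually, something like this table of partials sits behind the scenes (illustrative, not autograd's internals):

    partials = {
      mult = function(x, y) return y, x end,   -- d(x*y)/dx, d(x*y)/dy
      sin  = function(x) return math.cos(x) end,
    }
    -- for d = a * sin(b): dd/da = select(1, partials.mult(a, math.sin(b))) = sin(b)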
AUTOGRAD EXAMPLES
Autograd gives you derivatives of numeric code, without a special mini-language
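A sketch of the flavor:

    local grad = require 'autograd'
    local f = function(x) return torch.sum(torch.cmul(x, x)) end
    local df = grad(f)                 -- df(x) returns dloss/dx
    local g = df(torch.ones(3))        -- a tensor of 2s, since d/dx sum(x^2) = 2x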
AUTOGRAD EXAMPLES
Control flow, like if-statements, is handled seamlessly.
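For instance, branching on a plain Lua value (sketch):

    local grad = require 'autograd'
    local function f(x, c)
      if c > 0 then
        return torch.sum(torch.cmul(x, x))
      else
        return torch.sum(x * 3)
      end
    end
    local df = grad(f)
    local g = df(torch.ones(3), 1)     -- gradient of the branch taken: 2*x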
AUTOGRAD EXAMPLES
Scalars are good for demonstration, but autograd is most often used with tensor types
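Same API, tensor-valued (sketch):

    local grad = require 'autograd'
    local function f(W, x) return torch.sum(torch.mm(x, W)) end
    local gW = grad(f)(torch.randn(5, 3), torch.ones(2, 5))  -- gradient has W's shape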
AUTOGRAD EXAMPLES
Autograd shines if you have dynamic compute graphs
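For example, when the graph's depth depends on the data (sketch):

    local grad = require 'autograd'
    local function f(x, nSteps)
      local h = x
      for i = 1, nSteps do
        h = torch.cmul(h, x)           -- the graph grows with nSteps
      end
      return torch.sum(h)
    end
    local g = grad(f)(torch.ones(2):mul(2), 3)   -- d/dx sum(x^4) = 4x^3 = 32 each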
AUTOGRAD EXAMPLES
Recursion is no problem.
Write numeric code as you ordinarily would; autograd handles the gradients.
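A recursive power function, differentiated directly (sketch):

    local grad = require 'autograd'
    local function power(x, n)
      if n == 1 then return x end
      return torch.cmul(x, power(x, n - 1))
    end
    local function f(x) return torch.sum(power(x, 3)) end
    local g = grad(f)(torch.ones(2):mul(2))      -- 3*x^2 = 12 each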
AUTOGRAD EXAMPLES

Need new or tweaked partial derivatives? Not a problem.
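Conceptually, you supply the forward function and the partial derivative you want autograd to use; the registration call below is a made-up stand-in, not torch-autograd's real API (see its README for that):

    -- a numerically safer log, with the clamp ignored in the derivative
    local function clampedLog(x)
      return torch.log(torch.clamp(x, 1e-6, math.huge))
    end
    local function clampedLogGrad(g, ans, x)
      return torch.cdiv(g, x)          -- treat d/dx as 1/x everywhere
    end
    -- defineGradient(clampedLog, clampedLogGrad)   -- hypothetical registration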
Whole-Model, Layer-Based, Full Autodiff: three granularities over the same computation graph (mult, add, sub, sq, sum).

Whole-Model: scikit-learn
Layer-Based: Torch NN, Keras, Lasagne
Full Autodiff: Autograd, Theano, TensorFlow
[The linear-model computation graph again: input, W and b feed mult and add; sub (against target), sq, and sum produce the loss.]
Explicit: NN, Caffe (no compiler optimizations, no dynamic graphs)
Ahead-of-Time: TensorFlow, Theano
Just-in-Time: Autograd, Chainer (no compiler optimizations)
Static Dataflow: NN, Caffe
Hybrid: TensorFlow, Theano
JIT Dataflow: Autograd, Chainer
Sea of Nodes (Click & Paleczny, 1995): a possible hybrid design, still an open question (?)
[The granularity spectrum once more: whole-model, layer-based, full autodiff over the same graph.]

IMPACT AT TWITTER

Prototyping without fear.
Checkpointing: don't save all of the intermediate values; recompute them when you need them. This saves memory, can even be a speedup if recompute is faster than load/store, and is especially attractive with pointwise functions like ReLU. MXNet, I think, is the first to implement this generally for neural nets.

Source-to-source: all neural-net autodiff packages today do either AOT or JIT graph construction, with operator overloading. The original autodiff (in FORTRAN, in the '80s) was source transformation, still considered the gold standard for performance in the autodiff field. The challenge there is control flow.
QUESTIONS?
Happy to help. Find me at:
@awiltsch
awiltschko@twitter.com
github.com/alexbw