Lecture 14a
Learning layers of features by stacking RBMs
Geoffrey Hinton
with Nitish Srivastava, Kevin Swersky, Tijmen Tieleman, Abdel-rahman Mohamed
[Figure: Stacking two RBMs. Train the first RBM (weights W1) on the data v, giving hidden units h1. Then copy the binary state of h1 for each v and use it as the data for a second RBM (weights W2, hidden units h2). Finally, compose the two RBM models to make a single DBN model with layers v, h1, h2.]
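A minimal numpy sketch of this greedy, layer-by-layer procedure. This is my own illustration, not code from the lecture: `train_rbm` is a hypothetical helper that runs CD-1 without biases, and random bits stand in for real training data.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=10, lr=0.1, rng=None):
    """Hypothetical helper: train a binary RBM with CD-1.
    Biases are omitted to keep the sketch short."""
    rng = rng if rng is not None else np.random.default_rng(0)
    W = 0.01 * rng.standard_normal((data.shape[1], n_hidden))
    for _ in range(epochs):
        # Positive phase: hidden probabilities and sampled binary states.
        h0_prob = sigmoid(data @ W)
        h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)
        # Negative phase: one down-up reconstruction step.
        v1_prob = sigmoid(h0 @ W.T)
        h1_prob = sigmoid(v1_prob @ W)
        # CD-1 weight update, averaged over the batch.
        W += lr * (data.T @ h0_prob - v1_prob.T @ h1_prob) / len(data)
    return W

rng = np.random.default_rng(0)
v = (rng.random((100, 784)) < 0.5).astype(float)  # stand-in for real images
# Train the first RBM on the data.
W1 = train_rbm(v, 500, rng=rng)
# Copy the binary state of h1 for each v ...
h1 = (rng.random((100, 500)) < sigmoid(v @ W1)).astype(float)
# ... and use it as the data for the second RBM.
W2 = train_rbm(h1, 500, rng=rng)
```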
It's not a Boltzmann machine!
[Figure: The resulting deep belief net built from three RBMs: data at the bottom, then h1 (weights W1), h2 (W2), and h3 (W3). The top two layers remain an undirected RBM; the layers below are a directed sigmoid belief net.]

$$p(v) \;=\; \sum_{h} p(h)\, p(v \mid h)$$
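Spelling this out for the three-hidden-layer model above (a standard DBN identity, written out here for clarity rather than taken from the slide):

$$p(v) \;=\; \sum_{h_1, h_2, h_3} p(h_2, h_3)\; p(h_1 \mid h_2)\; p(v \mid h_1)$$

where $p(h_2, h_3)$ is the undirected RBM at the top and the conditional factors are the directed sigmoid layers below it.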
[Figure (shown twice): the DBN used for digit recognition. A 28 x 28 pixel image connects to 500 units, then another 500 units, then a top layer of 2000 units trained jointly with 10 label units.]
[Figure: error rates for a succession of methods on the digit task: 1.6%, 1.5%, 1.4%, 1.25%, 1.15%, 1.0%, 0.49%.]
Unsupervised layer-by-layer pre-training followed by backprop: 0.39% (the record at the time).
http://www.bbc.co.uk/news/technology-20266427
[Figure: results before fine-tuning and after fine-tuning (AISTATS 2009).]
Effect of Depth
[Plot: performance as a function of the number of layers, with pre-training and without pre-training.]
[Figure: two causal diagrams. In one, the label comes directly from the image through a low-bandwidth connection. In the other, underlying "stuff" generates the image through a high-bandwidth connection and also determines the label, so it pays to infer the stuff from the image first.]
$$E(v, h) \;=\; \sum_{i \in \text{vis}} \frac{(v_i - b_i)^2}{2\sigma_i^2} \;-\; \sum_{j \in \text{hid}} b_j h_j \;-\; \sum_{i,j} \frac{v_i}{\sigma_i}\, h_j\, w_{ij}$$

The first term is a parabolic containment function that keeps each visible unit close to its bias; the last term is the energy-gradient produced by the total input to a visible unit.
Gaussian-Binary RBMs

Lots of people have failed to get these to work properly. It's extremely hard to learn tight variances for the visible units. It took a long time for us to figure out why it is so hard to learn the visible variances.

When sigma is small, we need many more hidden units than visible units. This allows small weights to produce big top-down effects.
[Figure: a hidden unit $j$ and a visible unit $i$. The bottom-up effect of $i$ on $j$ scales as $w_{ij}/\sigma_i$, while the top-down effect of $j$ on $i$ scales as $\sigma_i w_{ij}$.]

When sigma is much less than 1, the bottom-up effects are too big and the top-down effects are too small.
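A small numpy sketch of this energy function, written for illustration (the function name and toy values are mine, not from the lecture):

```python
import numpy as np

def gaussian_binary_rbm_energy(v, h, W, b_vis, b_hid, sigma):
    """E(v,h) for a Gaussian-binary RBM: parabolic containment of the
    visibles, minus the hidden-bias term, minus the interaction term
    (v_i / sigma_i) * h_j * w_ij."""
    containment = np.sum((v - b_vis) ** 2 / (2.0 * sigma ** 2))
    hidden_bias = np.dot(b_hid, h)
    interaction = np.dot(v / sigma, W @ h)
    return containment - hidden_bias - interaction

# Toy check with 3 visible and 2 hidden units.
rng = np.random.default_rng(0)
v = rng.standard_normal(3)
h = np.array([1.0, 0.0])
W = 0.1 * rng.standard_normal((3, 2))
print(gaussian_binary_rbm_energy(v, h, W, np.zeros(3), np.zeros(2), np.ones(3)))
```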
Fast approximations

$$\langle y \rangle \;=\; \sum_{n=1}^{\infty} \sigma(x + 0.5 - n) \;\approx\; \log(1 + e^{x})$$

A sum of many logistic units with identical weights and biases offset by fixed steps behaves like a smooth rectified linear unit. Rectified linear units are scale-equivariant:

$$R(a\,x) = a\,R(x) \;\; \text{for } a > 0, \quad \text{but} \quad R(a + b) \neq R(a) + R(b)$$
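A quick numeric check of both claims (my own sketch; the infinite sum is truncated at 1000 terms):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def stepped_sigmoid_sum(x, n_terms=1000):
    """Sum of logistic units with shared weights and offset biases:
    sum over n = 1..N of sigmoid(x + 0.5 - n)."""
    n = np.arange(1, n_terms + 1)
    return sigmoid(x + 0.5 - n).sum()

# The sum closely tracks the softplus log(1 + e^x).
for x in [-2.0, 0.0, 3.0]:
    print(x, stepped_sigmoid_sum(x), np.log1p(np.exp(x)))

# Scale equivariance of the rectified linear unit R(x) = max(0, x):
relu = lambda x: np.maximum(0.0, x)
a, x = 2.5, 1.3
print(relu(a * x), a * relu(x))                  # equal
print(relu(1.0 + -2.0), relu(1.0) + relu(-2.0))  # not equal: R is not additive
```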
[Figure: an infinite sigmoid belief net with tied weights. From the bottom up, the layers are v0, h0, v1, h1, v2, h2, ... continuing forever ("etc."). Each hidden layer generates the visible layer below it through W, and each visible layer generates the hidden layer below it through W^T; all the weights are tied.]
The learning rule for a sigmoid belief net is:

$$\Delta w_{ij} \;\propto\; s_j\,(s_i - p_i)$$

With all the weights tied, the updates contributed by successive layers of the infinite net form a series (writing $s^0, s^1, s^2, \dots$ for the sampled states at successive levels):

$$\Delta w_{ij} \;\propto\; s_j^0\,(s_i^0 - s_i^1) \;+\; s_i^1\,(s_j^0 - s_j^1) \;+\; s_j^1\,(s_i^1 - s_i^2) \;+\; \cdots$$

All but the first and last terms cancel, leaving $\Delta w_{ij} \propto s_j^0 s_i^0 - s_j^\infty s_i^\infty$, which is the Boltzmann machine learning rule.

[Figure: the infinite net annotated with the states $s_j^0, s_i^0, s_j^1, s_i^1, s_j^2, s_i^2$ at successive layers.]
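A minimal numpy sketch of the basic rule $\Delta w_{ij} \propto s_j (s_i - p_i)$ for one layer (my own illustration; the function name and toy sizes are assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sbn_layer_update(s_parent, s_child, W, lr=0.1):
    """One maximum-likelihood step for a sigmoid belief net layer.
    s_parent: sampled binary states s_j of the parent layer.
    s_child:  sampled binary states s_i of the child layer.
    p_i = sigmoid(sum_j s_j w_ij) is the probability the parents assign
    to each child, and dw_ij = lr * s_j * (s_i - p_i)."""
    p_child = sigmoid(s_parent @ W)
    return W + lr * np.outer(s_parent, s_child - p_child)

rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((4, 6))        # 4 parents, 6 children
s_parent = rng.integers(0, 2, 4).astype(float)
s_child = rng.integers(0, 2, 6).astype(float)
W = sbn_layer_update(s_parent, s_child, W)
```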
[Figure: learning with all the weights tied together. The infinite directed net with tied weights is equivalent to a single RBM between v0 and h0 with weights W.]
Then freeze the first layer of weights in both directions and learn the remaining weights (still tied together). This is equivalent to learning another RBM, using the aggregated posterior distribution of h0 as the data.

[Figure: the infinite net above v1 still has tied weights W and W^T; the first layer of weights below it is frozen in both directions (W frozen from h0 down to v0 and W^T frozen from v0 up to h0).]
If we stop after a single full step of the chain, only the first terms of the series survive, and the cross terms cancel:

$$\Delta w_{ij} \;\propto\; s_j^0\,(s_i^0 - s_i^1) \;+\; s_i^1\,(s_j^0 - s_j^1) \;=\; s_j^0 s_i^0 \;-\; s_i^1 s_j^1$$

This truncated rule is contrastive divergence.

[Figure: the bottom of the infinite net, showing the states $s_i^0, s_j^0, s_i^1, s_j^1$ used by the truncated update.]
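In code, this truncated rule is just CD-1 for a single training case. A tiny numpy sketch in the slide's notation (my own illustration; biases omitted):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, lr=0.1, rng=None):
    """One contrastive-divergence step:
    dw_ij = lr * (s_j^0 s_i^0 - s_i^1 s_j^1)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    h0 = (rng.random(W.shape[1]) < sigmoid(v0 @ W)).astype(float)    # s_j^0
    v1 = (rng.random(W.shape[0]) < sigmoid(h0 @ W.T)).astype(float)  # s_i^1
    h1 = sigmoid(v1 @ W)                                # s_j^1 (probabilities)
    return W + lr * (np.outer(v0, h0) - np.outer(v1, h1))

rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((6, 4))
v0 = rng.integers(0, 2, 6).astype(float)                             # s_i^0
W = cd1_update(v0, W, rng=rng)
```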