Chapter 15

Linear Factor Models and Auto-Encoders
Linear factor models are generative unsupervised learning models in which we
imagine that some unobserved factors h explain the observed variables x through
a linear transformation. Auto-encoders are unsupervised learning methods that
learn a representation of the data, typically obtained by a non-linear
parametric transformation of the data, i.e., from x to h, typically a
feedforward neural network, but not necessarily. They also learn a
transformation going backwards from the representation to the data, from h to
x, like the linear factor models. Linear factor models therefore only specify a
parametric decoder, whereas auto-encoders also specify a parametric encoder.
Some linear factor models, like PCA, actually correspond to an auto-encoder (a
linear one), but for others the encoder is implicitly defined via an inference
mechanism that searches for an h that could have generated the observed x.

The idea of auto-encoders has been part of the historical landscape of neural
networks for decades (LeCun, 1987; Bourlard and Kamp, 1988; Hinton and Zemel,
1994) but has really picked up speed in recent years. They remained somewhat
marginal for many years, in part due to what was an incomplete understanding of
the mathematical interpretation and geometrical underpinnings of auto-encoders,
which are developed further in Chapters 17 and 20.12.

An auto-encoder is simply a neural network that tries to copy its input to its
output. The architecture of an auto-encoder is typically decomposed into the
following parts, illustrated in Figure 15.1:

• an input, x

• an encoder function f

• a code or internal representation h = f(x)

• a decoder function g

• an output, also called reconstruction, r = g(h) = g(f(x))

• a loss function L computing a scalar L(r, x) measuring how good of a
  reconstruction r is of the given input x. The objective is to minimize the
  expected value of L over the training set of examples {x}.

Figure 15.1: General schema of an auto-encoder, mapping an input x to an output
(called reconstruction) r through an internal representation or code h. The
auto-encoder has two components: the encoder f (mapping x to h) and the decoder
g (mapping h to r).
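To make the decomposition above concrete, here is a minimal NumPy sketch of a
single-hidden-layer auto-encoder with a sigmoid encoder, a linear decoder and
the squared reconstruction error. The weight names, sizes and non-linearities
are illustrative assumptions, not the book's notation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_code = 10, 4                      # undercomplete: code smaller than input

W_enc = rng.normal(scale=0.1, size=(n_code, n_in))
b_enc = np.zeros(n_code)
W_dec = rng.normal(scale=0.1, size=(n_in, n_code))
b_dec = np.zeros(n_in)

def f(x):
    """Encoder: maps the input x to the code h."""
    return 1.0 / (1.0 + np.exp(-(W_enc @ x + b_enc)))

def g(h):
    """Decoder: maps the code h to the reconstruction r."""
    return W_dec @ h + b_dec

def L(r, x):
    """Squared reconstruction error."""
    return 0.5 * np.sum((r - x) ** 2)

x = rng.normal(size=n_in)
h = f(x)            # code / internal representation
r = g(h)            # reconstruction
print(L(r, x))      # scalar loss, to be minimized in expectation over the training set
```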
15.1 Regularized Auto-Encoders
Predicting the input may sound useless: what could prevent the auto-encoder
from simply copying its input into its output? In the 20th century, this was
achieved by constraining the architecture of the auto-encoder to avoid this, by
forcing the dimension of the code h to be smaller than the dimension of the
input x.

Figure 15.2 illustrates the two typical cases of auto-encoders: undercomplete
(with the dimension of the representation h smaller than the dimension of the
input x), and overcomplete (with the dimension of h larger than that of x).
Whereas early work with auto-encoders, just like PCA, uses an undercomplete
bottleneck in the sequence of layers to avoid learning the identity function,
more recent work allows overcomplete representations. What we have learned in
recent years is that it is possible to make the auto-encoder meaningfully
capture the structure of the input distribution even if the representation is
overcomplete, with other forms of constraint or regularization. In fact, once
you realize that auto-encoders can capture the input distribution (indirectly,
not as an explicit probability function), you also realize that it should need
more capacity as one increases the complexity of the distribution to be
captured (and the amount of data available): it should not be limited by the
input dimension. This is a problem in particular with shallow auto-encoders,
which have a single hidden layer (for the code). Indeed, that hidden layer size
controls both the dimensionality reduction constraint (the code size at the
bottleneck) and the capacity (which allows one to learn a more complex
distribution).

Figure 15.2: Left: undercomplete representation (dimension of code h is less
than dimension of input x). Right: overcomplete representation. Overcomplete
auto-encoders require some other form of regularization (instead of the
constraint on the dimension of h) to avoid the trivial solution where r = x for
all x.

Besides the bottleneck constraint, alternative constraints or regularization
methods have been explored and can guarantee that the auto-encoder does
something useful and not just learn some trivial identity-like function:

• Sparsity of the representation or of its derivative: even if the
  intermediate representation has a very high dimensionality, the effective
  local dimensionality (number of degrees of freedom that capture a coordinate
  system among the probable x's) could be much smaller if most of the elements
  of h are zero (or any other constant, such that ||∂h_i/∂x|| is close to
  zero). When ||∂h_i/∂x|| is close to zero, h_i does not participate in
  encoding local changes in x. There is a geometrical interpretation of this
  situation in terms of manifold learning that is discussed in more depth in
  Chapter 17. The discussion in Chapter 16 also explains how an auto-encoder
  naturally tends towards learning a coordinate system for the actual factors
  of variation in the data. At least four types of auto-encoders clearly fall
  in this category of sparse representation:

  – Sparse coding (Olshausen and Field, 1996) has been heavily studied as an
    unsupervised feature learning and feature inference mechanism. It is a
    linear factor model rather than an auto-encoder, because it has no
    explicit parametric encoder, and instead uses an iterative optimization
    procedure to compute the maximally likely code. Sparse coding looks for
    representations that are both sparse and explain the input through the
    decoder. Instead of the code being a parametric function of the input, it
    is considered like a free variable that is obtained through an
    optimization, i.e., a particular form of inference:

      h = f(x) = arg min_h L(g(h), x) + Ω(h)                        (15.1)

    where L is the reconstruction loss, f the (non-parametric) encoder, g the
    (parametric) decoder, Ω(h) is a sparsity regularizer, and in practice the
    minimization can be approximate. Sparse coding has a manifold or geometric
    interpretation that is discussed in Section 15.8. It also has an
    interpretation as a directed graphical model, described in more detail in
    Section 19.3. To achieve sparsity, the objective function to optimize
    includes a term that is minimized when the representation has many zero or
    near-zero values, such as the L1 penalty |h|_1 = Σ_i |h_i|.

  – An interesting variation of sparse coding combines the freedom to choose
    the representation through optimization and a parametric encoder. It is
    called predictive sparse decomposition (PSD) (Kavukcuoglu et al., 2008a)
    and is briefly described in Section 15.8.2.

  – At the other end of the spectrum are simply sparse auto-encoders, which
    combine with the standard auto-encoder schema a sparsity penalty which
    encourages the output of the encoder to be sparse. These are described in
    Section 15.8.1. Besides the L1 penalty, other sparsity penalties that have
    been explored include the Student-t penalty (Olshausen and Field, 1996;
    Bergstra, 2011),

      Σ_i log(1 + h_i²)

    (i.e. where h_i has a Student-t prior density) and the KL-divergence
    penalty (Lee et al., 2008; Goodfellow et al., 2009; Larochelle and Bengio,
    2008a)

      −Σ_i (t log h_i + (1 − t) log(1 − h_i)),

    with a target sparsity level t, for h_i in (0, 1), e.g. through a sigmoid
    non-linearity (see the sketch following this list).

  – Contractive autoencoders (Rifai et al., 2011b), covered in Section 15.10,
    explicitly penalize ||∂h/∂x||², i.e., the sum of the squared norms of the
    vectors ∂h_i(x)/∂x (each indicating how much each hidden unit h_i responds
    to changes in x and what direction of change in x that unit is most
    sensitive to, around a particular x). With such a regularization penalty,
    the auto-encoder is called contractive because the mapping from input x to
    representation h is encouraged to be contractive, i.e., to have small
    derivatives in all directions. Note that a sparsity regularization
    indirectly leads to a contractive mapping as well, when the non-linearity
    used happens to have a zero derivative at h_i = 0 (which is the case for
    the sigmoid non-linearity). A sketch of this penalty also follows the
    list below.

• Robustness to injected noise or missing information: if noise is injected
  in inputs or hidden units, or if some inputs are missing, while the neural
  network is asked to reconstruct the clean and complete input, then it cannot
  simply learn the identity function. It has to capture the structure of the
  data distribution in order to optimally perform this reconstruction. Such
  auto-encoders are called denoising auto-encoders and are discussed in more
  detail in Section 15.9.
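As referenced in the sparsity and contractivity items above, here is a minimal
NumPy sketch of two of these regularizers for a sigmoid encoder
h = sigmoid(Wx + b): the KL-style sparsity penalty with target level t, and the
contractive (squared Jacobian norm) penalty. The function names and the use of
a single example (rather than a minibatch average) are illustrative
assumptions; in practice either penalty would be added, with a coefficient, to
the reconstruction loss.

```python
import numpy as np

def sigmoid_code(x, W, b):
    """Sigmoid encoder h = sigmoid(W x + b)."""
    return 1.0 / (1.0 + np.exp(-(W @ x + b)))

def kl_sparsity_penalty(h, t=0.05, eps=1e-8):
    """-sum_i (t*log(h_i) + (1-t)*log(1-h_i)): minimized (up to a constant)
    when each h_i matches the target sparsity level t in (0, 1)."""
    h = np.clip(h, eps, 1.0 - eps)
    return -np.sum(t * np.log(h) + (1.0 - t) * np.log(1.0 - h))

def contractive_penalty(x, W, b):
    """||dh/dx||^2 for the sigmoid encoder: the Jacobian is diag(h*(1-h)) @ W,
    so its squared Frobenius norm is sum_i (h_i*(1-h_i))^2 * ||W_i||^2."""
    h = sigmoid_code(x, W, b)
    return np.sum((h * (1.0 - h)) ** 2 * np.sum(W ** 2, axis=1))

# Example usage on random parameters and a random input.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(8, 5))
b = np.zeros(8)
x = rng.normal(size=5)
h = sigmoid_code(x, W, b)
print(kl_sparsity_penalty(h), contractive_penalty(x, W, b))
```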

15.2 Denoising Auto-encoders

There is a tight connection between the denoising auto-encoders and the
contractive auto-encoders: it can be shown (Alain and Bengio, 2013) that in the
limit of small Gaussian injected input noise, the denoising reconstruction
error is equivalent to a contractive penalty on the reconstruction function
that maps x to r = g(f(x)). In other words, since both x and x + ε (where ε is
some small noise vector) must yield the same target output x, the
reconstruction function is encouraged to be insensitive to changes in all
directions ∂r/∂x. (A function f(·) is called contractive if
||f(x) − f(y)|| ≤ ||x − y|| for nearby x and y, or equivalently if its
derivative satisfies ||f'(x)|| ≤ 1.) The only thing that prevents the
reconstruction r from simply being a constant (completely insensitive to the
input x) is that one also has to reconstruct correctly for different training
examples x. However, the auto-encoder can learn to be approximately constant
around training examples x while producing a different answer for different
training examples. As discussed in Section 17.4, if the examples are near a
low-dimensional manifold, this encourages the representation to vary only on
the manifold and be locally constant in directions orthogonal to the manifold,
i.e., the representation locally captures a (not necessarily Euclidean, not
necessarily orthogonal) coordinate system for the manifold. In addition to the
denoising auto-encoder, the variational auto-encoder (Section 20.9.3) and the
generative stochastic networks (Section 20.12) also involve the injection of
noise, but typically in the representation-space itself, thus introducing the
notion of h as a latent variable.
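A minimal sketch of the denoising criterion discussed above: the input is
corrupted with small Gaussian noise, and the squared error is measured against
the clean input. The tanh encoder, linear decoder and noise level are
illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoising_loss(x, W_enc, b_enc, W_dec, b_dec, sigma=0.1):
    """Corrupt x, encode the corrupted version, and reconstruct the CLEAN x."""
    x_tilde = x + sigma * rng.normal(size=x.shape)   # injected input noise
    h = np.tanh(W_enc @ x_tilde + b_enc)             # code computed from the noisy input
    r = W_dec @ h + b_dec                            # reconstruction
    return 0.5 * np.sum((r - x) ** 2)                # target is the uncorrupted x
```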
Pressure of a Prior on the Representation: an interesting way to generalize the
notion of regularization applied to the representation is to introduce in the
cost function for the auto-encoder a log-prior term

  − log P(h)

which captures the assumption that we would like to find a representation that
has a simple distribution (if P(h) has a simple form, such as a factorized
distribution; note that all the sparse priors we have described correspond to a
factorized distribution), or at least one that is simpler than the original
data distribution. Among all the encoding functions f, we would like to pick
one that

  1. can be inverted (easily), and this is achieved by minimizing some
     reconstruction loss, and

  2. yields representations h whose distribution is simpler, i.e., can be
     captured with less capacity than the original training distribution
     itself.

The sparse variants described above clearly fall in that framework. The
variational auto-encoder (Section 20.9.3) provides a clean mathematical
framework for justifying the above pressure of a top-level prior when the
objective is to model the data generating distribution.

From the point of view of regularization (Chapter 7), adding the − log P(h)
term to the objective function (e.g. for encouraging sparsity) or adding a
contractive penalty does not fit the traditional view of a prior on the
parameters. Instead, the prior on the latent variables acts like a
data-dependent prior, in the sense that it depends on the particular values h
that are going to be sampled (usually from a posterior or an encoder), based on
the input example x. Of course, indirectly, this is also a regularization on
the parameters, but one that depends on the particular data distribution.
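A sketch of this kind of criterion for one training example, assuming a squared
reconstruction error and a generic, caller-supplied log-prior on the code; the
function names and the Gaussian example prior are illustrative assumptions.

```python
import numpy as np

def regularized_ae_cost(x, f, g, log_prior):
    """Cost for one example: reconstruction loss plus the pressure of a prior
    on the representation, i.e. L(g(f(x)), x) - log P(h)."""
    h = f(x)                                   # code / representation
    r = g(h)                                   # reconstruction
    recon = 0.5 * np.sum((r - x) ** 2)         # L(r, x), here squared error
    return recon - log_prior(h)                # adding -log P(h) as a penalty

# Example of a simple factorized prior on h: a standard Gaussian.
gaussian_log_prior = lambda h: -0.5 * np.sum(h ** 2) - 0.5 * h.size * np.log(2.0 * np.pi)
```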

15.3 Representational Power, Layer Size and Depth

Autoencoders are often trained with only a single layer encoder and a single
layer decoder. However, this is not a requirement, and using deep encoders and
decoders offers many advantages.

Recall from Sec. 6.6 that there are many advantages to depth in a feed-forward
network. Because auto-encoders are feed-forward networks, these advantages also
apply to auto-encoders. Moreover, the encoder is itself a feed-forward network,
as is the decoder, so each of these components of the auto-encoder can
individually benefit from depth.

One major advantage of non-trivial depth is that the universal approximator
theorem guarantees that a feedforward neural network with at least one hidden
layer can represent an approximation of any function (within a broad class) to
an arbitrary degree of accuracy, provided that it has enough hidden units. This
means that an autoencoder with a single hidden layer is able to represent the
identity function along the domain of the data arbitrarily well. However, the
mapping from input to code is shallow. This means that we are not able to
enforce arbitrary constraints, such as that the code should be sparse. A deep
autoencoder, with at least one additional hidden layer inside the encoder
itself, can approximate any mapping from input to code arbitrarily well, given
enough hidden units.

The above viewpoint also motivates overcomplete autoencoders, that is,
autoencoders with very wide layers, in order to achieve a rich family of
possible functions.

Depth can exponentially reduce the computational cost of evaluating a
representation of some functions, and can also exponentially decrease the
amount of training data needed to learn some functions.

Experimentally, deep auto-encoders yield much better compression than
corresponding shallow or linear auto-encoders (Hinton and Salakhutdinov, 2006).

A common strategy for training a deep autoencoder is to greedily pre-train the
deep architecture by training a stack of shallow auto-encoders, so we often
encounter shallow auto-encoders, even when the ultimate goal is to train a deep
auto-encoder.
15.4 Reconstruction Distribution

The above parts (encoder function f, decoder function g, reconstruction loss L)
make sense when the loss L is simply the squared reconstruction error, but
there are many cases where this is not appropriate, e.g., when x is a vector of
discrete variables or when P(x | h) is not well approximated by a Gaussian
distribution (see the link between squared error and normal density in Sections
5.6 and 6.3.2). Just like in the case of other types of neural networks
(starting with the feedforward neural networks, Section 6.3.2), it is
convenient to define the loss L as a negative log-likelihood over some target
random variables. This probabilistic interpretation is particularly important
for the discussion in Sections 20.9.3, 20.11 and 20.12 about generative
extensions of auto-encoders and stochastic recurrent networks, where the output
of the auto-encoder is interpreted as a probability distribution P(x | h), for
reconstructing x, given hidden units h. This distribution captures not just the
expected reconstruction but also the uncertainty about the original x (which
gave rise to h, either deterministically or stochastically, given h). In the
simplest and most ordinary cases, this distribution factorizes, i.e.,
P(x | h) = ∏_i P(x_i | h). This covers the usual cases of x_i | h being
Gaussian (for unbounded real values) and x_i | h having a Bernoulli
distribution (for binary values x_i), but one can readily generalize this to
other distributions, such as mixtures (see Sections 3.10.6 and 6.3.2).

Thus we can generalize the notion of decoding function g(h) to decoding
distribution P(x | h). Similarly, we can generalize the notion of encoding
function f(x) to encoding distribution Q(h | x), as illustrated in Figure 15.3.
We use this to capture the fact that noise is injected at the level of the
representation h, now considered like a latent variable. This generalization is
crucial in the development of the variational auto-encoder (Section 20.9.3) and
the generalized stochastic networks (Section 20.12).

We also find a stochastic encoder and a stochastic decoder in the RBM,
described in Section 20.2. In that case, the encoding distribution Q(h | x) and
P(x | h) match, in the sense that Q(h | x) = P(h | x), i.e., there is a unique
joint distribution which has both Q(h | x) and P(x | h) as conditionals. This
is not true in general for two independently parametrized conditionals like
Q(h | x) and P(x | h), although the work on generative stochastic networks
(Alain et al., 2015) shows that learning will tend to make them compatible
asymptotically (with enough capacity and examples).
Figure 15.3: Basic scheme of a stochastic auto-encoder, in which both the
encoder and the decoder are not simple functions but instead involve some noise
injection, meaning that their output can be seen as sampled from a
distribution, Q(h | x) for the encoder and P(x | h) for the decoder. RBMs are a
special case where P = Q (in the sense of a unique joint distribution
corresponding to both conditionals) but in general these two conditional
distributions are not necessarily compatible with a unique joint distribution
P(x, h).
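To make the choice of reconstruction distribution concrete, here is a small
sketch of the two factorized cases mentioned above, written as negative
log-likelihoods; the fixed Gaussian variance and the clipping constant are
illustrative assumptions.

```python
import numpy as np

def gaussian_nll(x, mean, sigma=1.0):
    """-log P(x | h) for a factorized Gaussian decoder with fixed variance:
    up to constants and scale, this is the squared reconstruction error."""
    return np.sum(0.5 * ((x - mean) / sigma) ** 2 + np.log(sigma * np.sqrt(2.0 * np.pi)))

def bernoulli_nll(x, p, eps=1e-8):
    """-log P(x | h) for binary x with a factorized Bernoulli decoder,
    i.e. the usual cross-entropy reconstruction loss."""
    p = np.clip(p, eps, 1.0 - eps)
    return -np.sum(x * np.log(p) + (1.0 - x) * np.log(1.0 - p))
```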

15.5 Linear Factor Models

Now that we have introduced the notion of a probabilistic decoder, let us focus
on a very special case where the latent variable h generates x via a linear
transformation plus noise, i.e., classical linear factor models, which do not
necessarily have a corresponding parametric encoder.

The idea of discovering explanatory factors that have a simple joint
distribution among themselves is old, e.g., see Factor Analysis (see below),
and has been explored first in the context where the relationship between
factors and data is linear, i.e., we assume that the data was generated as
follows. First, sample the real-valued factors,

  h ∼ P(h),                                                           (15.2)

and then sample the real-valued observable variables given the factors:

  x = Wh + b + noise                                                  (15.3)

where the noise is typically Gaussian and diagonal (independent across
dimensions). This is illustrated in Figure 15.4.
Figure 15.4: Basic scheme of a linear factors model, in which we assume that an
observed data vector x is obtained by a linear combination of latent factors h,
plus some noise. Different models, such as probabilistic PCA, factor analysis
or ICA, make different choices about the form of the noise and of the prior
P(h).
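A sketch of ancestral sampling from Eqs. 15.2 and 15.3, assuming (purely for
illustration) a standard Gaussian prior P(h) and diagonal Gaussian noise; the
dimensions and noise level are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d_x = 3, 5
W = rng.normal(size=(d_x, d_h))
b = np.zeros(d_x)
noise_std = 0.1 * np.ones(d_x)                    # diagonal (independent) noise

h = rng.normal(size=d_h)                          # Eq. 15.2: sample the factors
x = W @ h + b + noise_std * rng.normal(size=d_x)  # Eq. 15.3: linear map plus noise
```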

15.6 Probabilistic PCA and Factor Analysis

Probabilistic PCA (Principal Components Analysis), factor analysis and other
linear factor models are special cases of the above equations (15.2 and 15.3)
and only differ in the choices made for the prior (over latent variables, not
parameters) and noise distributions.

In factor analysis (Bartholomew, 1987; Basilevsky, 1994), the latent variable
prior is just the unit variance Gaussian

  h ∼ N(0, I)

while the observed variables x_i are assumed to be conditionally independent,
given h, i.e., the noise is assumed to be coming from a diagonal covariance
Gaussian distribution, with covariance matrix ψ = diag(σ²), where
σ² = (σ²_1, σ²_2, ...) is a vector of per-variable variances.

The role of the latent variables is thus to capture the dependencies between
the different observed variables x_i. Indeed, it can easily be shown that x is
just a Gaussian-distributed (multivariate normal) random variable, with

  x ∼ N(b, W W^T + ψ)

where we see that the weights W induce a dependency between two variables x_i
and x_j through a kind of auto-encoder path, whereby x_i influences
h_k = W_{k,:} x via w_{k,i} (for every k) and h_k influences x_j via w_{k,j}.

In order to cast PCA in a probabilistic framework, we can make a slight
modification to the factor analysis model, making the conditional variances
equal to each other. In that case the covariance of x is just W W^T + σ²I,
where σ² is now a scalar, i.e.,

  x ∼ N(b, W W^T + σ²I)

or equivalently

  x = Wh + b + σz

where z ∼ N(0, I) is white noise. Tipping and Bishop (1999) then show an
iterative EM algorithm for estimating the parameters W and σ².

What the probabilistic PCA model is basically saying is that the covariance is
mostly captured by the latent variables h, up to some small residual
reconstruction error σ². As shown by Tipping and Bishop (1999), probabilistic
PCA becomes PCA as σ → 0. In that case, the conditional expected value of h
given x becomes an orthogonal projection onto the space spanned by the d
columns of W, like in PCA. See Section 17.1 for a discussion of the inference
mechanism associated with PCA (probabilistic or not), i.e., recovering the
expected value of the latent factors h given the observed input x. That section
also explains the very insightful geometric and manifold interpretation of PCA.

However, as σ → 0, the density model becomes very sharp around these d
dimensions spanned by the columns of W, as discussed in Section 17.1, which
would not make it a very faithful model of the data, in general (not just
because the data may live on a higher-dimensional manifold, but more
importantly because the real data manifold may not be a flat hyperplane - see
Chapter 17 for more).
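A small numerical check of the marginal covariance formula above,
x ∼ N(b, W W^T + σ²I), by ancestral sampling; the dimensions, noise level and
sample size are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d_x, sigma = 2, 4, 0.1
W = rng.normal(size=(d_x, d_h))

n = 200_000
H = rng.normal(size=(n, d_h))                       # h ~ N(0, I)
X = H @ W.T + sigma * rng.normal(size=(n, d_x))     # x = W h + b + sigma*z, with b = 0

empirical = np.cov(X, rowvar=False)
analytic = W @ W.T + sigma ** 2 * np.eye(d_x)
print(np.max(np.abs(empirical - analytic)))         # close to 0 for large n
```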

15.6.1 ICA

Independent Component Analysis (ICA) is among the oldest representation
learning algorithms (Herault and Ans, 1984; Jutten and Herault, 1991; Comon,
1994; Hyvärinen, 1999; Hyvärinen et al., 2001). It is an approach to modeling
linear factors that seeks non-Gaussian projections of the data. Like
probabilistic PCA and factor analysis, it also fits the linear factor model of
Eqs. 15.2 and 15.3. What is particular about ICA is that unlike PCA and factor
analysis it does not assume that the latent variable prior is Gaussian. It only
assumes that it is factorized, i.e.,

  P(h) = ∏_i P(h_i).                                                  (15.4)

Since there is no parametric assumption behind the prior, we are really in
front of a so-called semi-parametric model, with parts of the model being
parametric (P(x | h)) and parts being non-specified or non-parametric (P(h)).
In fact, this typically yields non-Gaussian priors: if the priors were
Gaussian, then one could not distinguish between the factors h and a rotation
of h. Indeed, note that if

  h = Uz

with U an orthonormal (rotation) square matrix, i.e.,

  z = U^T h,

then, although h might have a Normal(0, I) distribution, the z also have unit
covariance, i.e., they are uncorrelated:

  Var[z] = E[z z^T] = E[U^T h h^T U] = U^T Var[h] U = U^T U = I.

In other words, imposing independence among Gaussian factors does not allow one
to disentangle them, and we could as well recover any linear rotation of these
factors. It means that, given the observed x, even though we might assume the
right generative model, PCA cannot recover the original generative factors.
However, if we assume that the latent variables are non-Gaussian, then we can
recover them, and this is what ICA is trying to achieve. In fact, under these
generative model assumptions, the true underlying factors can be recovered
(Comon, 1994). Many ICA algorithms are therefore looking for projections of the
data s = Vx such that they are maximally non-Gaussian. An intuitive explanation
for these approaches is that although the true latent variables h may be
non-Gaussian, almost any linear combination of them will look more Gaussian,
because of the central limit theorem. Since linear combinations of the x_j's
are also linear combinations of the h_j's, to recover the h_j's we just need to
find the linear combinations that are maximally non-Gaussian (while keeping
these different projections orthogonal to each other).

There is an interesting connection between ICA and sparsity, since the dominant
form of non-Gaussianity in real data is due to sparsity, i.e., concentration of
probability at or near 0. Non-Gaussian distributions typically have more mass
around zero, although you can also get non-Gaussianity by increasing skewness,
asymmetry, or kurtosis.

Just as PCA can be generalized to the non-linear auto-encoders described later
in this chapter, ICA can be generalized to a non-linear generative model, e.g.,
x = f(h) + noise. See Hyvärinen and Pajunen (1999) for the initial work on
non-linear ICA and its successful use with ensemble learning by Roberts and
Everson (2001) and Lappalainen et al. (2000).
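A small numerical illustration of the argument above: rotating Gaussian factors
leaves their covariance (indeed their whole distribution) unchanged, while
rotating non-Gaussian (here Laplace) factors changes higher-order statistics
such as kurtosis, which is what ICA can exploit. The sample size and the
kurtosis measure are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100_000, 3
U, _ = np.linalg.qr(rng.normal(size=(d, d)))     # a random orthonormal (rotation) matrix

h_gauss = rng.normal(size=(n, d))                # Gaussian factors
z = h_gauss @ U.T                                # an orthonormal mixing of the factors
print(np.round(np.cov(z, rowvar=False), 2))      # still ~identity: the rotation is invisible

def excess_kurtosis(s):
    """Simple per-column non-Gaussianity measure (0 for a Gaussian)."""
    return np.mean(s ** 4, axis=0) / np.mean(s ** 2, axis=0) ** 2 - 3.0

h_laplace = rng.laplace(size=(n, d))                    # sparse, non-Gaussian factors
print(np.round(excess_kurtosis(h_laplace), 2))          # ~3 for each factor
print(np.round(excess_kurtosis(h_laplace @ U.T), 2))    # closer to 0 after mixing (CLT)
```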

15.6.2 Sparse Coding as a Generative Model
One particularly interesting form of non-Gaussianit


non-Gaussianity
y arises with distributions that
are sparse. These typically ha
have
ve not just a peak at 0 but also a fat tail . Lik
Likee the
probability
probabilit
yinteresting
going to 0 asform
the vof
alues
increase in magnitude
a rate
that is slower
slower that
than
Onewith
particularly
non-Gaussianit
y arisesatwith
distributions
are
sparse. These
have in
notthejust
a peak at 0 but also a fat tail . Like the
the Gaussian,
i.e., lesstypically
than quadratic
log-domain.
with probability going to 0 as the values increase in magnitude at a rate that is slower than
the Gaussian, i.e., less than quadratic in the log-domain.
477
477

CHAPTER 15. LINEAR FACTOR MODELS AND AUTO-ENCODERS


CHAPTER 15. LINEAR FACTOR MODELS AND AUTO-ENCODERS

Like the other linear factor models (Eq. 15.3), sparse coding corresponds to a linear factor model, but one with a sparse latent variable h, i.e., P(h) puts high probability at or around 0. Unlike with ICA (previous section), the latent variable prior is parametric. For example, the factorized Laplace density prior is
P(h) = \prod_i P(h_i) = \prod_i \frac{\lambda}{2} e^{-\lambda |h_i|}    (15.5)
and the factorized Student-t prior is
P(h) = \prod_i P(h_i) \propto \prod_i \frac{1}{\left(1 + \frac{h_i^2}{\nu}\right)^{\frac{\nu+1}{2}}}.    (15.6)
Both of these densities have a strong preference for near-zero values but, unlike the Gaussian, accommodate large values. In the standard sparse coding models, the reconstruction noise is assumed to be Gaussian, so that the corresponding reconstruction error is the squared error.

Regarding sparsity, note that the actual value h = 0 has zero measure under both densities, meaning that the posterior distribution P(h | x) will not generate values h = 0. However, sparse coding is normally considered under a maximum a posteriori (MAP) inference framework, in which the inferred values of h are those that maximize the posterior, and these often turn out to be exactly zero if the prior is sufficiently concentrated around 0. The inferred values are those defined in Eq. 15.1, reproduced here:
h = f(x) = \arg\min_h L(g(h), x) + \Omega(h)
where L(g(h), x) is interpreted as -log P(x | g(h)) and Ω(h) as -log P(h). This MAP inference view of sparse coding and an interesting probabilistic interpretation of sparse coding are further discussed in Section 19.3.
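As a concrete illustration of this MAP inference, the following NumPy sketch specializes Eq. 15.1 to a linear decoder g(h) = Wh, squared reconstruction error and the Laplace prior of Eq. 15.5 (so that Ω(h) = λΣ|h_i|), and minimizes it with the iterative shrinkage-thresholding algorithm (ISTA). The dictionary, λ and the number of steps are arbitrary choices for the demo.

```python
import numpy as np

def ista_infer(x, W, lam=0.5, n_steps=100):
    """MAP inference h = argmin_h 0.5*||x - W h||^2 + lam*||h||_1."""
    L = np.linalg.norm(W, 2) ** 2            # Lipschitz constant of the quadratic term's gradient
    h = np.zeros(W.shape[1])
    for _ in range(n_steps):
        grad = W.T @ (W @ h - x)             # gradient of the reconstruction term
        z = h - grad / L                     # gradient step
        # soft-thresholding = proximal operator of lam*||.||_1
        h = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return h

rng = np.random.default_rng(0)
W = rng.normal(size=(20, 50))                # decoder weights (overcomplete dictionary)
h_true = np.zeros(50)
h_true[rng.choice(50, 3, replace=False)] = 3.0 * rng.normal(size=3)
x = W @ h_true + 0.01 * rng.normal(size=20)

h = ista_infer(x, W, lam=0.5)
print("non-zeros in inferred h:", np.count_nonzero(h))
print("reconstruction error:", np.linalg.norm(x - W @ h))
```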
To relate the generative model of sparse coding to ICA, note how the prior P(h) imposes not just sparsity but also independence of the latent variables h, which may help to separate different explanatory factors, unlike PCA, factor analysis or probabilistic PCA, because these rely on a Gaussian prior, which yields a factorized prior under any rotation of the factors, i.e., multiplication by an orthonormal matrix, as demonstrated in Section 15.6.1. See Section 17.2 about the manifold interpretation of sparse coding.

TODO: relate to and point to Spike-and-slab sparse coding (Goodfellow et al., 2012) (section?)
15.7 Reconstruction Error as Log-Likelihood
Although traditional auto-encoders (like traditional neural networks) were introduced with an associated training loss, just like for neural networks, that training loss can generally be given a probabilistic interpretation as a conditional log-likelihood of the original input x, given the representation h. We have already covered negative log-likelihood as a loss function for feedforward neural networks in Section 6.3.2. Like the prediction error for regular feedforward neural networks, the reconstruction error for auto-encoders does not have to be squared error. When we view the loss as negative log-likelihood, we interpret the reconstruction error as
L = -\log P(x \mid h)

where h is the representation, which may generally be obtained through an encoder taking x as input.

Figure 15.5: The computational graph of an auto-encoder, which is trained to maximize the probability assigned by the decoder g to the data point x, given the output of the encoder h = f(x). The training objective is thus L = -log P(x | g(f(x))), which ends up being the squared reconstruction error if we choose a Gaussian reconstruction distribution with mean g(f(x)), and the cross-entropy if we choose a factorized Bernoulli reconstruction distribution with means g(f(x)).
An advantage of this view is that it immediately tells us what kind of loss function one should use depending on the nature of the input. If the input is real-valued and unbounded, then squared error is a reasonable choice of reconstruction
error, and corresponds to P(x | h) being Normal. If the input is a vector of bits, then cross-entropy is a more reasonable choice, and corresponds to P(x | h) = \prod_i P(x_i | h) with x_i | h being Bernoulli-distributed. We then view the decoder g(h) as computing the parameters of the reconstruction distribution, i.e., P(x | h) = P(x | g(h)).

Another advantage of this view is that we can think about the training of the decoder as estimating the conditional distribution P(x | h), which comes in handy in the probabilistic interpretation of denoising auto-encoders, allowing us to talk about the distribution P(x) explicitly or implicitly represented by the auto-encoder (see Sections 15.9, 20.9.3 and 20.11 for more details). In the same spirit, we can rethink the notion of encoder from a simple function to a conditional distribution Q(h | x), with a special case being when Q(h | x) is a Dirac at some particular value. Equivalently, thinking about the encoder as a distribution corresponds to injecting noise inside the auto-encoder. This view is developed further in Sections 20.9.3 and 20.12.
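The following sketch (with toy inputs, assumed shapes, and NumPy as an implementation choice) spells out the two standard cases: a Gaussian reconstruction distribution makes the negative log-likelihood a squared error up to constants, while a factorized Bernoulli reconstruction distribution makes it the cross-entropy.

```python
import numpy as np

def gaussian_nll(x, mean, sigma=1.0):
    """-log P(x | h) for P(x | h) = N(x; g(h), sigma^2 I): squared error up to constants."""
    d = x.size
    return 0.5 * np.sum((x - mean) ** 2) / sigma ** 2 + 0.5 * d * np.log(2 * np.pi * sigma ** 2)

def bernoulli_nll(x, p, eps=1e-12):
    """-log P(x | h) for a factorized Bernoulli with means p = g(h): the cross-entropy."""
    p = np.clip(p, eps, 1 - eps)
    return -np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

rng = np.random.default_rng(0)
x_real = rng.normal(size=10)                         # real-valued, unbounded input
recon_real = x_real + 0.1 * rng.normal(size=10)      # decoder output (illustrative)
print("Gaussian NLL (squared error up to constants):", gaussian_nll(x_real, recon_real))

x_bits = rng.integers(0, 2, size=10).astype(float)   # vector of bits
recon_probs = x_bits * 0.9 + 0.05                    # decoder outputs in (0, 1)
print("Bernoulli NLL (cross-entropy):", bernoulli_nll(x_bits, recon_probs))
```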

15.8 Sparse Representations
Sparse auto-encoders are auto-encoders which learn a sparse representation, i.e., one whose elements are often either zero or close to zero. Sparse coding was introduced in Section 15.6.2 as a linear factor model in which the prior P(h) on the representation h = f(x) encourages values at or near 0. In Section 15.8.1, we see how ordinary auto-encoders can be prevented from learning a useless identity transformation by using a sparsity penalty rather than a bottleneck. The main difference between a sparse auto-encoder and sparse coding is that sparse coding has no explicit parametric encoder, whereas sparse auto-encoders have one. The encoder of sparse coding is the algorithm that performs the approximate inference, i.e., looks for
h^*(x) = \arg\max_h \log P(h \mid x) = \arg\min_h \frac{\|x - (b + W h)\|^2}{\sigma^2} - \log P(h)    (15.7)
where σ² is a reconstruction variance parameter (which should equal the average squared reconstruction error, but can be lumped into the regularizer defined in Eq. 15.8, for example), and P(h) is a sparse prior that puts more probability mass around h = 0, such as the Laplacian prior, with factorized marginals

P(h_i) = \frac{\lambda}{2} e^{-\lambda |h_i|}    (15.8)

(where λ controls the strength of the sparsity prior),
or the Student-t prior, with factorized marginals

P(h_i) \propto \frac{1}{\left(1 + \frac{h_i^2}{\nu}\right)^{\frac{\nu+1}{2}}}.    (15.9)

The advantages of such a non-parametric encoder and the sparse coding approach over sparse auto-encoders are that

1. it can in principle minimize the combination of reconstruction error and log-prior better than any parametric encoder, and

2. it performs what is called explaining away (see Figure 13.8), i.e., it allows the model to choose some explanations (hidden factors) and inhibit the others.
The disadvantages are that

1. the computing time for encoding the given input x, i.e., performing inference (computing the representation h that goes with the given x), can be substantially larger than with a parametric encoder (because an optimization must be performed for each example x), and

2. the resulting encoder function could be non-smooth and possibly too non-linear (with two nearby x's being associated with very different h's), potentially making it more difficult for the downstream layers to generalize properly.

In Section 15.8.2, we describe PSD (Predictive Sparse Decomposition), which combines a non-parametric encoder (as in sparse coding, with the representation obtained via an optimization) and a parametric encoder (like in the sparse auto-encoder). Section 15.9 introduces the Denoising Auto-Encoder (DAE), which puts pressure on the representation by requiring it to extract information about the underlying distribution and where it concentrates, so as to be able to denoise a corrupted input. Section 15.10 describes the Contractive Auto-Encoder (CAE), which optimizes an explicit regularization penalty that aims at making the representation as insensitive as possible to the input, while keeping the information sufficient to reconstruct the training examples.

15.8.1 Sparse Auto-Encoders
A sparse auto-encoder is simply an auto-encoder whose training criterion involves a sparsity penalty Ω(h) in addition to the reconstruction error:

L = -\log P(x \mid g(h)) + \Omega(h)    (15.10)
where g(h) is the decoder output and typically we have h = f(x), the encoder output. We can think of that penalty Ω(h) simply as a regularizer or as a log-prior on the representations h. For example, the sparsity penalty corresponding to the Laplace prior (\frac{\lambda}{2} e^{-\lambda |h_i|}) is the absolute value sparsity penalty (see also Eq. 15.8 above):
\Omega(h) = -\log P(h) = \sum_i \left( \lambda |h_i| - \log \frac{\lambda}{2} \right) = \text{const} + \lambda \sum_i |h_i|    (15.11)
where the constant term depends only on λ and not on h (which we typically ignore in the training criterion because we consider λ as a hyperparameter rather than a parameter). Similarly (as per Eq. 15.9), the sparsity penalty corresponding to the Student-t prior (Olshausen and Field, 1997) is

\Omega(h) = \sum_i \frac{\nu + 1}{2} \log\left(1 + \frac{h_i^2}{\nu}\right)    (15.12)
where ν is considered to be a hyperparameter.
The early work on sparse auto-encoders (Ranzato et al., 2007a, 2008) considered various forms of sparsity and proposed a connection between sparsity regularization and the partition function gradient in energy-based models (see Section TODO). The idea is that a regularizer such as sparsity makes it difficult for an auto-encoder to achieve zero reconstruction error everywhere. If we consider reconstruction error as a proxy for energy (unnormalized log-probability of the data), then minimizing the training set reconstruction error forces the energy to be low on training examples, while the regularizer prevents it from being low everywhere. The same role is played by the gradient of the partition function in energy-based models such as the RBM (Section TODO).

However, the sparsity penalty of sparse auto-encoders does not need to have a probabilistic interpretation. For example, Goodfellow et al. (2009) successfully used the following sparsity penalty, which does not try to bring h_i all the way down to 0, but only towards some low target value such as ρ = 0.05:

\Omega(h) = -\sum_i \left[ \rho \log h_i + (1 - \rho) \log(1 - h_i) \right]    (15.13)

where 0 < h_i < 1, usually with h_i = sigmoid(a_i). This is just the cross-entropy between the Bernoulli distribution with probability p = h_i and the target Bernoulli distribution with probability p = ρ.
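As a hypothetical illustration (arbitrary encoder parameters, NumPy implementation), the sketch below computes two of the sparsity penalties just described for a sigmoid encoder: the absolute value penalty of Eq. 15.11 and the target-sparsity cross-entropy penalty of Eq. 15.13. Either quantity would simply be added to the reconstruction error of Eq. 15.10 during training.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def l1_penalty(h, lam=0.1):
    """Eq. 15.11 (dropping the constant): lam * sum_i |h_i|."""
    return lam * np.sum(np.abs(h))

def target_sparsity_penalty(h, rho=0.05):
    """Eq. 15.13: cross-entropy between Bernoulli(h_i) and the target Bernoulli(rho)."""
    h = np.clip(h, 1e-7, 1 - 1e-7)
    return -np.sum(rho * np.log(h) + (1 - rho) * np.log(1 - h))

rng = np.random.default_rng(0)
x = rng.normal(size=50)
W, b = 0.1 * rng.normal(size=(20, 50)), np.zeros(20)   # illustrative encoder parameters
h = sigmoid(W @ x + b)                                  # encoder output, 0 < h_i < 1

print("L1 sparsity penalty:     ", l1_penalty(h))
print("target-sparsity penalty: ", target_sparsity_penalty(h, rho=0.05))
```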
One way to achieve actual zeros in h for sparse (and denoising) auto-encoders was introduced in Glorot et al. (2011c). The idea is to use a half-rectifier (a.k.a. rectifier) or ReLU (Rectified Linear Unit), introduced in Glorot et al. (2011b) for deep supervised networks and earlier in Nair and Hinton (2010a) in the context of RBMs, as the output non-linearity of the encoder. With a prior that actually pushes the representations to zero (like the absolute value penalty), one can thus indirectly control the average number of zeros in the representation. ReLUs were first successfully used for deep feedforward networks in Glorot et al. (2011a), achieving for the first time the ability to train fairly deep supervised networks without the need for unsupervised pre-training, and this turned out to be an important component in the 2012 object recognition breakthrough with deep convolutional networks (Krizhevsky et al., 2012b).

Interestingly, the regularizer used in sparse auto-encoders does not conform to the classical interpretation of regularizers as priors on the parameters. That classical interpretation of the regularizer comes from the MAP (Maximum A Posteriori) point estimation (see Section 5.5.1) of parameters associated with the Bayesian view of parameters as random variables, considering the joint distribution of data x and parameters θ (see Section 5.7):

\arg\max_\theta P(\theta \mid x) = \arg\max_\theta \left( \log P(x \mid \theta) + \log P(\theta) \right)

where the first term on the right is the usual data log-likelihood term and the second term, the log-prior over parameters, incorporates the preference over particular values of θ.

With regularized auto-encoders such as sparse auto-encoders and contractive auto-encoders, instead, the regularizer corresponds to a log-prior over the representation, or over latent variables. In the case of sparse auto-encoders, predictive sparse decomposition and contractive auto-encoders, the regularizer specifies a preference over functions of the data, rather than over parameters. This makes such a regularizer data-dependent, unlike the classical parameter log-prior. Specifically, in the case of the sparse auto-encoder, it says that we prefer an encoder whose output produces values closer to 0. Indirectly (when we marginalize over the training distribution), this is also indicating a preference over parameters, of course.

15.8.2 Predictive Sparse Decomposition
TODO: we have too many forward refs to this section. There are 150 lines about PSD in this section and at least 20 lines of forward references to this section in this chapter, some of which are just 100 lines away.

Predictive sparse decomposition (PSD) is a variant that combines sparse coding and a parametric
encoder (Kavukcuoglu et al., 2008b), i.e., it has both a parametric encoder and iterative inference. It has been applied to unsupervised feature learning for object recognition in images and video (Kavukcuoglu et al., 2009, 2010b; Jarrett et al., 2009a; Farabet et al., 2011), as well as for audio (Henaff et al., 2011). The representation is considered to be a free variable (possibly a latent variable if we choose a probabilistic interpretation) and the training criterion combines a sparse coding criterion with a term that encourages the optimized sparse representation h (after inference) to be close to the output of the encoder f(x):
argbe
min
||
||x
xto
the
g(h)output
|| + |of
h| the
+ enco
||
||h
h
f (fx()x||):
(15.14)
L = arg min x g(h) + h + h f (x)
(15.14)
where f is the encoder and g is the decoder. Like in sparse coding, for each example x an iterative optimization is performed in order to obtain a representation h. However, because the iterations can be initialized from the output of the encoder, i.e., h = f(x), only a few steps (e.g. 10) are necessary to obtain good results. Simple gradient descent on h has been used by the authors. After h is settled, both g and f are updated towards minimizing the above criterion. The first two terms are the same as in L1 sparse coding, while the third one encourages f to predict the outcome of the sparse coding optimization, making it a better choice for the initialization of the iterative optimization. Hence f can be used as a parametric approximation to the non-parametric encoder implicitly defined by sparse coding. It is one of the first instances of learned approximate inference (see also Sec. 19.8). Note that this is different from separately doing sparse coding (i.e., training g) and then training an approximate inference mechanism f, since both the encoder and the decoder are trained together to be compatible with each other. Hence the decoder will be learned in such a way that inference will tend to find solutions that can be well approximated by the approximate inference.

TODO: this is probably too much forward reference; when we bring these things in we can remind people that they resemble PSD, but it doesn't really help the reader to say that the thing we are describing now is similar to things they haven't seen yet.

A similar example is the variational auto-encoder, in which the encoder acts as approximate inference for the decoder, and both are trained jointly (Section 20.9.3). See also Section 20.9.4 for a probabilistic interpretation of PSD in terms of a variational lower bound on the log-likelihood.
In practical applications of PSD, the iterative optimization is only used during training, and f is used to compute the learned features. It makes computation fast at recognition time and also makes it easy to use the trained features f as initialization (unsupervised pre-training) for the lower layers of a deep net. Like other unsupervised feature learning schemes, PSD can be stacked greedily, e.g., training a second PSD on top of the features extracted by the first one, etc.
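A minimal sketch of the PSD criterion of Eq. 15.14 for a single example is given below: the code h is optimized by a few gradient steps initialized at the encoder output f(x), exactly as described above; the subsequent parameter update is only indicated in a comment. The architecture, λ, γ and step sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, lam, gamma = 20, 40, 0.5, 1.0

Wd = 0.1 * rng.normal(size=(d, k))                      # decoder g(h) = Wd @ h
We, be = 0.1 * rng.normal(size=(k, d)), np.zeros(k)     # encoder f(x) = tanh(We @ x + be)

def f(x):
    return np.tanh(We @ x + be)

def psd_criterion(x, h):
    """L = ||x - g(h)||^2 + lam*|h|_1 + gamma*||h - f(x)||^2   (Eq. 15.14)."""
    return (np.sum((x - Wd @ h) ** 2)
            + lam * np.sum(np.abs(h))
            + gamma * np.sum((h - f(x)) ** 2))

x = rng.normal(size=d)
h = f(x)                                 # initialize inference at the encoder output
for _ in range(10):                      # a few gradient steps on h, as in the text
    grad = (-2 * Wd.T @ (x - Wd @ h)
            + lam * np.sign(h)
            + 2 * gamma * (h - f(x)))
    h = h - 0.05 * grad

print("criterion after inference:", psd_criterion(x, h))
# In full training, Wd, We and be would then be updated by gradient descent
# on this same criterion, holding the inferred h fixed.
```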
15.9 Denoising Auto-Encoders
The Denoising Auto-Encoder (DAE) was first proposed (Vincent et al., 2008, 2010) as a means of forcing an auto-encoder to learn to capture the data distribution without an explicit constraint on either the dimension or the sparsity of the learned representation. It was motivated by the idea that in order to fully capture a complex distribution, an auto-encoder needs to have at least as many hidden units as needed by the complexity of that distribution. Hence its dimensionality should not be restricted to the input dimension.

The principle of the denoising auto-encoder is deceptively simple and illustrated in Figure 15.6: the encoder sees as input a corrupted version of the input, but the decoder tries to reconstruct the clean uncorrupted input.

Figure 15.6: The computational graph of a denoising auto-encoder, which is trained to reconstruct the clean data point x from its corrupted version x̃, i.e., to minimize the loss L = -log P(x | g(f(x̃))), where x̃ is a corrupted version of the data example x, obtained through a given corruption process C(x̃ | x).

Mathematically, and following the notations used in this chapter, this can be
formalized as follows. We introduce a corruption process C(x̃ | x) which represents a conditional distribution over corrupted samples x̃, given a data sample x. The auto-encoder then learns a reconstruction distribution P(x | x̃) estimated from training pairs (x, x̃), as follows:

1. Sample a training example x from the data generating distribution (the training set).

2. Sample a corrupted version x̃ from the conditional distribution C(x̃ | x).
3. Use the pair (x, x̃) as a training example for estimating the auto-encoder reconstruction distribution P(x | x̃) = P(x | g(h)), with h the output of encoder f(x̃) and g(h) the output of the decoder.
Typically we can simply perform gradient-based approximate minimization (such as minibatch gradient descent) on the negative log-likelihood -log P(x | h), i.e., the denoising reconstruction error, using back-propagation to compute gradients, just like for regular feedforward neural networks (the only difference being the corruption of the input and the choice of target output).

We can view this training objective as performing stochastic gradient descent on the denoising reconstruction error, where the noise now has two sources:

1. the choice of training sample x from the data set, and

2. the random corruption applied to x to obtain x̃.

We can therefore consider that the DAE is performing stochastic gradient descent on the following expectation:

-\mathbb{E}_{x \sim Q(x)} \, \mathbb{E}_{\tilde{x} \sim C(\tilde{x} \mid x)} \left[ \log P(x \mid g(f(\tilde{x}))) \right]

where Q(x) is the training distribution.
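The three steps above can be sketched as follows for Gaussian corruption and a Gaussian reconstruction distribution (squared error). The tiny one-hidden-layer auto-encoder, its initialization and the manually written back-propagation are assumptions made for the sake of a self-contained example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, sigma, lr = 20, 10, 0.3, 0.01

# Illustrative parameters of a tiny one-hidden-layer auto-encoder.
W1, b1 = 0.1 * rng.normal(size=(k, d)), np.zeros(k)   # encoder f
W2, b2 = 0.1 * rng.normal(size=(d, k)), np.zeros(d)   # decoder g

def dae_step(x):
    global W1, b1, W2, b2
    # 1. x is a training example sampled from the data generating distribution.
    # 2. Sample a corrupted version from C(x_tilde | x) = N(x_tilde; x, sigma^2 I).
    x_tilde = x + sigma * rng.normal(size=x.shape)
    # 3. Maximize log P(x | g(f(x_tilde))): with a Gaussian reconstruction
    #    distribution this amounts to minimizing ||g(f(x_tilde)) - x||^2.
    a = W1 @ x_tilde + b1
    h = np.tanh(a)                 # encoder output f(x_tilde)
    r = W2 @ h + b2                # decoder output g(h)
    loss = 0.5 * np.sum((r - x) ** 2)
    # Back-propagation, exactly as for a regular feedforward network,
    # except that the input is corrupted and the target is the clean x.
    dr = r - x
    dW2, db2 = np.outer(dr, h), dr
    dh = W2.T @ dr
    da = dh * (1 - h ** 2)
    dW1, db1 = np.outer(da, x_tilde), da
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
    return loss

X = rng.normal(size=(200, d))      # toy training set
for epoch in range(5):
    losses = [dae_step(x) for x in X]
    print(f"epoch {epoch}: mean denoising loss = {np.mean(losses):.3f}")
```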

Figure 15.7: A denoising auto-encoder is trained to reconstruct the clean data point x from its corrupted version x̃. In the figure, we illustrate the corruption process C(x̃ | x) (grey circle of equiprobable corruptions, and grey arrow for the corruption process) acting on examples x (red crosses) lying near a low-dimensional manifold near which the probability concentrates. When the denoising auto-encoder is trained to minimize the average of squared errors ||g(f(x̃)) - x||^2, the reconstruction g(f(x̃)) estimates E[x | x̃], which approximately points orthogonally towards the manifold, since it estimates the center of mass of the clean points x which could have given rise to x̃. The auto-encoder thus learns a vector field g(f(x)) - x (the green arrows), and it turns out that this vector field estimates the gradient field ∂ log Q(x)/∂x (up to a multiplicative factor that is the average root mean square reconstruction error), where Q is the unknown data generating distribution.

15.9.1 Learning a Vector Field that Estimates a Gradient Field
As illustrated in Figure 15.7, a very important property of DAEs is that their training criterion makes the auto-encoder learn a vector field (g(f(x)) - x) that estimates the gradient field (or score) ∂ log Q(x)/∂x, as per Eq. 15.15. A first result in this direction was proven by Vincent (2011a), showing that minimizing squared reconstruction error in a denoising auto-encoder with Gaussian noise was related to score matching (Hyvärinen, 2005a), making the denoising criterion a regularized form of score matching called denoising score matching (Kingma and LeCun, 2010a). Score matching is an alternative to maximum likelihood and provides a consistent estimator. It is discussed further in Section 18.4. The denoising version is discussed in Section 18.5.
The connection between denoising auto-encoders and score matching was first made (Vincent, 2011a) in the case where the denoising auto-encoder has a particular parametrization (one hidden layer, sigmoid activation functions on the hidden units, linear reconstruction), in which case the denoising criterion actually corresponds to a regularized form of score matching on a Gaussian RBM (with binomial hidden units and Gaussian visible units). The connection between ordinary auto-encoders and Gaussian RBMs had previously been made by Bengio and Delalleau (2009), who showed that contrastive divergence training of RBMs was related to an associated auto-encoder gradient, and later by Swersky (2010), who showed that non-denoising reconstruction error corresponded to score matching plus a regularizer.

The fact that the denoising criterion yields an estimator of the score for general encoder/decoder parametrizations has been proven (Alain and Bengio, 2012, 2013) in the case where the corruption and the reconstruction distributions are Gaussian (and of course x is continuous-valued), i.e., with the squared error denoising error

\|g(f(\tilde{x})) - x\|^2

and corruption

C(\tilde{x} \mid x) = \mathcal{N}(\tilde{x};\, \mu = x,\, \Sigma = \sigma^2 I)

with noise variance σ².

Figure 15.8: Vector field learned by a denoising auto-encoder around a 1-D curved manifold near which the data (orange circles) concentrates in a 2-D space. Each arrow is proportional to the reconstruction minus input vector of the auto-encoder and points towards higher probability according to the implicitly estimated probability distribution. Note that the vector field has zeros at both peaks of the estimated density function (on the data manifolds) and at troughs (local minima) of that density function, e.g., on the curve that separates different arms of the spiral or in the middle of it.

More precisely, the main theorem states that g(f(x)) - x is a consistent estimator of ∂ log Q(x)/∂x, where Q(x) is the data generating distribution,

g(f(x)) - x \;\propto\; \frac{\partial \log Q(x)}{\partial x}    (15.15)

so long as f and g have sufficient capacity to represent the true score (and assuming that the expected training criterion can be minimized, as usual when proving consistency associated with a training objective).

Note that in general, there is no guarantee that the reconstruction g(f(x)) minus the input x corresponds to the gradient of something (the estimated score should be the gradient of the estimated log-density with respect to the input
x). That is why the early results (Vincent, 2011a) are specialized to particular parametrizations where g(f(x)) - x is the derivative of something. See a more general treatment by Kamyshanska and Memisevic (2015).

Although it was intuitively appealing that in order to denoise correctly one must capture the training distribution, the above consistency result makes it mathematically very clear in what sense the DAE is capturing the input distribution: it is estimating the gradient of its energy function (i.e., of its log-density), i.e., learning to point towards more probable (lower energy) configurations. Figure 15.8 (see details of the experiment in Alain and Bengio (2013)) illustrates this. Note how the norm of the reconstruction error (i.e. the norm of the vectors shown in the figure) is related to but different from the energy (unnormalized log-density) associated with the estimated model. The energy should be low only where the probability is high. The reconstruction error (norm of the estimated score vector) is low where probability is near a peak of probability (or a trough of energy), but it can also be low at maxima of energy (minima of probability).

Section 20.11 continues the discussion of the relationship between denoising auto-encoders and probabilistic modeling by showing how one can generate from the distribution implicitly estimated by a denoising auto-encoder. Whereas Alain and Bengio (2013) generalized the score estimation result of Vincent (2011a) to arbitrary parametrizations, the result from Bengio et al. (2013b), discussed in Section 20.11, provides a probabilistic and in fact generative interpretation to every denoising auto-encoder.
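The relationship in Eq. 15.15 can be checked on a case where everything is known in closed form. For data drawn from N(0, I) with Gaussian corruption of variance σ², the Bayes-optimal denoising reconstruction is x/(1 + σ²); this closed form stands in for the g(f(x)) of a well-trained DAE, which is an assumption of the demo. The resulting vector field g(f(x)) - x is then exactly proportional to the true score ∂ log Q(x)/∂x = -x.

```python
import numpy as np

sigma = 0.25                       # corruption noise standard deviation

def optimal_dae_reconstruction(x):
    """For data ~ N(0, I) and Gaussian corruption of variance sigma^2, the
    Bayes-optimal denoiser is E[x_clean | x] = x / (1 + sigma^2).  This closed
    form stands in for g(f(x)) of a well-trained DAE (an assumption of the demo)."""
    return x / (1.0 + sigma ** 2)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))        # a few query points in input space

vector_field = optimal_dae_reconstruction(x) - x     # g(f(x)) - x
true_score = -x                                      # d/dx log N(x; 0, I) = -x

# Eq. 15.15: the learned vector field is proportional to the true score.
ratio = vector_field / true_score
print("(g(f(x)) - x) / (d log Q(x)/dx):")
print(ratio)                                         # constant = sigma^2 / (1 + sigma^2)
print("expected constant:", sigma ** 2 / (1 + sigma ** 2))
```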

15.10 Contractive Auto-Encoders
The Contractive Auto-Encoder or CAE (Rifai et al., 2011a,c) introduces an explicit regularizer on the code h = f(x), encouraging the derivatives of f to be as small as possible:

\Omega(h) = \left\| \frac{\partial f(x)}{\partial x} \right\|_F^2    (15.16)

which is the squared Frobenius norm (sum of squared elements) of the Jacobian matrix of partial derivatives associated with the encoder function. Whereas the denoising auto-encoder learns to contract the reconstruction function (the composition of the encoder and decoder), the CAE learns to specifically contract the encoder. See Figure 17.13 for a view of how contraction near the data points makes the auto-encoder capture the manifold structure.

If it weren't for the opposing force of reconstruction error, which attempts to make the code h keep all the information necessary to reconstruct training examples, the CAE penalty would yield a code h that is constant and does not
depend on the input x. The compromise between these two forces yields an auto-encoder whose derivatives are tiny in most directions, except those that are needed to reconstruct training examples, i.e., the directions that are tangent to the manifold near which data concentrate. Indeed, in order to distinguish (and thus, reconstruct correctly) two nearby examples on the manifold, one must assign them a different code, i.e., f(x) must vary as x moves from one to the other, i.e., in the direction of a tangent to the manifold.
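To make the penalty concrete, the following is a minimal NumPy sketch (not code from Rifai et al.) of how the contractive penalty of Equation 15.16 could be evaluated, assuming a single-hidden-layer sigmoid encoder h = sigmoid(Wx + b); all names, shapes and parameter values are illustrative. For such an encoder the Jacobian is diag(h(1-h))W, so its squared Frobenius norm reduces to a cheap per-unit sum, which is why the penalty is inexpensive when there is a single hidden layer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cae_penalty(W, b, x):
    """Squared Frobenius norm of the Jacobian dh/dx for the
    encoder h = sigmoid(W x + b) (cf. Equation 15.16).

    The Jacobian is diag(h * (1 - h)) @ W, so its squared Frobenius
    norm is sum_i (h_i (1 - h_i))^2 * ||W_i||^2.
    """
    h = sigmoid(W @ x + b)                 # code h = f(x)
    dh = h * (1.0 - h)                     # element-wise sigmoid derivative
    row_norms_sq = np.sum(W ** 2, axis=1)  # ||W_i||^2 for each hidden unit
    return np.sum(dh ** 2 * row_norms_sq)

# Illustrative usage with random parameters (100 inputs, 50 code units).
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(50, 100))
b = np.zeros(50)
x = rng.normal(size=100)
penalty = cae_penalty(W, b, x)
```

In an actual CAE, this term would be added, with some weight, to the reconstruction error before computing gradients, so that the contraction is traded off against keeping the information needed to reconstruct the training examples.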

Figure 15.9: Average (over test examples) of the singular value spectrum of the Jacobian matrix for the encoder f learned by a regular auto-encoder (AE) versus a contractive auto-encoder (CAE). This illustrates how the contractive regularizer yields a smaller set of directions in input space (those corresponding to large singular values of the Jacobian) which provoke a response in the representation h, while the representation remains almost insensitive for most directions of change in the input.
Whate is
interesting
teresting
is that
penalty
representation
tation

to be inv
invarian
arian
ariantt in directions orthogonal to the manifold. This can be seen clearly
What is interesting
is that
this
penalty of
forces
more strongly the
represen
by comparing
the singular
v
value
alue
spectrum
the Jacobian
for dieren
dierent
t tation
autoto
beders,
invarian
t in
to the
can btoe concen
seen clearly
enco
as sho
in Figureorthogonal
15.9. We see
thatmanifold.
the CAE This
manages
encoders,
shown
wndirections
concentrate
trate
by comparing
singular
value spectrum
of dimensions
the Jacobian
for dieren
autothe
sensitivit
sensitivity
y the
of the
representation
in few
fewer
er
than a regular
(or tsparse)
encoders, as shown in Figure 15.9. We see that the CAE manages to concentrate
auto-enco
auto-encoder.
der. Figure 17.3 illustrates tangen
tangentt vectors obtained by a CAE on the
the sensitivit
y of
the representation
in the
fewer
dimensions
than
a regular
(or sparse)
MNIST
digits
dataset,
sho
showing
wing that
leading
tangen
tangent
t vectors
correspond
to
auto-enco
der. Figure
17.3
tangen
t vectors
obtained
by a CAE
the
small
deformations
suc
such
h asillustrates
translation.
More
impressiv
impressively
ely
ely,, Figure
15.10 on
sho
shows
ws
MNIST
thatcolor
the (R
leading
tangen
t 10
vectors
correspond
to
tangen
tangentt digits
vectorsdataset,
learnedsho
onwing
32
32
32
(RGB)
GB) CIF
CIFARARAR-10
images
by a CAE,
small deformations such as translation. More impressively, Figure 15.10 shows
491 (RGB) CIFAR-10 images by a CAE,
tangent vectors learned on 32 32 color
compared to the tangent vectors by a non-distributed representation learner (a mixture of local PCAs).
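As a rough illustration of how such a spectrum (and the tangent directions of Figure 15.10) could be computed, the sketch below takes the singular value decomposition of the encoder Jacobian at a single input point. It reuses the same hypothetical single-hidden-layer sigmoid encoder with random parameters as in the earlier sketch, so it only shows the mechanics and is not meant to reproduce the figures.

```python
import numpy as np

def encoder_jacobian(W, b, x):
    """Jacobian dh/dx of the sigmoid encoder h = sigmoid(W x + b)."""
    h = 1.0 / (1.0 + np.exp(-(W @ x + b)))
    return (h * (1.0 - h))[:, None] * W    # diag(h * (1 - h)) @ W

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(50, 100))
b = np.zeros(50)
x = rng.normal(size=100)

J = encoder_jacobian(W, b, x)
U, s, Vt = np.linalg.svd(J, full_matrices=False)
spectrum = s        # singular value spectrum at this x (cf. Figure 15.9)
tangents = Vt[:2]   # leading right-singular vectors: estimated tangent
                    # directions in input space (cf. Figure 15.10)
```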

Figure 15.10: Illustration of tangent vectors (bottom) of the manifold estimated by a contractive auto-encoder (CAE), at some input point (left, CIFAR-10 image of a dog). See also Fig. 17.3. Each image on the right corresponds to a tangent vector, either estimated by a local PCA (equivalent to a Gaussian mixture), top, or by a CAE (bottom). The tangent vectors are estimated by the leading singular vectors of the Jacobian matrix of the input-to-code mapping. Although both local PCA and CAE can capture local tangents that are different at different points, the local PCA does not have enough training data to meaningfully capture good tangent directions, whereas the CAE does (because it exploits parameter sharing across different locations that share a subset of active hidden units). The CAE tangent directions typically correspond to moving or changing parts of the object (such as the head or legs), which corresponds to plausible changes in the input image.
One practical issue with the CAE regularization criterion is that although it is cheap to compute in the case of a single hidden layer auto-encoder, it becomes much more expensive in the case of deeper auto-encoders. The strategy followed by Rifai et al. (2011a) is to separately pre-train each single-layer auto-encoder stacked to form a deeper auto-encoder. However, a deeper encoder could be advantageous in spite of the computational overhead, as argued by Schulz and Behnke (2012).

Another practical issue is that the contraction penalty on the encoder f could yield useless results if the decoder g would exactly compensate (e.g. by being scaled up by exactly the same amount as f is scaled down). In Rifai et al. (2011a), this is compensated by tying the weights of f and g, both being of the form of an affine transformation followed by a non-linearity (e.g. sigmoid), i.e., the weights of g are the transpose of the weights of f.
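The sketch below illustrates this tied-weight arrangement under the same assumptions as the earlier snippets (sigmoid encoder and decoder, illustrative shapes and names): the decoder reuses the transpose of the encoder weights, so it cannot simply rescale itself to undo a contraction of f.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode(W, b, x):
    return sigmoid(W @ x + b)          # f(x): code

def decode(W, c, h):
    # Tied weights: the decoder's weight matrix is W.T, so scaling the
    # encoder down cannot be compensated by scaling the decoder up.
    return sigmoid(W.T @ h + c)        # g(h): reconstruction

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(50, 100))
b, c = np.zeros(50), np.zeros(100)
x = rng.uniform(size=100)              # toy input in [0, 1]
x_hat = decode(W, c, encode(W, b, x))
reconstruction_error = np.sum((x - x_hat) ** 2)
```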