
Implicit generative models:

dual vs. primal approaches

Ilya Tolstikhin
MPI for Intelligent Systems, Tübingen, Germany
ilya@tue.mpg.de

Mini-Workshop, Moscow, 2017


LSUN, 128x128, Improved WGAN (Gulrajani et al., 2017)
Contents

1. Unsupervised generative modelling and implicit models


2. Distances on probability measures
3. GAN and f -GAN: minimizing f -divergences (dual formulation)
4. WGAN: minimizing the optimal transport (dual formulation)
5. VAE: minimizing the KL-divergence (primal formulation)
6. POT: minimizing the optimal transport (primal formulation)
7. Dual vs. primal: precision vs. recall? Unifying VAE and GAN

Most importantly:

WE NEED AN ADEQUATE WAY TO EVALUATE THE MODELS


The task:
I There exists an unknown distribution PX over the data space X
and we have an i.i.d. sample X1 , . . . , Xn from PX .
I Find a model distribution PG over X similar to PX .

We will work with latent variable models PG defined by 2 steps:


1. Sample a code Z ∼ PZ from the latent space Z;
2. Map Z to G(Z) ∈ X with a (random) transformation G : Z → X .
pG(x) := ∫_Z pG(x|z) pZ(z) dz

All techniques mentioned in this talk share two features:


I While PG has no analytical expression, it is easy to sample from;
I The objective allows for SGD training.
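As a concrete illustration, a minimal Python sketch of such a two-step sampler (the Gaussian prior PZ = N(0, I), the MLP generator G, and all sizes are illustrative assumptions, not part of the slides):

import torch
import torch.nn as nn

latent_dim, data_dim = 8, 784  # illustrative sizes

# G maps codes Z from the latent space to points G(Z) in the data space X.
G = nn.Sequential(
    nn.Linear(latent_dim, 128), nn.ReLU(),
    nn.Linear(128, data_dim),
)

def sample_pg(n):
    """Sample n points from the implicit model PG: Z ~ PZ, then X = G(Z)."""
    z = torch.randn(n, latent_dim)   # step 1: sample codes from the prior PZ = N(0, I)
    with torch.no_grad():
        return G(z)                  # step 2: push the codes through G

x_fake = sample_pg(16)  # easy to sample from, even though pG(x) has no analytical expression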
Contents

1. Unsupervised generative modelling and implicit models


2. Distances on probability measures
3. GAN and f -GAN: minimizing f -divergences (dual formulation)
4. WGAN: minimizing the optimal transport (dual formulation)
5. VAE: minimizing the KL-divergence (primal formulation)
6. POT: minimizing the optimal transport (primal formulation)
7. Dual vs. primal: precision vs. recall? Unifying VAE and GAN

Most importantly:

WE NEED AN ADEQUATE WAY TO EVALUATE THE MODELS


How to measure the similarity between PX and PG?
I f-divergences Take any convex f : (0, ∞) → R with f (1) = 0.
Df(P‖Q) := ∫_X f(p(x)/q(x)) q(x) dx

I Integral Probability Metrics


Take any class F of bounded real-valued functions on X .

γF(P, Q) := sup_{f ∈ F} EP[f(X)] − EQ[f(Y)]

I Optimal transport Take any cost c(x, y) : X × X → R+ .

Wc(P, Q) := inf_{Γ ∈ P(X∼P, Y∼Q)} E_(X,Y)∼Γ [c(X, Y)],

where P(X ∼ P, Y ∼ Q) is the set of all joint distributions of (X, Y)
with marginals P and Q respectively.
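To make the f-divergence definition concrete, a small numerical sketch for discrete distributions (the sum replaces the integral; the choice f(t) = t log t, which gives the KL divergence, is just one admissible example):

import numpy as np

def f_divergence(p, q, f):
    """Df(P||Q) = sum_x f(p(x)/q(x)) q(x) for discrete P, Q with full support."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(f(p / q) * q))

kl_f = lambda t: t * np.log(t)  # convex, f(1) = 0, recovers KL(P||Q)

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(f_divergence(p, q, kl_f))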
Contents

1. Unsupervised generative modelling and implicit models


2. Distances on probability measures
3. GAN and f -GAN: minimizing f -divergences (dual formulation)
4. WGAN: minimizing the optimal transport (dual formulation)
5. VAE: minimizing the KL-divergence (primal formulation)
6. POT: minimizing the optimal transport (primal formulation)
7. Dual vs. primal: precision vs. recall? Unifying VAE and GAN

Most importantly:

WE NEED AN ADEQUATE WAY TO EVALUATE THE MODELS


The goal: minimize Df(PX‖PG) with respect to PG
Variational (dual) representation of f-divergences:

Df(P‖Q) = sup_{T : X → dom(f*)} EX∼P[T(X)] − EY∼Q[f*(T(Y))]

where f*(x) := sup_u {x · u − f(u)} is the convex conjugate of f.

If G : Z → X is deterministic, solving inf_PG Df(PX‖PG) is equivalent to

inf_G sup_T EX∼PX[T(X)] − EZ∼PZ[f*(T(G(Z)))]        (*)

1. Estimate expectations with samples:

≈ inf_G sup_T (1/N) Σ_{i=1}^{N} T(Xi) − (1/M) Σ_{j=1}^{M} f*(T(G(Zj))).

2. Parametrize T = Tω and G = Gθ using any flexible functions
(e.g. deep nets) and run (minibatch) SGD on (*).
(Nowozin et al. f-GAN: Training generative neural samplers using variational divergence minimization, 2016.)
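A schematic rendering of steps 1-2 above (a sketch, not the authors' implementation: the MLPs, optimizers, and the particular choice f*(t) = −log(2 − e^t) with output activation gf keeping Tω inside dom(f*) are assumptions taken from the next slide):

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

latent_dim, data_dim = 8, 2
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
T = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt_g = torch.optim.SGD(G.parameters(), lr=1e-3)
opt_t = torch.optim.SGD(T.parameters(), lr=1e-3)

def g_f(v):
    # output activation mapping R into dom(f*) = (-inf, log 2)
    return math.log(2.0) - F.softplus(-v)

def f_star(t):
    # convex conjugate of the GAN-related f; finite for t < log 2
    return -torch.log(2.0 - torch.exp(t))

def fgan_step(x_real):
    # inner sup over T: ascend the sample-average version of (*)
    z = torch.randn(x_real.size(0), latent_dim)
    obj = g_f(T(x_real)).mean() - f_star(g_f(T(G(z).detach()))).mean()
    opt_t.zero_grad(); (-obj).backward(); opt_t.step()
    # outer inf over G: only the second term of (*) depends on G
    z = torch.randn(x_real.size(0), latent_dim)
    loss_g = -f_star(g_f(T(G(z)))).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()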
Original Generative Adversarial Networks
Variational (dual) representation of f -divergences:

Df(PX‖PG) = sup_{T : X → dom(f*)} EX∼PX[T(X)] − EZ∼PZ[f*(T(G(Z)))]

where f*(x) := sup_u {x · u − f(u)} is the convex conjugate of f.


1. Take f(x) = −(x + 1) log((x + 1)/2) + x log x and f*(t) = −log(2 − e^t).
The domain of f* is (−∞, log 2);
2. Take T = gf ◦ Tω, where gf(v) = log 2 − log(1 + e^{−v});
Up to an additive 2 log 2 term, inf_PG Df(PX‖PG) is equivalent to

inf_{Gθ} sup_{Tω} EX∼PX[log(1 / (1 + e^{−Tω(X)}))] + EZ∼PZ[log(1 − 1 / (1 + e^{−Tω(Gθ(Z))}))]

Compare to the original GAN objective (Goodfellow et al. Generative adversarial nets, 2014.)


inf_{Gθ} sup_{Tω} EX∼Pd[log Tω(X)] + EZ∼PZ[log(1 − Tω(Gθ(Z)))].
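For comparison, a minimal sketch of the original objective above, with Tω = σ(Dω) so that both log terms become binary cross-entropy losses (the net Dω, architectures, and learning rates are illustrative assumptions):

import torch
import torch.nn as nn
import torch.nn.functional as F

latent_dim, data_dim = 8, 2
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1))  # logits; T = sigmoid(D)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

def gan_step(x_real):
    ones, zeros = torch.ones(x_real.size(0), 1), torch.zeros(x_real.size(0), 1)
    # sup over T_omega: maximize E[log T(X)] + E[log(1 - T(G(Z)))]
    z = torch.randn(x_real.size(0), latent_dim)
    d_loss = F.binary_cross_entropy_with_logits(D(x_real), ones) \
           + F.binary_cross_entropy_with_logits(D(G(z).detach()), zeros)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # inf over G_theta: minimize E[log(1 - T(G(Z)))] (the minimax form from the slide;
    # in practice the non-saturating variant maximizing E[log T(G(Z))] is common)
    z = torch.randn(x_real.size(0), latent_dim)
    g_loss = -F.binary_cross_entropy_with_logits(D(G(z)), zeros)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()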
Theory vs. practice: do we know what GANs do?

Variational (dual) representation of f -divergences:

Df(PX‖PG) = sup_{T : X → dom(f*)} EX∼PX[T(X)] − EZ∼PZ[f*(T(G(Z)))]

where f*(x) := sup_u {x · u − f(u)} is the convex conjugate of f.



inf_{Gθ} sup_{Tω} EX∼Pd[log Tω(X)] + EZ∼PZ[log(1 − Tω(Gθ(Z)))].

GANs are not precisely solving inf_PG JS(PX‖PG), because:

1. GANs replace expectations with sample averages. Uniform laws of
large numbers may not apply, as our function classes are huge;
2. Instead of taking the supremum over all possible witness functions T,
GANs optimize over classes of DNNs;
3. In practice GANs never optimize Tω “to the end” because of various
computational/numerical reasons.
A possible criticism of f -divergences
I When PX and PG are supported on disjoint manifolds, f-divergences
often max out.
I This leads to numerical instabilities: no useful gradients for G.
I Consider PG′ and PG″ supported on manifolds M′ and M″.
Suppose d(M′, MX) < d(M″, MX), where MX is the true
manifold. f-divergences will often assign them the same value.

Possible solutions:
1. Smoothing: add noise to both PX and PG before comparing:

X ∼ PX, Y ∼ PG, ε ∼ Pε  →  X + ε, Y + ε.

2. Use other weaker divergences, including IPMs and the optimal


transport.

(Arjovsky, Chintala, Bottou, 2017) , (Arjovsky, Bottou, 2017) , (Roth et al., 2017)
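A tiny sketch of the smoothing trick (Gaussian noise and the bandwidth are assumptions; any noise distribution Pε works):

import numpy as np

def smooth(x, y, sigma=0.1):
    """Add noise eps ~ N(0, sigma^2 I) to samples of PX and PG before comparing them."""
    return x + sigma * np.random.randn(*x.shape), y + sigma * np.random.randn(*y.shape)

x_real = np.random.randn(100, 2)        # sample from PX
y_fake = np.random.randn(100, 2) + 3.0  # sample from PG, supported far from PX
x_s, y_s = smooth(x_real, y_fake)       # the smoothed distributions now overlap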
Minimizing MMD between PX and PG
I Take any reproducing kernel k : X × X → R. Let Bk be the unit ball
of the corresponding RKHS Hk.
I Maximum Mean Discrepancy is the following IPM:

γk(PX, PG) := sup_{T ∈ Bk} |EPX[T(X)] − EPG[T(Y)]|,        (MMD)

which is a metric if the kernel k is characteristic.

I This optimization problem has a closed-form analytical solution:

γk(PX, PG) = ‖EX∼PX[k(X, ·)] − EY∼PG[k(Y, ·)]‖_Hk

One can play the adversarial game using (MMD) instead of Df(PX‖PG):

I No need to train the discriminator T ;


I On the other hand, Bk is a rather restricted class;
I One can also train k adversarially, resulting in a stronger objective:
inf_{PG} max_k γk(PX, PG).

(Gretton et al., 2012) , (Li et al., 2015) , (Dziugaite, Ghahramani, 2015) , (Li et al., 2017)
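Because the supremum in (MMD) has a closed form, the squared MMD can be estimated directly from two samples; a sketch with an RBF kernel (the kernel choice and bandwidth are assumptions):

import numpy as np

def mmd2(x, y, gamma=1.0):
    """Biased estimate of squared MMD with the RBF kernel k(a, b) = exp(-gamma ||a - b||^2)."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()

x = np.random.randn(200, 2)        # sample from PX
y = np.random.randn(200, 2) + 1.0  # sample from PG (shifted)
print(mmd2(x, y))                  # grows as the two samples differ more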
Contents

1. Unsupervised generative modelling and implicit models


2. Distances on probability measures
3. GAN and f -GAN: minimizing f -divergences (dual formulation)
4. WGAN: minimizing the optimal transport (dual formulation)
5. VAE: minimizing the KL-divergence (primal formulation)
6. POT: minimizing the optimal transport (primal formulation)
7. Dual vs. primal: precision vs. recall? Unifying VAE and GAN

Most importantly:

WE NEED AN ADEQUATE WAY TO EVALUATE THE MODELS


Minimizing the 1-Wasserstein distance
The 1-Wasserstein distance is defined by

W1(P, Q) := inf_{Γ ∈ P(X∼P, Y∼Q)} E_(X,Y)∼Γ [d(X, Y)],

where P(X ∼ P, Y ∼ Q) is the set of all joint distributions of (X, Y) with
marginals P and Q respectively and (X, d) is a metric space.
Kantorovich-Rubinstein duality:

W1(P, Q) = sup_{T ∈ FL} |EP[T(X)] − EQ[T(Y)]|,        (KR)

where FL is the set of all bounded 1-Lipschitz functions on (X, d).


Wasserstein GAN (WGAN):
I In order to solve inf_PG W1(PX, PG), let's play the adversarial
training card on (KR): parametrize T = Tω and enforce the Lipschitz
constraint via weight clipping or a gradient penalty.
I Unfortunately, (KR) holds only for the 1-Wasserstein distance.
(Arjovsky, Chintala, Bottou, 2017) , (Gulrajani et al., 2017)
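A minimal WGAN sketch with weight clipping as the Lipschitz control (nets, learning rates, the clipping constant, and the number of critic steps are illustrative assumptions; a gradient penalty is the common alternative):

import torch
import torch.nn as nn

latent_dim, data_dim, clip = 8, 2, 0.01
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
T = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1))  # the critic
opt_g = torch.optim.RMSprop(G.parameters(), lr=5e-5)
opt_t = torch.optim.RMSprop(T.parameters(), lr=5e-5)

def wgan_step(x_real, n_critic=5):
    for _ in range(n_critic):
        z = torch.randn(x_real.size(0), latent_dim)
        # ascend the (KR) objective E[T(X)] - E[T(G(Z))] over the critic
        loss_t = -(T(x_real).mean() - T(G(z).detach()).mean())
        opt_t.zero_grad(); loss_t.backward(); opt_t.step()
        for p in T.parameters():          # crude 1-Lipschitz control via weight clipping
            p.data.clamp_(-clip, clip)
    z = torch.randn(x_real.size(0), latent_dim)
    loss_g = -T(G(z)).mean()              # the generator descends the same objective
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()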
Contents

1. Unsupervised generative modelling and implicit models


2. Distances on probability measures
3. GAN and f -GAN: minimizing f -divergences (dual formulation)
4. WGAN: minimizing the optimal transport (dual formulation)
5. VAE: minimizing the KL-divergence (primal formulation)
6. POT: minimizing the optimal transport (primal formulation)
7. Dual vs. primal: precision vs. recall? Unifying VAE and GAN

Most importantly:

WE NEED AN ADEQUATE WAY TO EVALUATE THE MODELS


VAE: Maximizing the marginal log-likelihood
inf_PG KL(PX‖PG)  ⇔  inf_PG −EPX[log pG(X)].

Variational upper bound: for any conditional distribution Q(Z|X)

−EPX[log pG(X)] = EPX[ KL(Q(Z|X), PZ) − EQ(Z|X)[log pG(X|Z)] ]
                  − EPX[ KL(Q(Z|X), PG(Z|X)) ]
                ≤ EPX[ KL(Q(Z|X), PZ) − EQ(Z|X)[log pG(X|Z)] ].

In particular, if Q is not restricted:

−EPX[log pG(X)] = inf_Q EPX[ KL(Q(Z|X), PZ) − EQ(Z|X)[log pG(X|Z)] ]
                ≤ inf_{Q ∈ Q} EPX[ KL(Q(Z|X), PZ) − EQ(Z|X)[log pG(X|Z)] ]

Variational Auto-Encoders use the upper bound and


I Latent variable models with any PG(X|Z), e.g. N(X; G(Z), σ²·I)
I Set PZ(Z) = N(Z; 0, I) and Q(Z|X) = N(Z; µ(X), Σ(X))
I Parametrize G = Gθ, µ, and Σ with deep nets. Run SGD.
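A minimal sketch of the resulting training objective (a Gaussian decoder with fixed σ², a diagonal Gaussian encoder, and the reparametrization trick; sizes and σ² are illustrative assumptions):

import torch
import torch.nn as nn

latent_dim, data_dim, sigma2 = 8, 784, 1.0
enc = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(), nn.Linear(128, 2 * latent_dim))
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))

def vae_loss(x):
    mu, log_var = enc(x).chunk(2, dim=-1)                     # Q(Z|X) = N(mu(X), diag(exp(log_var)))
    z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)  # reparametrization trick
    # -E_Q[log pG(X|Z)] for the Gaussian decoder N(X; G(Z), sigma^2 I), up to constants
    rec = ((x - G(z)) ** 2).sum(-1) / (2.0 * sigma2)
    # KL(Q(Z|X), PZ) with PZ = N(0, I), in closed form
    kl = 0.5 * (torch.exp(log_var) + mu ** 2 - 1.0 - log_var).sum(-1)
    return (rec + kl).mean()                                  # minimize the upper bound with SGD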
AVB: reducing the gap in the upper bound
Variational upper bound:

−EPX[log pG(X)] ≤ inf_{Q ∈ Q} EPX[ KL(Q(Z|X), PZ) − EQ(Z|X)[log pG(X|Z)] ]

Adversarial Variational Bayes reduces the variational gap by


I Allowing for flexible encoders Qe(Z|X), defined implicitly by
random variables e(X, ε), where ε ∼ Pε;
I Replacing the KL divergence in the objective with an adversarial
approximation (any of the ones discussed above);
I Parametrizing e with a deep net and running SGD.
Downsides of VAE and AVB:
I Literature reports blurry samples. This is caused by the combination
of the KL objective and the Gaussian decoder.
I Importantly, PG (X|Z) is trained only for encoded training points,
i.e. for Z ∼ Q(Z|X) and X ∼ PX . But we sample from Z ∼ PZ .
(Kingma, Welling, 2014) , (Mescheder et al., 2017) , (Bousquet et al., 2017)
Unregularized Auto-Encoders

Variational upper bound:

−EPX[log pG(X)] ≤ inf_{Q ∈ Q} EPX[ KL(Q(Z|X), PZ) − EQ(Z|X)[log pG(X|Z)] ]

I The KL term in the upper bound may be viewed as a regularizer;


I Dropping it results in classical auto-encoders, where the
encoder-decoder pair tries to reconstruct all training images;
I In this case training images X often end up being mapped to
different spots chaotically scattered in the Z space;
I As a result, Z captures no useful representations. Sampling is hard.
Contents

1. Unsupervised generative modelling and implicit models


2. Distances on probability measures
3. GAN and f -GAN: minimizing f -divergences (dual formulation)
4. WGAN: minimizing the optimal transport (dual formulation)
5. VAE: minimizing the KL-divergence (primal formulation)
6. POT: minimizing the optimal transport (primal formulation)
7. Dual vs. primal: precision vs. recall? Unifying VAE and GAN

Most importantly:

WE NEED AN ADEQUATE WAY TO EVALUATE THE MODELS


Minimizing the optimal transport
Optimal transport for a cost function c(x, y) : X × X → R+ is

Wc(PX, PG) := inf_{Γ ∈ P(X∼PX, Y∼PG)} E_(X,Y)∼Γ [c(X, Y)].

If PG(Y|Z = z) = δ_G(z) for all z ∈ Z, where G : Z → X, we have

Wc(PX, PG) = inf_{Q : QZ = PZ} EPX[ EQ(Z|X)[c(X, G(Z))] ],

where QZ is the marginal distribution of Z when X ∼ PX , Z ∼ Q(Z|X).



Relaxing the constraint

Wc(PX, PG) = inf_{Q : QZ = PZ} EPX[ EQ(Z|X)[c(X, G(Z))] ]

Penalized Optimal Transport replaces the constraint with a penalty:

POT(PX, PG) := inf_Q EPX[ EQ(Z|X)[c(X, G(Z))] ] + λ · D(QZ, PZ)

and uses adversarial training in the Z space to approximate D.


I For the 2-Wasserstein distance, c(X, Y) = ‖X − Y‖₂², POT recovers
Adversarial Auto-Encoders (Makhzani et al., 2016);
I For the 1-Wasserstein distance, c(X, Y) = ‖X − Y‖₂, POT and
WGAN solve the same problem from the primal and dual
forms respectively;
I Importantly, unlike VAE, POT does not force Q(Z|X = x) to
intersect for different x, which is known to lead to blurriness.
(Bousquet et al., 2017)
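A sketch of the penalized objective (with the squared Euclidean cost, a deterministic encoder, and an MMD penalty standing in for D; the adversarial latent discriminator of AAE is the other natural choice; all names and sizes are illustrative):

import torch
import torch.nn as nn

latent_dim, data_dim, lam = 8, 784, 10.0
Q = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))  # encoder
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))  # decoder

def rbf(a, b, gamma=1.0):
    return torch.exp(-gamma * ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1))

def pot_loss(x):
    z_q = Q(x)                                 # Z ~ Q(Z|X), here a point mass per x
    z_p = torch.randn_like(z_q)                # Z ~ PZ
    rec = ((x - G(z_q)) ** 2).sum(-1).mean()   # E[c(X, G(Z))] with c = squared Euclidean cost
    # D(QZ, PZ) approximated by a (biased) squared-MMD estimate between the two Z samples
    penalty = rbf(z_q, z_q).mean() + rbf(z_p, z_p).mean() - 2.0 * rbf(z_q, z_p).mean()
    return rec + lam * penalty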
Contents

1. Unsupervised generative modelling and implicit models


2. Distances on probability measures
3. GAN and f -GAN: minimizing f -divergences (dual formulation)
4. WGAN: minimizing the optimal transport (dual formulation)
5. VAE: minimizing the KL-divergence (primal formulation)
6. POT: minimizing the optimal transport (primal formulation)
7. Dual vs. primal: precision vs. recall? Unifying VAE and GAN

Most importantly:

WE NEED AN ADEQUATE WAY TO EVALUATE THE MODELS


I GANs approach the problem from a dual perspective.
I They are known to produce very sharp-looking images.

max_G EZ∼PZ[T*(G(Z))]

I But they are notoriously hard to train, unstable (although many would
disagree), and sometimes lead to mode collapse.
I GANs come without an encoder.
(Radford et al., 2015) aka DCGAN, 32x32 ImageNet
(Salimans et al., 2016) aka Improved GAN, 128x128 ImageNet
(Gulrajani et al., 2017) aka Improved WGAN, 32x32 CIFAR-10
(Li et al., 2017) aka MMD-GAN, 32x32 CIFAR-10
(Roth et al., 2017) aka Stabilized GAN, 32x32 CIFAR-10
(Radford et al., 2015) aka DCGAN, 64x64 LSUN
(Radford et al., 2015) aka DCGAN, CelebA
(Metz et al., 2017) aka Unrolled GAN, CIFAR-10
I VAEs approach the problem from the primal perspective.
I They enjoy very stable training and often lead to diverse samples.

max_G EX∼PX[ EZ∼Q(Z|X)[c(X, G(Z))] ]

I But the samples look blurry.

I VAEs come with encoders.

Various papers are trying to combine the stability and recall of VAEs with
the precision of GANs:
I Choose an adversarially trained cost function c;
I Combine AE costs with the GAN criteria;
I ...
(Mescheder et al., 2017) aka AVB, CelebA
VAE trained on CIFAR-10, Z of 20 dim.
VAE trained on CIFAR-10, Z of 20 dim.
(Bousquet et al., 2017) aka POT, CIFAR-10, same architecture
(Bousquet et al., 2017) aka POT, CIFAR-10, test reconstruction
Contents

1. Unsupervised generative modelling and implicit models


2. Distances on probability measures
3. GAN and f -GAN: minimizing f -divergences (dual formulation)
4. WGAN: minimizing the optimal transport (dual formulation)
5. VAE: minimizing the KL-divergence (primal formulation)
6. POT: minimizing the optimal transport (primal formulation)
7. Dual vs. primal: precision vs. recall? Unifying VAE and GAN

Most importantly:

WE NEED AN ADEQUATE WAY TO EVALUATE THE MODELS


Literature

1. Gretton et al. A kernel two-sample test, 2012.


2. Kingma, Welling. Auto-encoding variational Bayes, 2014.
3. Nowozin, Cseke, Tomioka. f-GAN: Training generative neural samplers using variational divergence
minimization, 2016.
4. Goodfellow et al. Generative adversarial nets, 2014.
5. Arjovsky, Chintala, Bottou. Wasserstein GAN, 2017.
6. Arjovsky, Bottou. Towards Principled Methods for Training Generative Adversarial Networks, 2017.
7. Gulrajani et al. Improved Training of Wasserstein GANs, 2017.
8. Li et al. Generative Moment Matching Networks, 2015.
9. Li et al. MMD GAN: Towards Deeper Understanding of Moment Matching Network, 2017.
10. Dziugaite, Roy, Ghahramani. Training generative neural networks via maximum mean discrepancy
optimization, 2015.
11. Makhzani et al. Adversarial autoencoders, 2016.
12. Mescheder, Nowozin, Geiger. Adversarial variational Bayes: Unifying variational autoencoders and
generative adversarial networks, 2017.
13. Roth et al. Stabilizing Training of Generative Adversarial Networks through Regularization, 2017.
14. Bousquet et al. From optimal transport to generative modeling: the VEGAN cookbook, 2017.
