# Variational Autoencoders¶

**Variational Auto-Encoders** (VAE) [VAEKW13] is one of the
most widely used deep generative models.
In this tutorial, we show how to implement VAE in ZhuSuan step by step.
The full script is at
examples/variational_autoencoders/vae.py.

The generative process of a VAE for modeling binarized MNIST data is as follows:

This generative process is a stereotype for deep generative models, which starts with a latent representation (\(z\)) sampled from a simple distribution (such as standard Normal). Then the samples are forwarded through a deep neural network (\(f_{NN}\)) to capture the complex generative process of high dimensional observations such as images. Finally, some noise is added to the output to get a tractable likelihood for the model. For binarized MNIST, the observation noise is chosen to be Bernoulli, with its parameters output by the neural network.

## Build the model¶

In ZhuSuan, a model is constructed using
`BayesianNet`

, which describes a directed
graphical model, i.e., Bayesian networks.
The suggested practice is to wrap model construction into a function (
we shall see the meanings of these arguments soon):

```
import zhusuan as zs
def build_gen(x_dim, z_dim, n, n_particles=1):
bn = zs.BayesianNet()
```

Following the generative process, first we need a standard Normal
distribution to generate the latent representations (\(z\)).
As presented in our graphical model, the data is generated in batches with
batch size `n`

, and for each data, the latent representation is of
dimension `z_dim`

.
So we add a stochastic node by `bn.normal`

to generate samples of shape
`[n, z_dim]`

:

```
# z ~ N(z|0, I)
z_mean = tf.zeros([n, z_dim])
z = bn.normal("z", z_mean, std=1., group_ndims=1, n_samples=n_particles)
```

The method `bn.normal`

is a helper function that creates a
`Normal`

distribution and adds a
stochastic node that follows this distribution to the
`BayesianNet`

instance.
The returned `z`

is a `StochasticTensor`

, which
is Tensor-like and can be mixed with Tensors and fed into almost any
Tensorflow primitives.

Note

To learn more about `Distribution`

and
`BayesianNet`

. Please refer to
Basic Concepts in ZhuSuan.

The shape of `z_mean`

is `[n, z_dim]`

, which means that
we have `[n, z_dim]`

independent inputs fed into the univariate
`Normal`

distribution.
Because the input parameters are allowed to
broadcast
to match each other’s shape, the standard deviation `std`

is simply set to
1.
Thus the shape of samples and probabilities evaluated at this node should
be of shape `[n, z_dim]`

. However, what we want in modeling MNIST data, is a
batch of `[n]`

independent events, with each one producing samples of `z`

that is of shape `[z_dim]`

, which is the dimension of latent representations.
And the probabilities in every single event in the batch should be evaluated
together, so the shape of local probabilities should be `[n]`

instead of
`[n, z_dim]`

.
In ZhuSuan, the way to achieve this is by setting group_ndims`, as we do
in the above model definition code.
To help understand this, several other examples can be found in Distribution
tutorial.
`n_samples`

is the number of samples to generate.
It is None by default, in which case a single sample is generated
without adding a new dimension.

Then we build a neural network of two fully-connected layers with \(z\) as the input, which is supposed to learn the complex transformation that generates images from their latent representations:

```
# x_logits = f_NN(z)
h = tf.layers.dense(z, 500, activation=tf.nn.relu)
h = tf.layers.dense(h, 500, activation=tf.nn.relu)
x_logits = tf.layers.dense(h, x_dim)
```

Next, we add an observation distribution (noise) that follows the Bernoulli distribution to get a tractable likelihood when evaluating the probability of an image:

```
# x ~ Bernoulli(x|sigmoid(x_logits))
bn.bernoulli("x", x_logits, group_ndims=1)
```

Note

The `Bernoulli`

distribution
accepts log-odds of probabilities instead of probabilities.
This is designed for numeric stability reasons. Similar tricks are used in
`Categorical`

, which accepts
log-probabilities instead of probabilities.

Putting together, the code for constructing a VAE is:

```
def build_gen(x_dim, z_dim, n, n_particles=1):
bn = zs.BayesianNet()
z_mean = tf.zeros([n, z_dim])
z = bn.normal("z", z_mean, std=1., group_ndims=1, n_samples=n_particles)
h = tf.layers.dense(z, 500, activation=tf.nn.relu)
h = tf.layers.dense(h, 500, activation=tf.nn.relu)
x_logits = tf.layers.dense(h, x_dim)
bn.bernoulli("x", x_logits, group_ndims=1)
```

## Reuse the model¶

Unlike common deep learning models (MLP, CNN, etc.), which is for supervised
tasks, a key difficulty in designing programing primitives for generative
models is their inner reusability.
This is because in a probabilistic graphical model, a stochastic node can
have two kinds of states, **observed or latent**.
Consider the above case, if `z`

is a tensor sampled from the prior, how
about when you meet the condition that `z`

is observed?
In common practice of tensorflow programming, one has to build another
computation graph from scratch and reuse the Variables (weights here).
If there are many stochastic nodes in the model, this process will be really
painful.

We provide a solution for this. To observe any stochastic nodes,
pass a dictionary mapping of `(name, Tensor)`

pairs when constructing
`BayesianNet`

.
This will assign observed values to corresponding `StochasticTensor`

s.
For example, to observe a batch of images `x_batch`

, write:

```
bn = zs.BayesianNet(observed={"x": x_batch})
```

Note

The observation passed must have the same type and shape as the StochasticTensor.

However, we usually need to pass different configurations of observations to
the same `BayesianNet`

more than once.
To achieve this, ZhuSuan provides a new class called
`MetaBayesianNet`

to represent the meta version of `BayesianNet`

which can repeatedly produce `BayesianNet`

objects by accepting different observations.
The recommended way to construct a
`MetaBayesianNet`

is by wrapping the
function with a `meta_bayesian_net()`

decorator:

```
@zs.meta_bayesian_net(scope="gen")
def build_gen(x_dim, z_dim, n, n_particles=1):
...
return bn
model = build_gen(x_dim, z_dim, n, n_particles)
```

which transforms the function into returning a
`MetaBayesianNet`

instance:

```
>>> print(model)
<zhusuan.framework.meta_bn.MetaBayesianNet object at ...
```

so that we can observe stochastic nodes in this way:

```
# no observation
bn1 = model.observe()
# observe x
bn2 = model.observe(x=x_batch)
```

Each time the function is called, a different observation assignment is used
to construct a `BayesianNet`

instance.
One question you may have in mind is that if there are Tensorflow
Variables
created in the above function, will them be reused across these `bn`

s?
The answer is no by default, but you can enable this by switching on the
reuse_variables option in the decorator:

```
@zs.meta_bayesian_net(scope="gen", reuse_variables=True)
def build_gen(x_dim, z_dim, n, n_particles=1):
...
return bn
model = build_gen(x_dim, z_dim, n, n_particles)
```

Then `bn1`

and `bn2`

will share the same set of Tensorflow Variables.

Note

This only shares Tensorflow Variables across different
`BayesianNet`

instances generated by the same
`MetaBayesianNet`

through the
`observe()`

method.
Creating multiple `MetaBayesianNet`

objects will recreate the tensorflow Variables, for example, in

```
m = build_gen(x_dim, z_dim, n, n_particles)
bn = m.observe()
m_new = build_gen(x_dim, z_dim, n, n_particles)
bn_new = m_new.observe()
```

`bn`

and `bn_new`

will use a different set of Tensorflow
Variables.

Since reusing Tensorflow Variables in repeated function calls is a typical
need, we provide another decorator
`reuse_variables()`

for the more general cases.
Any function decorated by `reuse_variables()`

will automatically create Tensorflow Variables the first time they are called
and reuse them thereafter.

## Inference and learning¶

Having built the model, the next step is to learn it from binarized MNIST images. We conduct Maximum Likelihood learning, that is, we are going to maximize the log likelihood of data in our model:

where \(\theta\) is the model parameter.

Note

In this variational autoencoder, the model parameter is the network
weights, in other words, it’s the Tensorflow Variables created in the
`fully_connected`

layers.

However, the model we defined has not only the observation (\(x\)) but also latent representation (\(z\)). This makes it hard for us to compute \(p_{\theta}(x)\), which we call the marginal likelihood of \(x\), because we only know the joint likelihood of the model:

while computing the marginal likelihood requires an integral over latent representation, which is generally intractable:

The intractable integral problem is a fundamental challenge in learning latent variable models like VAEs. Fortunately, the machine learning society has developed many approximate methods to address it. One of them is Variational Inference. As the intuition is very simple, we briefly introduce it below.

Because directly optimizing \(\log p_{\theta}(x)\) is infeasible, we choose to optimize a lower bound of it. The lower bound is constructed as

where \(q_{\phi}(z|x)\) is a user-specified distribution of \(z\)
(called **variational posterior**) that is chosen to match the true posterior
\(p_{\theta}(z|x)\).
The lower bound is equal to the marginal log likelihood if and only if
\(q_{\phi}(z|x) = p_{\theta}(z|x)\), when the
Kullback–Leibler divergence
between them (\(\mathrm{KL}(q_{\phi}(z|x)\|p_{\theta}(z|x))\)) is zero.

Note

In Bayesian Statistics, the process represented by the Bayes’ rule

is called
Bayesian Inference,
where \(p(z)\) is called the **prior**, \(p(x|z)\) is the
conditional likelihood, \(p(x)\) is the marginal likelihood or
**evidence**, and \(p(z|x)\) is known as the **posterior**.

This lower bound is usually called Evidence Lower Bound (ELBO). Note that the only probabilities we need to evaluate in it is the joint likelihood and the probability of the variational posterior.

In variational autoencoder, the variational posterior (\(q_{\phi}(z|x)\)) is also parameterized by a neural network (\(g\)), which accepts input \(x\), and outputs the mean and variance of a Normal distribution:

In ZhuSuan, the variational posterior can also be defined as a
`BayesianNet`

. The code for above definition is:

```
@zs.reuse_variables(scope="q_net")
def build_q_net(x, z_dim, n_z_per_x):
bn = zs.BayesianNet()
h = tf.layers.dense(tf.cast(x, tf.float32), 500, activation=tf.nn.relu)
h = tf.layers.dense(h, 500, activation=tf.nn.relu)
z_mean = tf.layers.dense(h, z_dim)
z_logstd = tf.layers.dense(h, z_dim)
bn.normal("z", z_mean, logstd=z_logstd, group_ndims=1, n_samples=n_z_per_x)
return bn
variational = build_q_net(x, z_dim, n_particles)
```

Having both `model`

and `variational`

, we can build the lower bound as:

```
lower_bound = zs.variational.elbo(
model, {"x": x}, variational=variational, axis=0)
```

The returned `lower_bound`

is an
`EvidenceLowerBoundObjective`

instance, which is also Tensor-like and can be evaluated directly. However,
optimizing this lower bound objective needs special care.
The easiest way is to do
stochastic gradient descent
(SGD), which is very common in deep learning literature.
However, the gradient computation here involves taking derivatives of an
expectation, which needs Monte Carlo estimation.
This often induces large variance if not properly handled.

Note

Directly using auto-differentiation to compute the gradients of
`EvidenceLowerBoundObjective`

often gives you the wrong results.
This is because auto-differentiation is not designed to handle
expectations.

Many solutions have been proposed to estimate the gradient of some
type of variational lower bound (ELBO or others) with relatively low variance.
To make this more automatic and easier to handle, ZhuSuan has wrapped these
gradient estimators all into methods of the corresponding
variational objective (e.g., the
`EvidenceLowerBoundObjective`

).
These functions don’t return gradient estimates but a more convenient
surrogate cost.
Applying SGD on this surrogate cost with
respect to parameters is equivalent to optimizing the
corresponding variational lower bounds using the well-developed low-variance
estimator.

Here we are using the **Stochastic Gradient Variational Bayes** (SGVB)
estimator from the original paper of variational autoencoders
[VAEKW13].
This estimator takes benefits of a clever reparameterization trick to
greatly reduce the variance when estimating the gradients of ELBO.
In ZhuSuan, one can use this estimator by calling the method
`sgvb()`

of the class:~zhusuan.variational.exclusive_kl.EvidenceLowerBoundObjective
instance.
The code for this part is:

```
# the surrogate cost for optimization
cost = tf.reduce_mean(lower_bound.sgvb())
# the lower bound value to print for monitoring convergence
lower_bound = tf.reduce_mean(lower_bound)
```

Note

For readers who are interested, we provide a detailed explanation of the
`sgvb()`

estimator used here, though this is not required for you to use
ZhuSuan’s variational functionality.

The key of SGVB estimator is a reparameterization trick, i.e., they reparameterize the random variable \(z\sim q_{\phi}(z|x) = \mathrm{N}(z|\mu_z(x;\phi), \sigma^2_z(x;\phi))\), as

In this way, the expectation can be rewritten with respect to \(\epsilon\):

Thus the gradients with variational parameters \(\phi\) can be directly moved into the expectation, enabling an unbiased low-variance Monte Carlo estimator:

where \(\epsilon_i \sim \mathrm{N}(0, I)\)

Now that we have had the cost, the next step is to do the stochastic gradient descent. Tensorflow provides many advanced optimizers that improves the plain SGD, among which Adam [VAEKB14] is probably the most popular one in deep learning society. Here we are going to use Tensorflow’s Adam optimizer to do the learning:

```
optimizer = tf.train.AdamOptimizer(0.001)
infer_op = optimizer.minimize(cost)
```

## Generate images¶

What we’ve done above is to define and learn the model. To see how it
performs, we would like to let it generate some images in the learning process.
To improve the visual quality of generation, we remove the observation noise,
i.e., the `Bernoulli`

distribution.
We do this by using the direct output of the neural network (`x_logits`

):

```
@zs.meta_bayesian_net(scope="gen", reuse_variables=True)
def build_gen(x_dim, z_dim, n, n_particles=1):
bn = zs.BayesianNet()
...
x_logits = tf.layers.dense(h, x_dim)
...
```

and adding a sigmoid function to it to get a “mean” image.
After that, we add a deterministic node in `bn`

to keep track of
the Tensor `x_mean`

:

```
@zs.meta_bayesian_net(scope="gen", reuse_variables=True)
def build_gen(x_dim, z_dim, n, n_particles=1):
bn = zs.BayesianNet()
...
x_logits = tf.layers.dense(h, x_dim)
bn.deterministic("x_mean", tf.sigmoid(x_logits))
...
```

so that we can easily access it from a
`BayesianNet`

instance.
For random generations, no observation about the model is made, so we
construct the corresponding `BayesianNet`

by:

```
bn_gen = model.observe()
```

Then the generated samples can be fetched from the `x_mean`

node of
`bn_gen`

:

```
x_gen = tf.reshape(bn_gen["x_mean"], [-1, 28, 28, 1])
```

## Run gradient descent¶

Now, everything is good before a run. So we could just open the Tensorflow session, run the training loop, print statistics, and write generated images to disk:

```
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
for epoch in range(1, epochs + 1):
time_epoch = -time.time()
np.random.shuffle(x_train)
lbs = []
for t in range(iters):
x_batch = x_train[t * batch_size:(t + 1) * batch_size]
_, lb = sess.run([infer_op, lower_bound],
feed_dict={x_input: x_batch,
n_particles: 1,
n: batch_size})
lbs.append(lb)
time_epoch += time.time()
print("Epoch {} ({:.1f}s): Lower bound = {}".format(
epoch, time_epoch, np.mean(lbs)))
if epoch % save_freq == 0:
images = sess.run(x_gen, feed_dict={n: 100, n_particles: 1})
name = os.path.join(result_path,
"vae.epoch.{}.png".format(epoch))
save_image_collections(images, name)
```

Below is a sample image of random generations from the model. Keep watching them and have fun :)

References

[VAEKW13] | (1, 2) Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013. |

[VAEKB14] | Diederik Kingma and Jimmy Ba. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. |