#
14 Mar 2018
Density Estimation: Variational Autoencoders

One of the most popular models for density estimation is the Variational Autoencoder. It is a model that I have spent a considerable amount of time working with, so I want to give it an especially in-depth treatment. In this post, we shall construct the VAE ground-up from statistical principles. We will go over the key concepts that underpin the VAE model: maximum likelihood estimation, latent variable models, variational inference, and parameter optimization. Only at the very end will we introduce neural networks as a means of parameterizing conditional distributions.

### Maximum Likelihood Estimation

Maximum likelihood estimation finds the distribution $p \in \P_x$ that best approximates $\hat{p}$ under the Kullback-Leibler divergence. The selected $p_\text{MLE}$ is sometimes interpreted as the projection of $\hat{p}$ onto $\P_x$.

We are interested in estimating a true distribution $p^*(x)$. To do so, we first introduce a set of set of candidate distributions $\P_x$. Each $p \in \P_x$ is distribution that defines a density over $x$. Next, although we do not have access to $p^*(x)$, we do have access to finite samples from it. We denote uniform sampling from this finite dataset as $\hat{p}(x)$. The goal of maximum likelihood estimation is to find the best $p \in \P_x$ to approximate $\hat{p}(x)$ as measured by the Kullback-Leibler divergence,

Note that minimizing the KL divergence is equivalent to maximizing the log likelihood.

### Latent Variable Models

We call $\P_x$ the hypothesis space of generative models. Given the maximum likelihood problem formulation, we wish for our hypothesis space $\P_x$ to exhibit at least two desirable traits

- Any hypothesis $p \in \P_x$ must have a tractable-to-compute density, since our objective requires that we compute $p(x)$.
- $\P_x$ should be expressive. Ideally, we want $p^*$ to be contained within the distribution family $\P_x$. Since we don’t know what $p^*$ is a priori, it is desirable to make $\P_x$ sufficiently flexible—but not so flexible that it includes $\hat{p}$ (i.e. memorization).

Unfortunately, the first and second conditions are typically at odds with each other. Generally speaking, very few distributions have densities that are computationally tractable. This puts a severe constraint on the flexibility of $\P_x$, restricting it to only the set of “simple” distributions.

A Gaussian Mixture family $\P^{(2)}_x$ is more expressive than a Gaussian family $\P^{(1)}_x$ and is thus better at approximating the empirical distribution. Formally, since $\P^{(1)}_x \subseteq \P^{(2)}_x$, it follows that $\KL{\hat{p}}{p^{(1)}_\text{MLE}} \ge \KL{\hat{p}}{p^{(2)}_\text{MLE}}$, where $p^{(1)}_\text{MLE}$ is the projection onto the Gaussian family, and $p^{(2)}_\text{MLE}$ is the the projection onto the Gaussian Mixture family.

One of the many motivations of a latent variable model is to introduce latent variables so that the joint distribution can be defined as a product of simple distributions. Take, for example, the well-known distribution family: a mixture of $k$ Gaussians. We introduce the discrete latent variable $z$ which can take on values in $\set{1, \ldots, k}$ according to the probability vector $\pi$ and a conditional Gaussian distribution parameterized by $\set{\mu_z, \Sigma_z}_{z = 1}^k$. Note that the joint distribution $p(x, z)$ has a tractable density computed as the product of a categorical and a Gaussian density

and for which the density over $x$ is

Despite the decomposition into two fairly simple distributions, the density $p(x)$ is achieved through the marginalization of the latent variable $z$, yielding a distribution family (the Gaussian mixture family) over $x$ that is strictly more expressive than the Gaussian family. As such, one can think of latent variable models as leveraging auxiliary variables to express more complex distributions.

A graphical representation of our latent variable model, where $x$ is our observed variable and $z$ is our latent variable. Training this model via maximum likelihood entails identifying the best distribution $p(z) \in \P_z$ and conditional distribution $p(x \giv z) \in \P_{x \giv z}$ such that marginal $p(x)$ minimizes the KL-divergence to the empirical $\hat{p}(x)$.

Given the flexibility of latent variable models, we shall instead train a latent variable model. To do so, we construct $\P_{x,z} = \P_z \times \P_{x \giv z}$, where $\P_z$ is a set of simple distributions over $z$ and $\P_{x\giv z}$ is a set of simple conditional distributions over $x$ (given $z$). As such, each $(p_z, p_{x \giv z}) \in \P_{x,z}$ defines a density jointly over the observed variable $x$ and the latent variable $z$. For notational simplicity, we shall in the remaining text refer to $p \in \P$ as a hypothesis that defines the joint distribution, and denote $p(z) = p_z(z)$, $p(x \giv z) = p_{x \giv z}(x \giv z)$.

### Intractable Marginal and the Variational Principle

Although we have deliberately chosen $\P$ such that the joint distribution has a tractable density, note that the computation of $p(x)$ requires marginalization of the latent variable $z$. If $z$ is a continuous variable, then marginalization takes the form

This marginalization is potentially very difficult to compute. To get around this issue, we rely on the variational principle: the technique of converting a difficult computation problem into an optimization problem. Fortunately it is easy to construct a lower bound on the likelihood $p(x)$ by introducing a family of conditional distributions $\Q$. The family $\Q$ is called the variational family and its elements $q(z \giv x) \in \Q$ define a distribution over $z$ conditioned on $x$. Using Jensen’s inequality, note that for any $q \in Q$,

where $(a)$ denotes the application of Jensen’s Inequality. This lower bound is called the variational lower bound. Rather than computing $\ln p(x)$ directly by marginalization, we instead optimize the variational lower bound over $q \in \Q$. To provide an intuition for what optimizing over $q \in \Q$ does, note that there is an alternative formulation of this lower bound. By applying to chain rule $p(x, z) = p(x)p(z \giv x)$, we see that optimizing the variational lower bound is equivalent to

This shows that optimizing the lower bound is equivalent to finding $q \in \Q$ that best approximates the posterior $p(z \giv x)$. This exposes the interesting connection between marginalization and posterior inference, and is also why maximizing the variational lower bound is called performing variational inference.

The variational objective suggests that rather than maximizing the log likelihood directly, we instead maximize its variational lower bound, which entails estimating the posterior $p(z \giv x)$ with the variational approximation $q(z \giv x)$.

Rather than maximizing the log likelihood directly, we now solve the optimization problem

Note that in doing so, we not only learn to approximate $p^*(x)$, but we also simultaneously figure out how to best approximate the posterior of our latent variable model $p(z)p(x \giv z)$.

### Parametric Models for Distributions

We consider the case of unconditional and conditional parametric distributions, where the distribution $p(x \scolon \gamma)$ is completely defined by parameters $\gamma$. To represent a conditional distribution $p(x \giv z)$, simply construct the distribution $p(x \scolon \gamma(z))$ where $\gamma$ is a function of $z$.

So far, we’ve been talking about distributions and conditional distributions in the abstract. To implement variational autoencoders in practice, we will instantiate these abstractions with actual distributions. For the sake of implicity, we shall restriction $\P_z$ to contain only the unit Gaussian distribution

Note that a diagonal Gaussian distribution is always defined by its mean and standard deviation parameters $\gamma = (\mu, \sigma)$. This provides a means of constructing a conditional diagonal Gaussian distribution: simply make $\mu$ and $\sigma$ functions of $z$. In other words, we can build the observation model $p(x \giv z)$ as

Now, $\mu_\theta$ and $\sigma_\theta$ with function parameters $\theta$. If $\mu_\theta$ and $\sigma_\theta$ are neural networks, then $\theta$ denotes the weight of the neural network. Similarly, we can build the inference model as

whose neural network parameters are $\phi$. Since we parameterized the conditional distributions of our latent variable model with neural networks, we call our model a deep latent variable model. Note that the conditional form of *any* parametric distribution can be constructed by converting the distribution parameters $\gamma$ into functions $\gamma(z)$. As such, one does not need to use a conditional Gaussian distribution for their observation or inference model. This post, however, shall only consider conditional Gaussian models.

### Reparameterization Trick

To train our deep latent variable model, our original optimization problem

takes on a more concrete form

To optimize this objection, we shall—in the tradition of deep learning models—rely on stochastic gradient descent. This presents a little bit of a challenge, however, since we will need to compute the gradient

Since the expectation depends on $\phi$, we cannot push the gradient through the expectation. To resolve this issue, we need a way to sample from $q_\phi(z \giv x)$ but, at the same time, not have the expectation depend on $q_\phi(z \giv x)$. To do so, we rely on the trick of transformation-based sampling, where we sample $\eps$ from a fixed distribution but come up with a transformation function $T$ such that sampling $T(\eps \mathbin{;} \phi, x)$ is equivalent to sampling from $q_\phi(z \giv x)$. In other words, we need to find an appropriate choice of $p(\eps)$ and $T(\eps)$ such that

where the dependency of $T$ on $(\phi, x)$ is omitted for notational simplicity. Fortunately, when $q_\phi(z \giv x)$ is a conditional diagonal Gaussian model, a simple choice of $p(\eps)$ and $T$ exists

For non-Gaussian conditional models, one will need to work out an appropriate choice of $(p(\eps), T)$ from scratch.

The computational graph for the training of our deep latent variable model exposes its connection to autoencoders.

The transformation-based sampling also exposes the connection between autoencoders and deep latent variable models optimized via the variational lower bound. The forward pass of the computational graph during training includes the path

- Compute $\mu_\phi(x)$ and $\sigma_\phi(x)$.
- Sample $z = \mu_\phi(x) + \eps \odot \sigma_\phi(x)$, where $z$ can be interpreted as the stochastic embedding.
- Compute $\mu_\theta(z)$ and $\sigma_\theta(z)$, where $\mu_\theta(x)$ can be interpreted as $\hat{x}$.

This motivates our final reformulation of the variational objective. By applying to chain rule $p_\theta(x, z) = p(z)p_\theta(x \giv z)$, we see that the variational lower bound is equivalent to

The first term $\Expect_{q_\phi(z \giv x)} \ln p_\theta(x \giv z)$ can be interpreted as the autoencoding (reconstruction) objective. The second term regularizes $q_\phi(z \giv x)$ to be close to the prior $p(z)$. This introduces a tug-of-war effect: to get better reconstructions, the embedding means $\mu_\phi(x)$ are pushed far away from each other and embedding standard deviations $\sigma_\phi(x)$ are pulled toward zero; to get smaller $\KL{q_\phi(z \giv x)}{p(z)}$, the embedding means are pulled toward zero and the embedding standard deviations are pulled toward one.

An idealized representation of four samples being stochastically encoded into the space of $z$. REC denotes the reconstruction cost, while KL denotes the KL-divergence to the prior. When there is little overlap over the stochastic embeddings, the reconstruction cost is low but the KL-divergence is high. When the KL-divergence is low, there is too much overlap between the embeddings and thus reconstruction cost is high.

This autoencoding reformulation of our deep latent variable model objective is what inspires the model’s name as the Variational Autoencoder. We stress, however, that there is value in the Jensen’s Inequality formulation

which informs how to construct hierarhical deep latent variable models, as well as the posterior divergence formulation

which exposes the connection between the variational distribution and the generative model’s true posterior.

### Implementation: Weird Tricks and Their Theoretical Underpinnings

There is an abundance of variational autoencoder implementations on the web, so I won’t belabor the details of implementing a VAE too much. If you want to see my implementation of a standard VAE in its entirely, simply check out this Github repository. For the sake of completeness, I will highlight certain portions of the code below to indicate how it relates to the theory of VAE that we have developed thus far. Most importantly, I will identify certain tricks that VAE practitioners use to ensure the numerical stability of their model and discuss the theoretical significance of these tricks.

First and foremost, let’s construct the neural networks that parameterize our conditional distributions $p_\theta(x \giv z)$ and $q_\phi(z \giv x)$, which I refer to respectively as the generator and encoder. Here, I use a pretty straightforward parameterization: the encoder uses a series of convolutional layers, and the generator a series of transposed convolutional layers.

#### Lower bounding the Inference Model Variance

The encoder then uses two separate heads to propose the mean and variance parameters of the conditional distribution. Here, I compute the variance directly, using a softplus activation. A popular alternative is to compute the standard deviation, using the exponential activation. Notice that I add a fudge factor `1e-5`

to the variance. This fudge factor is there to impose a lower bound on the variance achievable by $q_\phi(z \giv x)$ and is introduced for numerical stability: if $q_\phi(z \giv x)$ is allowed to have arbitrarily small variance, the density tends toward infinity. Not imposing a lower bound on the variance of the inference model is a most common cause of `nan`

’s in VAE implementations.

#### Lower bounding the Generative Model Variance

The next thing to note is that our generator only returns the mean parameter $\mu_\theta(z)$ but not the variance parameter $\sigma^2_\theta(z)$ of our observation model $p_\theta(x \giv z) = \Normal(x \giv \mu_\theta(z), \diag(\sigma^2_\theta(z)))$. This is because we will simply set to the variance globally to $1$. One is is rightfully concerned and should wonder why we deliberately chose *not* to learn the variance parameter of our obseravtion. Setting the variance is $1$ is especially incredible; if our dataset consists of images with normalized pixel intensities in $[-1, 1]$, using a variance of $1$ adds a remarkable amount of noise when sampling from $p_\theta(x \giv z)$. The dirty secret when using Gaussian observation model VAEs is that no one actually samples from $p_\theta(x \giv z)$, but instead directly report $\mu_\theta(z)$ as “samples” from the generator. This is by far the dirtiest and most prevalent trick that practitioners employ when using Gaussian observation model VAEs. There is, to some extent, good theoretical justification for this practice. First, one should note that for the same reason as mentioned in the previous paragraph, there is value in lower bounding the variance of the observation model. This is especially critical for the observation model as the variance directly governs the *weighting* of the $\ell_2$-reconstruction loss, since

If your model has sufficient capacity such that there exists $(\theta, \phi)$ for which $\mu_\theta(z)$ provides a sufficiently good reconstruction of $x$, then $-\ln \sigma^2_\theta(z)$ will encourage the variance to go close to zero first before $\frac{1}{\sigma^2_\theta(z)}$ catches up. Consequently, the variational objective will shrink the variance of the Gaussian observation model toward zero aggressively. A more general (and mathematically-inclined) interpretation is that the density of finite samples under the Lebesgue measure is undefined; thus, if there exists $\theta$ such that the image of $\mu_\theta$ contains the empirical samples, then it is possible to construct a sequence of $(\theta_n)$ for which the log likelihood $\ln p_\theta(x)$ tends toward infinity. For carefully-chosen $\mu_\phi$ (under some assumptions about the expressivity of the inference model), the variational lower bound can tend toward infinity as well. Motivated by this consideration, the de facto practice when using Gaussian observation models is to set the decoder variance as a global hyperparameter. And since the variance of factorized Gaussian models interpreted as noising the samples, practitioners often choose to omit actual sampling of $p_\theta(x \giv z)$ when training VAE on image datasets—all so their generated images will look prettier.

#### Reparameterized Sampling

Finally, note the use of the function `gaussian_sample`

. This is basically a thin wrapper around `tf.random_normal`

, which itself implements the reparameterization trick by transforming a unit Gaussian according to the provided mean and standard deviation parameters.

#### Computing and Optimizing the Variational Objective

Now that we’ve constructed our encoder and generator, implementing the VAE computational graph is pretty simple. We first take a sample `T.trg_x`

and push it through the encoder. The encoder returns the stochastic embedding $z$ as well as the parameters of the underlying distribution from which $z$ was drawn. Next we feed $z$ to the decoder to compute $\hat{x}$.

Using the Jensen’s Inequality formulation of the (negative) variational objective, it is easy to see that we simply need to minimize the quantity

which translates directly to sampling a stochastic embedding and then computing

Since the variance for the observation model is fixed to 1, the reconstruction loss is simply the $\ell_2$-loss $0.5 * \| x - \mu_\theta(z) \|^2$ . Once set up, everything is trained with stochastic gradient descent or any of its variants. In my case, I’ve chosen to use the Adam optimizer.

#### Results

With everything set up, we can now test our VAE on a dataset. In lieu of MNIST, I thought it’d be more interesting to test VAE on the somewhat more challenging SVHN dataset. After about $50000$ mini-batch updates, the resulting loss curves are as follows.

Loss curves when training a VAE on the SVHN dataset.

And the sampled images (more specifically, samples of $\mu_\theta(z)$) are starting to look reasonable.

Random samples from our VAE.

### Conclusion

This post was designed to provide an in-depth look at the theory and practice of variational autoencoders. Many of the points I’ve discussed here are points that are also touched on by Carl Doersch in his Variational Autoencoder Tutorial, although we differ somewhat in our choice of presentation and emphasis. In particular, this post takes considerable care in separating the theory of learning latent variable models from the choice of parameterizing such models with deep neural networks. In doing so, I hope the reader will gain a better appreciation for the underlying mathematics of maximum likelihood and variational inference, as well as the elegance with which the Variational Autoencoder marries these ideas with deep learning techniques.

For those who are interested in knowing more about variational autoencoders, I am currently working on an even more research-oriented tutorial on variational autoencoders. There are some concepts that I’ve glossed over in this post which I think deserves extra careful treatment. I’ll try to have the tutorial ready sometime after the NIPS deadline c:

End of post