 Rui Shu

# 12 Feb 2018 Density Estimation: Maximum Likelihood

I’m starting a series of blog posts on the general topic of density estimation with deep generative models. The goal of this series is to cover several popular techniques in deep generative modeling, in particular models that can both evaluate and sample from a learned distribution. This includes variational autoencoders, normalizing flows, and autoregressive models. All of these are powerful techniques for modeling densities, and they can be combined to yield even more expressive models. Since I’ve never actually implemented normalizing flows and autoregressive models before, this will also be a chance for me to become more familiar with them. To start the series off, this post goes over the general idea of density estimation and gives a high-level picture of how density models are learned.

### Why Do Density Estimation?

Probability is all around us. Whether you’re guessing a coin flip, a die roll, whether your paper will get accepted, or whether your blog is going to be successful (ahem), you’re inherently asking, “what is the probability of each possible outcome?” And if you’re a statistician or machine learning practitioner, you naturally follow up the question with, “can I model the probability?”

Anyone who has ever played with logistic or linear regression has already seen density estimation at work. In logistic regression: given an input $x$, what’s the probability $x$ belongs to the positive class? In linear regression: given an input $x$, what’s the probability the target response is $y$ (under the Gaussian noise assumption)? Both are simple cases of conditional density estimation.
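To make the two conditional densities concrete, here is a minimal sketch in plain Python. The function names, the scalar parameters `w` and `b`, and the noise scale `sigma` are all assumptions made for illustration; real models would have vector inputs and learned parameters.

```python
import math

def logistic_density(y, x, w, b):
    """p(y | x) for y in {0, 1} under a (hypothetical) 1-D logistic regression."""
    p_pos = 1.0 / (1.0 + math.exp(-(w * x + b)))
    return p_pos if y == 1 else 1.0 - p_pos

def linear_gaussian_density(y, x, w, b, sigma):
    """p(y | x) under linear regression with Gaussian noise of std sigma."""
    mean = w * x + b
    z = (y - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# The two class probabilities sum to one for any input x.
p1 = logistic_density(1, x=0.5, w=2.0, b=-1.0)
p0 = logistic_density(0, x=0.5, w=2.0, b=-1.0)

# The Gaussian density is highest at the regression prediction w * x + b.
at_mean = linear_gaussian_density(2.0, x=1.0, w=1.5, b=0.5, sigma=1.0)
off_mean = linear_gaussian_density(3.0, x=1.0, w=1.5, b=0.5, sigma=1.0)
```

In both cases the model outputs a full distribution over the target given $x$, not just a point prediction, which is what makes them density estimators.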

There are also plenty of cases of unconditional density estimation. This includes questions like: what is the probability of a particular sentence being generated by a human? Or what’s the probability that a specific image is generated? Suffice it to say, density estimation is everywhere and we want to arm ourselves with the artillery for modeling interesting probability distributions.

### A General Framework of Density Estimation

A density or generative model $g$ describes a procedure for generating samples from some distribution, and we denote a sample from that procedure with the random variable $X_g$. Note that this random variable $X_g$ is indexed by $g$ (our generative model), meaning that if the model $g$ changes, then the procedure for generating the samples $x$ changes, and thus the distribution of $X_g$ changes as well. More concisely, the random variable $X_g$ is a function of $g$.
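As a toy illustration of "$X_g$ is a function of $g$", the sketch below treats a generative model as nothing more than a sampling procedure. The coin-flip family and the parameter name `p_heads` are assumptions chosen for simplicity, not anything specific to deep generative models.

```python
import random

def make_generator(p_heads):
    """Return a sampling procedure g; p_heads is the hypothetical parameter indexing g."""
    def g(rng):
        # One draw of X_g: 1 with probability p_heads, else 0.
        return 1 if rng.random() < p_heads else 0
    return g

def empirical_mean(g, n, seed=0):
    """Average of n samples from g, using a seeded RNG for reproducibility."""
    rng = random.Random(seed)
    return sum(g(rng) for _ in range(n)) / n

fair = make_generator(0.5)
biased = make_generator(0.9)

# Changing g changes the distribution of the samples it produces.
mean_fair = empirical_mean(fair, 10000)
mean_biased = empirical_mean(biased, 10000)
```

Swapping `fair` for `biased` changes nothing about how we call the model, only the distribution of what comes out, which is exactly the sense in which $X_g$ depends on $g$.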

Consider a model family $\mathcal{G}$, which describes a set of possible generative models. The goal of training a generative model is to select $g \in \mathcal{G}$ such that the corresponding $X_g$ simulates some data distribution $X^*$. Such an $X^*$ could, for example, be the data distribution of all possible images, and we wish to build a generative model to approximate $X^*$. Formally, we wish to solve the optimization problem

$$
\min_{g \in \mathcal{G}} d(X^*, X_g),
$$

where $d(\cdot, \cdot)$ is some measure of divergence (for intuition’s sake, think of this as a “distance”) between $X^*$ and $X_g$. A common choice of divergence is the Kullback-Leibler divergence $\mathrm{KL}$,

$$
\min_{g \in \mathcal{G}} \mathrm{KL}(X^* \,\|\, X_g),
$$

where the KL-divergence is defined as

$$
\mathrm{KL}(X^* \,\|\, X_g) = \mathbb{E}_{x \sim p^*}\left[\log p^*(x) - \log p_g(x)\right],
$$

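and where $p^*(x) = \Pr(X^* = x)$ and $p_g(x) = \Pr(X_g = x)$. Expanding the KL-divergence gives $\mathbb{E}_{x \sim p^*} \log p^*(x) - \mathbb{E}_{x \sim p^*} \log p_g(x)$, and the first term does not depend on $g$. It follows that minimizing the KL-divergence is exactly equivalent to maximizing the expected log-likelihood (also known as the negative cross-entropy)

$$
\max_{g \in \mathcal{G}} \mathbb{E}_{x \sim p^*} \log p_g(x).
$$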
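The equivalence between minimizing KL and maximizing expected log-likelihood is easy to check numerically. The sketch below is a toy setup under stated assumptions: the three-outcome distribution `p_star` and the one-parameter family `model(theta)` are made up for illustration, and the "optimization" is just a grid search.

```python
import math

# Hypothetical true distribution p* over three outcomes (illustrative only).
p_star = [0.2, 0.5, 0.3]

def model(theta):
    """Toy one-parameter family p_g: mass theta on the middle outcome, rest split evenly."""
    return [(1 - theta) / 2, theta, (1 - theta) / 2]

def kl(p, q):
    """KL(p || q) = E_p[log p - log q] for discrete distributions."""
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))

def expected_log_lik(p, q):
    """E_p[log q], i.e. the negative cross-entropy between p and q."""
    return sum(pi * math.log(qi) for pi, qi in zip(p, q))

# Grid over the model family (theta strictly inside (0, 1) to avoid log(0)).
thetas = [i / 100 for i in range(1, 100)]
best_kl = min(thetas, key=lambda t: kl(p_star, model(t)))
best_ll = max(thetas, key=lambda t: expected_log_lik(p_star, model(t)))

# E_p*[log p*] is the term that is constant in theta.
entropy_term = expected_log_lik(p_star, p_star)
```

Both searches land on the same `theta`, and for every `theta` on the grid the KL equals `entropy_term` minus the expected log-likelihood. Note that `p_star` is not in this family, so the best achievable KL is strictly positive; maximum likelihood still picks the member of the family closest to `p_star` in the KL sense.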

Broadly speaking, variational autoencoders, normalizing flows, and autoregressive models are different philosophical schemes for how $g$ is constructed. VAEs use a latent variable model. Normalizing flows push a simple base distribution through an invertible transformation and use the change-of-variables formula (in the spirit of the inverse transform trick). And autoregressive models use the chain rule of probability. Since these are orthogonal approaches, nothing’s stopping you from using all three philosophies together; this has been done before, leading to models like VLAE, PixelVAE, Inverse Autoregressive Flow VAE, etc. Over the coming <god knows how long>, I will go over the mathematical principles underlying these building blocks as well as how to code them up.
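To give a rough feel for the three constructions before the detailed posts, here is a deliberately tiny sketch of each in plain Python. Everything here is an assumption for illustration: the two-component mixture stands in for a latent variable model, an affine map with the change-of-variables formula stands in for a flow, and a two-variable binary table stands in for an autoregressive model. None of this is how the deep versions are actually parameterized.

```python
import math

def normal_pdf(x, mean=0.0, std=1.0):
    """Density of a 1-D Gaussian."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

# Latent variable model: p(x) = sum_z p(z) p(x | z), with a binary latent z.
def latent_density(x):
    prior = {0: 0.3, 1: 0.7}      # p(z), hypothetical values
    means = {0: -2.0, 1: 2.0}     # mean of p(x | z), hypothetical values
    return sum(prior[z] * normal_pdf(x, means[z]) for z in prior)

# Flow: x = a * z + b with z ~ N(0, 1). Change of variables gives
# p(x) = p_z((x - b) / a) / |a|.
def flow_density(x, a=2.0, b=1.0):
    return normal_pdf((x - b) / a) / abs(a)

# Autoregressive: p(x1, x2) = p(x1) * p(x2 | x1) for binary x1, x2.
def autoregressive_prob(x1, x2):
    p_x1 = 0.6 if x1 == 1 else 0.4
    p_x2_is_1 = 0.9 if x1 == 1 else 0.2
    p_x2 = p_x2_is_1 if x2 == 1 else 1.0 - p_x2_is_1
    return p_x1 * p_x2

# The chain-rule factorization defines a valid joint: it sums to one.
total = sum(autoregressive_prob(a, b) for a in (0, 1) for b in (0, 1))
```

The affine flow is just a Gaussian with mean `b` and standard deviation `|a|` in disguise, so `flow_density(x)` agrees with `normal_pdf(x, 1.0, 2.0)`. In all three cases the recipe yields an explicit way to compute (or, for deep latent variable models, bound) $p_g(x)$, which is what the maximum likelihood objective needs.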
