 Rui Shu

# 19 May 2018 Change of Variables: A Precursor to Normalizing Flow

Normalizing flow is a cool technique for density estimation that is fun to learn about and (tricky to) wrap your mind around. Those familiar with the term may have come across it within the context of models such as NICE, Real NVP, Normalizing Flow VAE, Inverse Autoregressive Flow, and Masked Autoregressive Flow. At its core, normalizing flow relies on the change of variables technique. As preparation for an up-coming blog on normalizing flow, the goal of this blog post is to provide readers with a gentle introduction to the change of variables concept in the univariate case.

### Inverse Transform Sampling

Consider the following scenario: we have access to a random number generator $U$ that samples uniformly within the interval $(0, 1)$. However, we are interested in sampling from a logistic distribution. What should we do? It turns out that it is possible to construct some function $f$ such that the procedure

is equivalent to sampling from a logistic distribution. While there are many ways to construct such a function, the inverse transform sampling framework suggests to choose $f$ as the inverse cumulative distribution function (inverse CDF) associated with the logistic distribution. To see why this works, let $X$ denote the logistic distribution with corresponding CDF

The implicit distribution $f(U)$ has the corresponding CDF

We wish to show that $f(U)$ and $X$ share the same CDF. To do so, we note that $F$ maps samples in $\X$ to the unit interval $(0, 1)$. By interpreting $F(x)$ as a Bernoulli random variable parameter, we note that

Great! Now we just have to figure out what the logistic distribution’s CDF is. After a quick peek on Wikipedia, we find out that the logistic distribution CDF is simply the sigmoid function

What a conveniently simple expression!1 By taking its inverse, we find that

Armed with this knowledge, we can now set $f = F^{-1}$. To sanity check that our analysis is correct, we can take samples from $f(U)$ and approximate the distribution of $f(U)$ with a KDE plot. Sure enough, we have successfully constructed a way to sample from the logistic distribution by transforming a uniform distribuion! And perhaps more importantly, we now have a sense that this procedure of introducing some function $f$ that transforms an existing simple distribution $U$ is possibly a powerful method for accessing more complicated distributions.

### A Taste of Complexity

By using a complicated transformation function $f$, we should be able sample from a complicated distribution. To demonstrate this, let’s define $f(x)$ as

Hopefully that looks complicated enough. If you’re wondering how I got this function, let’s just say WolframAlpha did most of the heavy lifting.2 Assuming that you’re willing to accept that I pulled this function out of thin air, let’s check out what the function looks like, as well as the KDE of $f(U)$. I look the liberty of plotting out $f(u)$, $F(x)$, as well as the KDE of $f(U)$. There are several things to note. First, $f$ is a strictly monotonic—and thus invertible—function (this will come in handy next section). Second, $f(U)$ is a bimodal distribution. Third, one can think of this bimodal distribution as being achieved by stretching and squeezing certain parts of the original uniform distribution. This is most apparent in the first two plots, where the uniformly-spaced lines in the space of $u$ get distorted when we map to the space of $x$. Regions of $u$ that get squeezed together correspond to regions of high density in $x$. Inversely, regions of $u$ that get stretched correspond to regions of low density in $x$. Generally, the intuition seems to be that the density within any region of $x$ is inversely proportional to how much the corresponding region of $u$ gets stretched.

### Computing the Density

Given the aforementioned complicated $f$, sampling $x \sim f(U)$ is pretty straightforward. However, we have yet to discuss how to actually compute the probability density $p_X(x)$ (here, $p_X$ denotes the PDF associated with the $f(U)$). Naively, one might suggest to simply compute the inverse $u = f^{-1}(x) = F(x)$ and evaluate $p_U(u)$. However, it shouldn’t take long to realize that something is amiss with this strategy: for any $u \in (0, 1)$, the function $p_U(u)$ is a constant that evaluates to $1$! It turns out that this strategy fails to consider the stretching/squeezing of the space as we map between $x$ and $u$. To account for this, the change of variables theorem (under certain assumptions3) concludes that we should scale $p_U(u)$ by (the inverse of) how much the neighborhood of $u$ gets stretched when we map to $x$. This is summarized by the following equation

Note that this equation exactly captures our intuition from the previous section that the density of $x$ is inversely proportional to the degree that the neighborhood of $u$ gets stretched by $f$. Furthermore, since $u = f^{-1}(x)$, it follows that

There are thus two ways to compute $p_X(x)$, depending on whether we have access to the derivative of $f$ or the derivative of $f^{-1}$. Since we have direct access to $f$, we can leverage the magic of automatic differentiation to evaluate the derivative of $f$ fairly easily. We’ll therefore make use of the former change-of-variables equation and run the following TensorFlow code. Looks like the change-of-variables approach works! By sanity-checking the PDF given by the change-of-variables and the PDF approximated given by KDE, we see that the two PDFs are more or less the same. ### Extension to the Multivariate Regime

Hopefully by now, we’re all slightly more comfortable with the use of differentiable, invertible functions to transform simple distributions into more complicated distributions. To end this post, there are two loose ends that are worth tying up. First, let’s see the change-of-variables theorem to its full multivariate glory.

To get an intuition for what the change-of-variables is doing in the multivariate setting, note that the transformation that $f$ applies to a sufficiently small neighborhood of $u$ is just an affine transformation (i.e. we’re taking the first-order approximation of $f$) whose linear component is the matrix multiplication of $u$ by the Jacobian of $f$. Now, we simply need to remember that the change-in-volume induced by a linear transformation via the matrix $T$ is exactly the determinant of $T$. Thus, the determinant of the Jacobian of $f$ describes how much a small parallelotope/hypercube containing $u$ gets stretched by $f$.

The final thing to note is how on earth do we come up with arbitrary, invertible functions? If the goal is to do density estimation using the change-of-variables method, we will need to somehow come up with an entire family of invertible and differentiable functions and furthermore perform optimization over that family of functions. The construction of invertible function families that are amenable to gradient-based optimization will be the main topic of discussion in the future post on normalizing flows. So stay tuned!

1. If you wondered why the thought experiment didn’t involve a Gaussian distribution, this is why.

2. I computed the CDF of a mixture of two logistic distributions, used WolframAlpha to compute the inverse analytically, and then randomly perturbed the numbers a little so that it becomes some weird bimodal distribution that isn’t quite a mixture of two logistic distributions—just to make things more fun.

3. Continuously differentiable, invertible $f$ with non-zero derivative.

End of post 