It was recently brought to my attention that Ronald J. Williams of REINFORCE fame *Learning Representations by Back-Propagating Errors* paper. But I don't have a Nature subscription.

My introduction to deep learning began with the variational autoencoder. Given a latent variable model \(p(x, z)\) with latent \(z\) and observed variable \(x\), the goal is to indirectly maximize log-likelihood \(\ln p(x)\) by instead introducing a variational object \(q(z)\) and jointly optimizing the Evidence Lower Bound (ELBO),

\[\newcommand{\E}{\mathbb{E}} \newcommand{\N}{\mathcal{N}} \begin{align*} \max_{p, q} \E_{q(z)} \ln \frac{p(x, z)}{q(z)}. \end{align*}\]Okay, that’s quite a mouthful. If you’re not familiar with the ELBO, don’t worry. The statistical interpretation of this expression is not of importance to this post. Instead, given the above expression, let’s take a moment to decide how we should go about solving this optimization problem.

To any self-respecting deep learning researcher (deep learner?), the solution is obvious: gradient descent. And indeed, if you compute the gradient of the expression with respect to \(p\), it is quite straightforward,

As a budding deep learner, I could followed along this much. But I was rather stumped by the gradient with respect to \(q\),

\[\begin{align*} \nabla_q~\E_{q(z)} \ln \frac{p(x, z)}{q(z)} = \E_{q(z)} \nabla_q~(\ldots\mathrm{??}\ldots), \end{align*}\]because I wasn’t sure how to push the gradient through the expectation when they are both with respect to the same object. In other words: how does one backpropagate through random variables?

Forunately, Kingma’s original variational autoencoder paper

for any function \(f\). Now that the expectation no longer depends on \(q\), we can safely push \(\nabla_q\) in. This is the famous reparameterization trick, and—some would argue—the main contribution of Kingma’s paper.

However, what if your random variable were discrete? You could use a continuous relaxation

Of course, the actual derivation provided in the appendix is slightly more involved, due to the fact that \(f = \ln p / q\) itself depended on \(q\) and is thus subject to the product rule, but we shall overlook this small detail for now (I promise we’ll get back to it). If we take a step back, there are a handful of things to marvel about this identity.

- The identity is agnostic to whether \(z\) is continuous or discrete.
- The identity did not depend on the differentiability of \(f\) with respect to \(z\).
- The key insight to the derivation lies in the “log-derivative trick”, where we expand the expectation, convert \(\nabla_q~q(z)\) into \(q(z) \nabla_q \ln q(z)\), and then reconstruct the expectation.
- It exposes that the gradient direction is basically a form of random search, where weight each direction \(\nabla_q \ln q(z)\) by the probability of \(q(z)\) choosing that direction (thanks to the expectation with respect to \(q\)), as well as the reward assigned by the function \(f\).

I didn’t know it at the time, but this was basically the REINFORCE algorithm. I was not all that interested in reinforcement learning back then and shied away from sequential decision making because the math looked very scary. Looking back, the relation to sequential decision making was pretty obvious: given a 2-step

In other words, just follow trajectories based on your current policy, and then reinforce (haha) the trajectories that lead to better rewards. This, by the way, is at the heart of the Policy Gradient Theorem

During my early research career, I was fortunate enough to have collaborated on several research works with Mohammad Ghavamzadeh, a prominent research scientist in reinforcement learning. None of his reinforcement learning expertise ever rubbed off on me. But I remember him saying “baselines” a lot and thinking to myself, “what the heck is a baseline?”

It wasn’t until sometime later while shuffling through the posters at NeurIPS 2017,

so long as \(\E[C] = 0\). The key is whether you’re clever enough to find such a control variate \(C\) that is anti-correlated with \(X\) so as to cancel out the noise of \(X - \mu\).

In the context of REINFORCE, the question becomes what is a control variate to reduce the variance of the “expectand”

This is pretty remarkable. By using the log-derivative trick,

We have now reduced the original challenge of finding a control variate to determining a choice of \(b\) so that the expectand has reduced variance. As it turns out, a particularly effective choice of \(b\) is

\[\begin{align*} b = \E_{q(z)} f(z). \end{align*}\]Let’s pause and appreciate this for a bit. This choice of \(b\) means that, rather than weighting the gradient by the reward \(f(z)\) directly, we instead weight it by the “advantage” (how much the reward beats the average reward achieved by \(q(z)\)). Somewhat sadistically, this means that rather than relying solely on positive reinforcements, we punish trajectories that underperform \(b\) and celebrate trajectories that out-perform \(b\). It’s almost like \(b\) is a baseline that we’re comparing the trajectories against. Ah. That’s why it’s called a baseline.

At this point, I must confess that I do not have a strong mathematical understanding of why the average reward is an effective choice of baseline. Daniel Seita’s post seems pretty promising and supposedly explains why this choice of baseline is optimal. But I haven’t read it yet, largely out of laziness, and partly out of mild skepticism. My rather small degree of skepticism comes from my encounter with the NVIL paper

Though this approach to fitting the baseline [to the mean] does not result in the maximal variance reduction, it is simpler and in our experience works as well as the optimal approach of

.

I’m sure there’s a good reason why Seita and the authors of NVIL came to very different conclusions about the optimality of the mean-reward baseline. I just can’t be bothered to figure it out at the moment. So I encourage you, dear reader, to perform your own investigation.

I remarked some sections ago that we did not do full justice to the application of REINFORCE to the gradient of the ELBO with respect to \(q\). Let’s revisit that again, this time in full detail, because I have a pleasant surprise for you. Recall that because \(f = \ln p / q\) includes \(q\) itself, we must apply the product rule and contend with two terms,

\[\begin{align*} &\nabla_q~\E_{q(z)} \ln \frac{p(x, z)}{q(z)} \\ &= \sum_z \nabla_q~\bigg( q(z) \ln \frac{p(x, z)}{q(z)} \bigg) \\ &= \E_{q(z)} \bigg( \ln \frac{p(x, z)}{q(z)} \bigg) \nabla_q \ln q(z) - \E_{q(z)} \nabla_q \ln q(z). \end{align*}\]The first term is the REINFORCE term that we know and love, while the second term is, shall we say, a by-product. But this by-product should look very familiar to us now: it is the expectation of the score function, which we know to be zero! I find that quite pretty.

I’m not much of a reinforcement learning researcher and, to this day, have never fully read a single reinforcement learning paper in my life. I think it is a testament to the significance and versatility of the REINFORCE algorithm that it somehow managed to weasel its way into the life of a variational autoencoder-obsessed researcher such as myself.

Looking back, I also somewhat lament the mathematical presentation in the reinforcement learning literature; I wish it had been written in a way that was more accessible to my brain. It took me a very long time to appreciate the elegance, inter-connectedness, and simplicity of the concepts mentioned in this post.