It was recently brought to my attention that Ronald J. Williams of REINFORCE fame *Learning Representations by Back-Propagating Errors* paper. But I don't have a Nature subscription.

My introduction to deep learning began with the variational autoencoder. Given a latent variable model \(p(x, z)\) with latent \(z\) and observed variable \(x\), the goal is to indirectly maximize log-likelihood \(\ln p(x)\) by instead introducing a variational object \(q(z)\) and jointly optimizing the Evidence Lower Bound (ELBO), \(\newcommand{\E}{\mathbb{E}} \newcommand{\N}{\mathcal{N}} \begin{align} \max_{p, q} \E_{q(z)} \ln \frac{p(x, z)}{q(z)}. \end{align}\) Okay, that’s quite a mouthful. If you’re not familiar with the ELBO, don’t worry. The statistical interpretation of this expression is not of importance to this post. Instead, given the above expression, let’s take a moment to decide how we should go about solving this optimization problem.

To any self-respecting deep learning researcher (deep learner?), the solution is obvious: gradient descent. And indeed, if you compute the gradient of the expression with respect to \(p\), it is quite straightforward,

Forunately, Kingma’s original variational autoencoder paper

However, what if your random variable were discrete? You could use a continuous relaxation

- The identity is agnostic to whether \(z\) is continuous or discrete.
- The identity did not depend on the differentiability of \(f\) with respect to \(z\).
- The key insight to the derivation lies in the “log-derivative trick”, where we expand the expectation, convert \(\nabla_q~q(z)\) into \(q(z) \nabla_q \ln q(z)\), and then reconstruct the expectation.
- It exposes that the gradient direction is basically a form of random search, where weight each direction \(\nabla_q \ln q(z)\) by the probability of \(q(z)\) choosing that direction (thanks to the expectation with respect to \(q\)), as well as the reward assigned by the function \(f\).

I didn’t know it at the time, but this was basically the REINFORCE algorithm. I was not all that interested in reinforcement learning back then and shied away from sequential decision making because the math looked very scary. Looking back, the relation to sequential decision making was pretty obvious: given a 2-step

During my early research career, I was fortunate enough to have collaborated on several research works with Mohammad Ghavamzadeh, a prominent research scientist in reinforcement learning. None of his reinforcement learning expertise ever rubbed off on me. But I remember him saying “baselines” a lot and thinking to myself, “what the heck is a baseline?”

It wasn’t until sometime later while shuffling through the posters at NeurIPS 2017,

In the context of REINFORCE, the question becomes what is a control variate to reduce the variance of the “expectand”

At this point, I must confess that I do not have a strong mathematical understanding of why the average reward is an effective choice of baseline. Daniel Seita’s post seems pretty promising and supposedly explains why this choice of baseline is optimal. But I haven’t read it yet, largely out of laziness, and partly out of mild skepticism. My rather small degree of skepticism comes from my encounter with the NVIL paper

Though this approach to fitting the baseline [to the mean] does not result in the maximal variance reduction, it is simpler and in our experience works as well as the optimal approach of

.

I’m sure there’s a good reason why Seita and the authors of NVIL came to very different conclusions about the optimality of the mean-reward baseline. I just can’t be bothered to figure it out at the moment. So I encourage you, dear reader, to perform your own investigation.

I remarked some sections ago that we did not do full justice to the application of REINFORCE to the gradient of the ELBO with respect to \(q\). Let’s revisit that again, this time in full detail, because I have a pleasant surprise for you. Recall that because \(f = \ln p / q\) includes \(q\) itself, we must apply the product rule and contend with two terms, \(\begin{align} &\nabla_q~\E_{q(z)} \ln \frac{p(x, z)}{q(z)} \\ &= \sum_z \nabla_q~\bigg( q(z) \ln \frac{p(x, z)}{q(z)} \bigg) \\ &= \E_{q(z)} \bigg( \ln \frac{p(x, z)}{q(z)} \bigg) \nabla_q \ln q(z) - \E_{q(z)} \nabla_q \ln q(z). \end{align}\) The first term is the REINFORCE term that we know and love, while the second term is, shall we say, a by-product. But this by-product should look very familiar to us now: it is the expectation of the score function, which we know to be zero! I find that quite pretty.

I’m not much of a reinforcement learning researcher and, to this day, have never fully read a single reinforcement learning paper in my life. I think it is a testament to the significance and versatility of the REINFORCE algorithm that it somehow managed to weasel its way into the life of a variational autoencoder-obsessed researcher such as myself.

Looking back, I also somewhat lament the mathematical presentation in the reinforcement learning literature; I wish it had been written in a way that was more accessible to my brain. It took me a very long time to appreciate the elegance, inter-connectedness, and simplicity of the concepts mentioned in this post.