An Homage to REINFORCE

Paying Homage

It was recently brought to my attention that Ronald J. Williams of REINFORCE fame had passed away.He was also co-author to the seminal Learning Representations by Back-Propagating Errors paper. But I don't have a Nature subscription. In honor of his legacy, my initial plan was to read his original paper on the REINFORCE algorithm. To my chagrin, I realized pretty quickly that I couldn’t actually understand his paper. I’ve encountered this before with older papers—the manner in which math is presented in them often fly over my head. In lieu of this effort, I will instead go over the REINFORCE algorithm as I know it, with a sprinkling of sentimentality to illustrate how REINFORCE has been relevant to my own foray into deep learning.

Blackpropagation Through Random Variables

My introduction to deep learning began with the variational autoencoder. Given a latent variable model \(p(x, z)\) with latent \(z\) and observed variable \(x\), the goal is to indirectly maximize log-likelihood \(\ln p(x)\) by instead introducing a variational object \(q(z)\) and jointly optimizing the Evidence Lower Bound (ELBO), \(\newcommand{\E}{\mathbb{E}} \newcommand{\N}{\mathcal{N}} \begin{align} \max_{p, q} \E_{q(z)} \ln \frac{p(x, z)}{q(z)}. \end{align}\) Okay, that’s quite a mouthful. If you’re not familiar with the ELBO, don’t worry. The statistical interpretation of this expression is not of importance to this post. Instead, given the above expression, let’s take a moment to decide how we should go about solving this optimization problem.

To any self-respecting deep learning researcher (deep learner?), the solution is obvious: gradient descent. And indeed, if you compute the gradient of the expression with respect to \(p\), it is quite straightforward,This footnote pays lip-service to the Leibniz integral rule and/or dominated convergence theorem. \(\begin{align} \nabla_p~\E_{q(z)} \ln \frac{p(x, z)}{q(z)} = \E_{q(z)} \nabla_p \ln p(x, z). \end{align}\) As a budding deep learner, I could followed along this much. But I was rather stumped by the gradient with respect to \(q\), \(\begin{align} \nabla_q~\E_{q(z)} \ln \frac{p(x, z)}{q(z)} = \E_{q(z)} \nabla_q~(\ldots\mathrm{??}\ldots), \end{align}\) because I wasn’t sure how to push the gradient through the expectation when they are both with respect to the same object. In other words: how does one backpropagate through random variables?

Forunately, Kingma’s original variational autoencoder paper offers an elegant approach when the random variable is continuous. Supposing there exists a function \(T_q\) (differentiable with respect to \(q\)) and a random variable \(\epsilon\) (which doesn’t depend on \(q\)) such that \(T_q(\epsilon)\) is identical in distribution to \(z \sim q(z)\), then we can simply reparameterize the expectation as \(\begin{align} \E_{q(z)} f(z) = \E_{\epsilon} f(T_q(z)), \end{align}\) for any function \(f\). Now that the expectation no longer depends on \(q\), we can safely push \(\nabla_q\) in. This is the famous reparameterization trick, and—some would argue—the main contribution of Kingma’s paper.

However, what if your random variable were discrete? You could use a continuous relaxation , but such approaches introduce bias to the gradient.In practice, these approaches do work under some circumstances. Determining when they do/don't work is a matter of trial-and-error. For the purpose of this post, let's pretend they don't work. As it turns out, around the same time that Kingma’s paper came out, Ranganath released his paper on black-box variational inference , which noted the following, \(\begin{align} \nabla_q~\E_{q(z)} f(z) &= \nabla_q \sum_z q(z) f(z) \\ &= \sum_z f(z) \nabla_q~q(z) \\ &= \sum_z f(z) q(z) \nabla_q \ln q(z) \\ &= \E_{q(z)} f(z) \nabla_q \ln q(z). \end{align}\) Of course, the actual derivation provided in the appendix is slightly more involved, due to the fact that \(f = \ln p / q\) itself depended on \(q\) and is thus subject to the product rule, but we shall overlook this small detail for now (I promise we’ll get back to it). If we take a step back, there are a handful of things to marvel about this identity.

  1. The identity is agnostic to whether \(z\) is continuous or discrete.
  2. The identity did not depend on the differentiability of \(f\) with respect to \(z\).
  3. The key insight to the derivation lies in the “log-derivative trick”, where we expand the expectation, convert \(\nabla_q~q(z)\) into \(q(z) \nabla_q \ln q(z)\), and then reconstruct the expectation.
  4. It exposes that the gradient direction is basically a form of random search, where weight each direction \(\nabla_q \ln q(z)\) by the probability of \(q(z)\) choosing that direction (thanks to the expectation with respect to \(q\)), as well as the reward assigned by the function \(f\).

Policy Gradient

I didn’t know it at the time, but this was basically the REINFORCE algorithm. I was not all that interested in reinforcement learning back then and shied away from sequential decision making because the math looked very scary. Looking back, the relation to sequential decision making was pretty obvious: given a 2-stepI'm only considering the 2-sequential action setting for simplicity. You can easily extend the formulation to the general case. composite policy (\(\pi\)) plus environment (\(p\)) distribution \(p_\pi = \pi(a \mid s)p(s' \mid s, a)\pi(a' \mid s')\) and reward function \(R(s, a, s', a')\), the REINFORCE algorithm exposes that we can simply construct the policy gradient as \(\begin{align} \nabla_\pi~\E_{p_\pi} R &= \E_{p_\pi} R(s, a, s', a') \nabla_\pi \ln p_\pi(s, a, s', a') \\ &= \E_{p_\pi} R(s, a, s', a') \nabla_\pi \ln \pi(a \mid s)\pi(a' \mid s'). \end{align}\) In other words, just follow trajectories based on your current policy, and then reinforce (haha) the trajectories that lead to better rewards. This, by the way, is at the heart of the Policy Gradient Theorem .

Control Variates

During my early research career, I was fortunate enough to have collaborated on several research works with Mohammad Ghavamzadeh, a prominent research scientist in reinforcement learning. None of his reinforcement learning expertise ever rubbed off on me. But I remember him saying “baselines” a lot and thinking to myself, “what the heck is a baseline?”

It wasn’t until sometime later while shuffling through the posters at NeurIPS 2017,This was the last time it was called NIPS. The number of immature jokes about the name was at an all-time high in 2017. It was honestly kind of annoying. that I stumbled across the work on REBAR . I do not remember who was presenting at the time,Actually, I don't even know for sure if it was the REBAR paper. My memory on this is pretty fuzzy. but he was kind enough to explain to me that, for any Monte Carlo estimator, one can apply something called a “control variate” to reduce its variance. For example, supposing we have an unbiased estimator \(X\) of some unknown quantity \(\mu\), we can always construct a new unbiased esimator of the form \(\begin{align} X^* = X + C, \end{align}\) so long as \(\E[C] = 0\). The key is whether you’re clever enough to find such a control variate \(C\) that is anti-correlated with \(X\) so as to cancel out the noise of \(X - \mu\).

In the context of REINFORCE, the question becomes what is a control variate to reduce the variance of the “expectand”Not a real word. But I like it. \(f(z) \nabla_q \ln q(z)\). Before we address it, we first need to embark on a tangent regarding a peculiar characteristic of the expectation of \(\nabla_q \ln q(z)\). In particular, \(\begin{align} \E_{q(z)} \nabla_q \ln q(z) &= \sum_z q(z) \frac{\nabla_q~q(z)}{q(z)} \\ &= \sum_z \nabla_q~q(z) \\ &= \nabla_q \sum_z q(z) \\ &= \nabla_q~1 \\ &= 0. \end{align}\) This is pretty remarkable. By using the log-derivative trick,It's amazing how frequently this trick pops up. I highly recommend having this identity in your mathematical arsenal. we see that the expectation of the score function is zero. This raises a natural and clever question: why not multiply the score by some choice of scalar \(b\) (after all, \(b \cdot 0 = 0\)) and use that as the control variate? This results in the following modified REINFORCE expressionWe shall introduce the control variate via a subtraction for reasons that will be immediately apparent. \(\begin{align} \nabla_q~\E_{q(z)} f(z) = \E_{q(z)} \bigg(f(z) - b\bigg) \nabla_q \ln q(z). \end{align}\) We have now reduced the original challenge of finding a control variate to determining a choice of \(b\) so that the expectand has reduced variance. As it turns out, a particularly effective choice of \(b\) is \(\begin{align} b = \E_{q(z)} f(z). \end{align}\) Let’s pause and appreciate this for a bit. This choice of \(b\) means that, rather than weighting the gradient by the reward \(f(z)\) directly, we instead weight it by the “advantage” (how much the reward beats the average reward achieved by \(q(z)\)). Somewhat sadistically, this means that rather than relying solely on positive reinforcements, we punish trajectories that underperform \(b\) and celebrate trajectories that out-perform \(b\). It’s almost like \(b\) is a baseline that we’re comparing the trajectories against. Ah. That’s why it’s called a baseline.

At this point, I must confess that I do not have a strong mathematical understanding of why the average reward is an effective choice of baseline. Daniel Seita’s post seems pretty promising and supposedly explains why this choice of baseline is optimal. But I haven’t read it yet, largely out of laziness, and partly out of mild skepticism. My rather small degree of skepticism comes from my encounter with the NVIL paper , which states in no uncertain terms the following

Though this approach to fitting the baseline [to the mean] does not result in the maximal variance reduction, it is simpler and in our experience works as well as the optimal approach of .

I’m sure there’s a good reason why Seita and the authors of NVIL came to very different conclusions about the optimality of the mean-reward baseline. I just can’t be bothered to figure it out at the moment. So I encourage you, dear reader, to perform your own investigation.

So About that ELBO Gradient

I remarked some sections ago that we did not do full justice to the application of REINFORCE to the gradient of the ELBO with respect to \(q\). Let’s revisit that again, this time in full detail, because I have a pleasant surprise for you. Recall that because \(f = \ln p / q\) includes \(q\) itself, we must apply the product rule and contend with two terms, \(\begin{align} &\nabla_q~\E_{q(z)} \ln \frac{p(x, z)}{q(z)} \\ &= \sum_z \nabla_q~\bigg( q(z) \ln \frac{p(x, z)}{q(z)} \bigg) \\ &= \E_{q(z)} \bigg( \ln \frac{p(x, z)}{q(z)} \bigg) \nabla_q \ln q(z) - \E_{q(z)} \nabla_q \ln q(z). \end{align}\) The first term is the REINFORCE term that we know and love, while the second term is, shall we say, a by-product. But this by-product should look very familiar to us now: it is the expectation of the score function, which we know to be zero! I find that quite pretty.


I’m not much of a reinforcement learning researcher and, to this day, have never fully read a single reinforcement learning paper in my life. I think it is a testament to the significance and versatility of the REINFORCE algorithm that it somehow managed to weasel its way into the life of a variational autoencoder-obsessed researcher such as myself.

Looking back, I also somewhat lament the mathematical presentation in the reinforcement learning literature; I wish it had been written in a way that was more accessible to my brain. It took me a very long time to appreciate the elegance, inter-connectedness, and simplicity of the concepts mentioned in this post.Don't even get me started on VIMCO and PPO. I'm mad at myself for how long it took me to finally read and realize how simple these algorithms actually are. Sorry to Jiaming, who unsuccessfully tried to convince me of these algorithms' simplicity during our late-night chats in the office at Stanford. But whine as much as I want about mathematical dissemination, I think the reality is that everyone thinks about math a little differently and no one is able to cater math in a way that is optimal for your mental digestion.Even the math in this blog post which I've tried my best to streamline to my liking will, I am sure, come across as completely ridiculous and unreadable to some. It took years of fumbling, reading and re-reading papers, before I managed to translate certain math into a form that my brain found palatable. And I think that’s what I’ll continue to do: trying and trying again, often failing, giving up and revisiting. Until something eventually clicks. And then I latch onto it, this time a little wiser. Much like the REINFORCE algorithm.