26 Dec 2017
ACGAN Learns a Biased Distribution
Recently, I did a few small experiments to verify my hypothesis that the Auxiliary Classifier Generative Adversarial Network (ACGAN) model has a particular downsampling behavior. The goal of this post is to share some of the experiments I conducted while playing with ACGAN.
What is ACGAN?
First and foremost, we need to know what ACGAN is. The Auxiliary Classifier Generative Adversarial Network (ACGAN) is a GAN variant proposed by Odena, et al. in their 2017 ICML paper. Now, let $p^*(x, y)$ denote the true imagelabel joint distribution and let $p_\theta(x, y)$ denote the generator joint distribution. In the ACGAN paper, the objective functions are succinctly summarized as
where $D_\psi$ is the discriminator, $p_\theta$ is the generator, and $q_\phi$ is the auxiliary classifier. The paper further proposes to share parameters between $D_\psi$ and $q_\phi$ by compressing both into a twoinone neural network. The minimax game played between the discriminator and generator is thus approximated by training the discriminator/auxiliary classifier to maximize $L_c + L_s$ and training the generator to maximize $L_c  L_s$.
Personally, I’m not a big fan of this presentation. The interaction resulting from parametersharing between the discriminator and auxiliary classifier is especially tricky to characterize. To make our lives simpler, it’s best to think of the discriminator and auxiliary classifier as two separate objects that, optionally, share neural network parameters for the sake of convenience. Furthermore, by noting that the alternating game is really just so that the discriminator loss approximates the JensenShannon divergence (up to a constant), I think we’re better off describing the objective function that we’re minimizing w.r.t. $(\theta, \phi)$ as
where $d(\cdot, \cdot)$ denotes the JS divergence. From here on out, I shall refer to this as the ACGAN objecitve function. I like this formulation better as it exposes the underlying machinery of the ACGAN model and better illustrates the purpose of the auxiliary classifier. The first term is your standard GAN objective for training the generator. The second term ensures that the auxiliary classifier is trained to approximate the true posterior $p^*(y \mid x)$. Once $q_\phi$ is trained to become a sufficiently good classifier, the third term now plays an interesting role. The intuition is simple: the generator, conditioned on a class label, is supposed to sample an image. But if the generator creates really terrible images, then (hopefully) the classifier will be very confused and not be able to tell which class the generator was attempting to sample from. This thus encourages the generator to stay away from awful, hardtoclassify images and attracts it toward coherent, easytoclassify images instead.
What Counts as Hard to Classify?
The ease with which the auxiliary classifier can accurately classify the generated images is exactly measured by the third term
But we can get a little more insight into the properties of this term by interpreting it as the variational upper bound of the generator’s conditional entropy $H_\theta(Y \mid X)$, whereby
and where the conditional entropy is equivalently expressed as
Here, $H(\cdot)$ denotes the entropy functional. This formulation shows that minimizing the original term actually consists of two parts: minimizing $H_\theta(Y \mid X)$ and making $q_\phi(y \mid x)$ a good variational approximation of $p_\theta(y \mid x)$. Images generated from the generator are thus easy to classify if and only if the following two conditions are met:
 $\E_{p_\theta(x)} \brac{ H(p_\theta(y \mid x)) }$ is minimized
 $q_\phi(y \mid x)$ is a good approximation of $p_\theta(y \mid x)$.
But recall that $q_\phi(y \mid x)$ is also trained to approximate $p^*(y \mid x)$. This adds the third condition:
 $\E_{p_\theta(x)} \brac{ H(p_\theta(y \mid x)) }$ is minimized
 $q_\phi(y \mid x)$ is a good approximation of $p_\theta(y \mid x)$.
 $q_\phi(y \mid x)$ is a good approximation of $p^*(y \mid x)$.
If you close one eye and squint the other, then you can sort of see that minimizing the ACGAN objective implies minimizing $\E_{p_\theta} \brac{ H(p^*(y \mid x))}$. In other words, the generator is discouraged from sampling near the true decision boundary $p^*(y \mid x)$.
Wait, What if the Images are Supposed to be Near the Decision Boundary?
If the generator downsamples images that are close to the decision boundary… what happens if the true density $p^*(x)$ is concentrated near the decision boundary? Therein lies a fundamental problem with the ACGAN objective. One of the reasons why ACGAN performs well on visual inspection and Inception Score is because these metrics reward the generator for sampling easilyrecognized images uniformly across the image classes. If, however, the true distribution $p^*$ contains points that are fundamentally difficult to classify because they lie near the true decision boundary, then ACGAN will downsample these points and thus learn a biased distribution!
It shouldn’t come as too much of a surprise that ACGAN downsamples points near the decision boundary. After all, being near the decision boundary, pretty much by definition, means the image is hard to classify. What is surprising, however, is how aggressively ACGAN actually downsamples points near the decision boundary. To demonstrate this downsampling behavior, I conducted two toy experiments.
Toy Experiment #1
In the first toy experiment, I took the MNIST training set and constructed my own MNIST variant that contains only two classes: A and B. Class A only contains images of 0’s and 1’s, whereas class B only contains images of 0’s and 2’s. Because images of 0’s are as likely to belong in class A as class B, the all images of 0’s lie exactly in the decision boundary of the Bayesoptimal A/B classifier. If our analysis is correct, then ACGAN ought to downsample images of 0’s while successfully generating images of 1’s and 2’s. And indeed, that’s exactly what we see:
Here, I trained ACGAN, which happens to contain a single binary latent variable. It turns out that we can easily convert the ACGAN objective into the GAN or InfoGAN objectives:
 GAN Objective: ACGAN Objective Term #1
 InfoGAN Objective: ACGAN Objective Term #1 + #3
So I decided to train GAN and InfoGAN as well. In the above picture, $\lambda_m$ denotes the weighting of the third term. The greater the third term, the more we expect the model to downsample points near the auxiliary classifier decision boundary. Because of this behavior, we see something quite interesting regarding how GAN, InfoGAN, and ACGAN decides to leverage the binary latent variable.
The GAN objective function does not encourage the binary latent variable to be used in any particular way; all the GAN cares about is generating realistic images of 0’s, 1’s, and 2’s. And if the discrete latent variable isn’t useful, so be it.
The InfoGAN objective function also cares about downsampling the points near the auxiliary classifier. However, InfoGAN is allowed to choose the behavior of the auxiliary classifier (and correspondingly, the behavior of the discrete latent variable). It therefore decided that the discrete latent variable and auxiliary classifier should work in tandem to associate A = {0} and B = {1, 2}. This way, all the images are far from the decision boundary.
The ACGAN objective function, however, uses Term #2 to tie the behavior of the auxiliary classifier to the true class correspondences (where A = {0, 1}, and B = {0, 2}). As such, since images of 0’s lie in the Bayesoptimal decision boundary, and since auxiliary classifier is trained to approximate the Bayesoptimal decision boundary, images of 0’s are therefore near (or on) the auxiliary classifier decision boundary and thus downsampled. Indeed, when $\lambda_m = 2$, the downsampling is actually quite aggressive.
To quantify the downsampling behavior, I computed the relative frequency of 0’s, 1’s, and 2’s as we increase the weighting of $\lambda_m$. We can see that the moment $\lambda_m$ becomes nonzero, the downsampling behavior immediately starts occurring. And by the time $\lambda_m > 2$, ACGAN almost never samples 0’s, despite 0’s making up 50% of the true image distribution!
Toy Experiment #2
To demonstrate this downsampling behavior in a slightly lesscontrived scenario, I trained ACGAN on the original MNIST dataset. By doing so, something quite amusing becomes apparent: ACGAN never generates a serifed 1!
Real MNIST 1’s
ACGAN generated 1’s (note the lack of serif protrusions)
It turns out that when you add protrusions to 1’s, it starts looking like a 2 (or possibly a 7). As such, serif 1’s are closer to the classifier decision boundary than sansserif 1’s… and are therefore downsampled. In other words:
I was pleasantly surprised with the experiment results and decided to convert it into a small submission to the Bayesian Deep Learning workshop. The submission includes additional experiments that demonstrate how ACGAN can actually achieve a higher Inception Score than the original image distribution. From idea conception to paper submission, the entire process took a total of two days, which was the fastest I’ve ever completed a project! I also had a lot of fun making the poster, and even snuck SmileyBall into the end of it.^{1}

I didn’t tell my advisors about it
( *u*)
↩
End of post