Introduction

Have you ever wondered what it would be like if your favorite movie or TV show character had a completely different gender?

Gender transformation for Jaime Lannister from HBO's Game of Thrones
Try in a notebook

Wonder no more! Thanks to StyleGAN, a cutting edge deep learning algorithm from Nvidia research, you (yes YOU!) can explore the fascinating world of generative adversarial Westeros.

StyleGAN can also generate creepy smiling animations like this:

Smiling animation generated from a single image

But hold your horses. Before neural networks can dream up what Jon and Daenerys’ kid looks like (oops, spoiler warning, I guess), we need to take a step back and clearly define what exactly we need, to make sure that we aren’t doing this:

Machine learning joke
Source

If you’ve come to this article, where getting the Gaussian curvature of Jon Snow’s hair on your generated images is your most pressing issue at the moment, I’m going to assume that you at least know how convolutional neural networks work.

Since this article is about StyleGAN (and figuring out what Jon and Daenerys’ kid will look like), I'm only providing a surface level overview of the GAN framework.

If you want to dive deeper into GAN territory, check out Ian Goodfellow’s NIPS 2016 tutorial. It’s one of the best resources for learning about GANs, taught by the GANfather himself.

With that said, let’s get into it.

Generative Adversarial Networks

Most people like to explain GANs with the (admittedly very good) analogy of a counterfeiter and a cop.

However, I don’t think that’s the most exciting way of looking at GANs, especially if you’re already indoctrinated into the cult of training neural networks.

The most important part of a generative adversarial network is, well, the thing that generates images. Unsurprisingly, this bit is called a generator.

The Generator

The generator is a neural network. But not just any ordinary kind.

It uses a special kind of layer called the transposed convolutional layer (sometimes incorrectly called deconvolution).

Transposed convolutions, also sometimes correctly called fractionally strided convolutions (hey don’t ask me; I’m not the one coming up with these names), are an elegant operation that can upscale an image.

To truly understand transposed convolutions and why the deep learning community can’t seem to settle on a name for the darn thing, I’d recommend reading Naoki Shibuya’s article on the subject.

In short, this animation summarizes how to use a transposed convolution to upscale a 2x2 matrix to a 5x5 matrix:

Transposed convolution with filter size 3 and stride 2. Source
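If you’d rather poke at this in code, here’s a tiny PyTorch sketch (my own toy example, not tied to any particular implementation) that does exactly what the animation shows: a transposed convolution with filter size 3 and stride 2 turning a 2x2 input into a 5x5 output.

```python
import torch
import torch.nn as nn

# A single transposed convolution: kernel size 3, stride 2, no padding.
# Output size = (input - 1) * stride + kernel = (2 - 1) * 2 + 3 = 5
upsample = nn.ConvTranspose2d(in_channels=1, out_channels=1, kernel_size=3, stride=2)

x = torch.randn(1, 1, 2, 2)   # a batch with one 2x2, single-channel "image"
y = upsample(x)

print(y.shape)  # torch.Size([1, 1, 5, 5])
```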

Again, I’m going to skip the gory details, so if you want to dig in, you could also check out a guide to convolution arithmetic.

Since it’s deep learning, and we absolutely must utilize all buzzwords to their maximum potential in order to satisfy potential investors that our totally new, never-seen-before matrix multiplications are going to change the world, it makes sense to stack a bunch of these layers to get a neural network that can upscale images to reasonably large sizes.

So the final architecture for the image generator looks something like this:

architecture for the image generator
Source
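To make that architecture a little more concrete, here’s a minimal, DCGAN-flavored generator sketch in PyTorch. To be clear, this is my own simplified illustration, not Nvidia’s actual network, and the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Toy generator: latent vector -> 64x64 RGB image via stacked transposed convolutions."""
    def __init__(self, latent_dim=100, feat=64):
        super().__init__()
        self.net = nn.Sequential(
            # latent_dim x 1 x 1 -> (feat*8) x 4 x 4
            nn.ConvTranspose2d(latent_dim, feat * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(feat * 8), nn.ReLU(True),
            # -> (feat*4) x 8 x 8
            nn.ConvTranspose2d(feat * 8, feat * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feat * 4), nn.ReLU(True),
            # -> (feat*2) x 16 x 16
            nn.ConvTranspose2d(feat * 4, feat * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feat * 2), nn.ReLU(True),
            # -> feat x 32 x 32
            nn.ConvTranspose2d(feat * 2, feat, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feat), nn.ReLU(True),
            # -> 3 x 64 x 64
            nn.ConvTranspose2d(feat, 3, 4, 2, 1, bias=False),
            nn.Tanh(),
        )

    def forward(self, z):
        # Reshape the latent vector into a 1x1 "image" with latent_dim channels.
        return self.net(z.view(z.size(0), -1, 1, 1))

z = torch.randn(8, 100)          # 8 random latent vectors
fake_images = Generator()(z)     # shape: (8, 3, 64, 64)
```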


Of course, without any sensible notion of what the weights of these convolution filters are, all our generator model can do for now is spit out random noise. That kind of sucks.

What we need now, other than a hard drive full of images, is a loss function.

We need something to tell our generator how wrong or right it is. A teacher, if you will.

Teacher Teaching kids
Photo by NeONBRAND

For image classification, this loss function was pretty much given to us by the gods of mathematics. Since we have pairs of images and labels, we can do this:

$$\hat{y} = \text{neural net}(x)$$
$$y = \text{the actual, correct label given in the dataset}$$
$$\text{loss} = (y-\hat{y})^2$$

Of course, depending on the task, you might want to use cross-entropy loss or something like that.

But I digress. The point is that the labeled data allows us to construct a differentiable loss function that we can slide down (using backpropagation and gradient descent).

We need something similar for our generator network.

Ideally, a proper loss function should tell us how realistic our generated images are, because once we have such a function, we can optimize it using known methods (i.e., backpropagation and gradient descent).

Unfortunately, in the eyes of logarithms and cosines, Sansa Stark and Gaussian noise are pretty much the same things.

Sansa stark
Source

We had a neat mathematical equation for the loss in the image classification example, but we can’t write one down here, because there’s no simple, differentiable function that tells us how real or fake a generated image looks (which, if you’ve been sleeping through this section, is precisely what we need).

Let me say it again: we need something that takes in an image and returns a number saying whether it’s real or fake (“1” if it’s real, and “0” if it’s fake).

Input: Image. Output: binary value.

Are you getting it? This isn’t just a loss function; it’s a whole other neural network.

The Discriminator

The model that discriminates between real images and fake images is called, unsurprisingly, the discriminator.

birds sitting on a line
Source

The discriminator is a convolutional neural network that is trained to predict whether input images are real or fake. It outputs “1” if it thinks the image is real, and “0” if it thinks the image is fake.
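If you like seeing things in code, here’s a toy discriminator sketch in PyTorch that mirrors the toy generator from earlier (again, my own simplified illustration, not any official architecture):

```python
import torch.nn as nn

class Discriminator(nn.Module):
    """Toy discriminator: 64x64 RGB image -> probability that the image is real."""
    def __init__(self, feat=64):
        super().__init__()
        self.net = nn.Sequential(
            # 3 x 64 x 64 -> feat x 32 x 32
            nn.Conv2d(3, feat, 4, 2, 1, bias=False),
            nn.LeakyReLU(0.2, inplace=True),
            # -> (feat*2) x 16 x 16
            nn.Conv2d(feat, feat * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feat * 2), nn.LeakyReLU(0.2, inplace=True),
            # -> (feat*4) x 8 x 8
            nn.Conv2d(feat * 2, feat * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feat * 4), nn.LeakyReLU(0.2, inplace=True),
            # -> (feat*8) x 4 x 4
            nn.Conv2d(feat * 4, feat * 8, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feat * 8), nn.LeakyReLU(0.2, inplace=True),
            # -> 1 x 1 x 1, squashed into a probability
            nn.Conv2d(feat * 8, 1, 4, 1, 0, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, img):
        return self.net(img).view(-1)  # close to "1" means real, close to "0" means fake
```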

So from the generator network’s perspective, the discriminator acts as a loss function.

If the generator updates its parameters in such a way that it generates images which, when fed through the discriminator, produce values close to one, it creates images that don’t look like the result of a three-year-old smashing a baseball at a TV screen.

Image by Yatheesh Gowda from Pixabay

At the end of the day, your GAN should look like this:

GAN workflow
Source

Putting it All Together

So to summarize, here’s the step-by-step process for creating a GAN-based image generator (a minimal code sketch follows the list):

  1. The generator (a neural network with transposed convolutional layers) generates images, most of which will look like garbage.
  2. The discriminator takes in a bunch (or more accurately, a mini-batch) of images, of which some are real (from a large dataset), and some are fake (from the generator).
  3. The discriminator attempts to perform binary classification to predict which images are real (by outputting “1”) and which images are fake (by outputting “0”). At this point, the discriminator is about as accurate as Tyrion is with a bow and arrow.
  4. The discriminator updates its parameters to become better at classifying images.
  5. The generator uses the discriminator as a loss function and updates its parameters accordingly, to become better at generating images that look realistic enough to fool the discriminator (i.e., make the discriminator output numbers close to “1”).
  6. The game continues, until both the generator and discriminator reach a point of equilibrium, where the discriminator can no longer distinguish between images created by the generator and images from the dataset.
  7. Gracefully throw away the discriminator, and voila—you now have a generator that generates images, most of which will hopefully not look like garbage.
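If it helps to see those steps as code, here’s a heavily simplified PyTorch training loop. It assumes the toy `Generator` and `Discriminator` sketched earlier and some `dataloader` of real images (all placeholders of mine, not anything official):

```python
import torch
import torch.nn as nn

G, D = Generator(), Discriminator()          # toy models sketched earlier
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
bce = nn.BCELoss()

for real_images in dataloader:               # mini-batches of real images (placeholder)
    batch = real_images.size(0)
    real_labels = torch.ones(batch)
    fake_labels = torch.zeros(batch)

    # --- Steps 2-4: train the discriminator on real + fake images ---
    z = torch.randn(batch, 100)
    fake_images = G(z).detach()              # don't backprop into G on this step
    loss_D = bce(D(real_images), real_labels) + bce(D(fake_images), fake_labels)
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # --- Step 5: train the generator to push D's output toward "1" ---
    z = torch.randn(batch, 100)
    loss_G = bce(D(G(z)), real_labels)       # the discriminator acts as the loss function
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```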

StyleGAN

The field of deep learning moves fast, and since 2014, there have been more GAN innovations than fan-favorite character deaths on Game of Thrones.

So even if you use the fantastic GAN training framework that I discussed above, your generated images will look like grayscale fried avocados at best.

To truly make GANs work in practice, we need to employ a suite of clever techniques.

If you want to conquer the seven kingdoms of GANseteros, there’s a GitHub repo that lists most of the key GAN innovations (GANnovations?) over the last few years. However, as impressive as it is, unless you have as much time as Aemon Targaryen, you’re probably not even going to get halfway through it.

Source

So instead, I’ll focus on the critical aspects of only one particular model — StyleGAN.

Nvidia’s research team proposed StyleGAN at the end of 2018, and instead of trying to create a fancy new technique to stabilize GAN training or introducing a new architecture, the paper says that their technique is “orthogonal to the ongoing discussion about GAN loss functions, regularization, and hyper-parameters.”

That means that in 2045 when humanity invents the hyper ultra-massive giant insanely BigGAN, what I’m about to show will still work.

elevator with glasses on sides
Photo by Tomasz Frankowski

Now that’s enough small talk. Let me show you why StyleGAN isn’t a waste of your time.

Mapping Network

Typically, the generator network in a GAN would take in a random vector as input, and use transposed convolutions to morph that random vector into a realistic image, as I showed you above.

That random vector is called a latent vector.

The latent vector is sort of like a style specification for the image. It describes the kind of picture that we want the generator to paint.

If you were describing a potential criminal to a forensic artist, you’d tell him/her a few “features” of the suspect, like the hair color, facial hair, and distance between eyes of the suspect.

Photo by Kelly Sikkema

The only problem is that neural networks don’t understand “hair color, facial hair, and distance between eyes.” They only understand CUDATensors and FP16s.

The latent vector is a high-level description of the image in neural-network language.

If you want to generate a new image, you’d have to select a new vector, which makes sense — change the input, and you change the output.

However, that doesn’t work so well if you want to have fine control of the style of the image. Since you don’t have control of how the generator chooses to model the distribution over possible latent vectors, you can’t precisely control the final image's style.

The problem arises since the way a GAN learns to map latent vectors to images needs to be… learned by the GAN. The GAN might not be too happy conforming to human norms.

You could try changing the hair color of your generated face by nudging a number in the latent vector just a little bit, but the output might have glasses, a different skin tone, and might even be a different gender.

This problem is called feature entanglement. StyleGAN aims to reduce it.

Ideally, we’d like to have a neater latent space representation. One that allows us to make small changes to the input latent vector without making the output image/face look drastically different.

The way StyleGAN attempts to do this is by including a neural network that maps an input vector to a second, intermediate latent vector which the GAN uses.

random vector to synthesis network
Source

Specifically, Nvidia chose to use an 8-layer network with a 512-dimensional vector as the input, and a 512-dimensional vector as the output. Note, however, that these choices are arbitrary, and you can use your own hyperparameters if you want to.
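As a rough sketch, the mapping network is just a stack of dense layers. The 512-in, 512-out, 8-layer shape below follows the paper; everything else (the plain `nn.Linear` layers, the leaky ReLU) is a simplification of my own. The official implementation adds extra tricks such as normalizing the input and using equalized learning rates.

```python
import torch.nn as nn

def make_mapping_network(dim=512, n_layers=8):
    """Simplified mapping network: input latent z (512-d) -> intermediate latent w (512-d)."""
    layers = []
    for _ in range(n_layers):
        layers += [nn.Linear(dim, dim), nn.LeakyReLU(0.2)]
    return nn.Sequential(*layers)

mapping = make_mapping_network()
# w = mapping(z)   # z: (batch, 512) -> w: (batch, 512)
```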

Hypothetically, adding this neural network to create an intermediate latent vector allows the GAN to figure out how it wants to use the numbers in the vector we feed it through dedicated dense layers, instead of having the transposed convolutions work out how to use the latent vector directly.

The mapping network should reduce feature entanglement (for a complete discussion on why this isn’t just a waste of oh-so-precious compute, I’d encourage you to read the official StyleGAN paper).

If this idea is not very intuitive to you, don’t worry. All that matters is that this whole “mini neural network to map input vector to intermediate latent vector” thing works well, so we’d rather do it than not.

We now have a mapping network that allows us to use the latent space more effectively. That’s great. But there’s a lot more we can do to make the style control even better.

Adaptive Instance Normalization (AdaIN)

Going back to the forensic sketch artist analogy, think about the process of actually describing the suspect.

You wouldn’t just say something like: “So hey, there was this tall skinny dude with a big red beard. He robbed a bank and stuff. But anyway, I’ve got a TV show to catch up with so I’ll catch ya later officers, have a good one” and pack your stuff and leave.

Man running
Photo by Andy Beales

No. You’d stick around for a bit and describe the suspect, wait for the artist to sketch something up, provide more details, wait for the artist, provide more details, and the cycle continues until the two of you can collaborate and reach an accurate recreation of the suspect’s face.

In other words, you, the source of features and information (i.e., the latent vector), would repeatedly inject information into the artist, the person who renders the description into a visible, tangible thing (i.e., the generator).

However, in the traditional formulation of GANs, the latent vector doesn’t “stick around for long enough.” Once you feed the latent vector into the generator as an input, it is never used again, which is the computational equivalent of packing your bags and leaving.

The StyleGAN model fixes this problem by doing exactly what you’d expect — it makes the latent vector “stick around” longer. By injecting the latent vector back into the generator at every layer, the generator can keep referring back to the “style guide” in the same way that the artist can keep asking you questions.

Photo by Thiago Barletta

Now, let’s get into the technical difficulties.

It’s all neat and simple in the analogy world, but disrespectful TV addicts and skinny red-bearded bank robbers don’t translate into mathematical equations.

“So how exactly does StyleGAN inject the latent vector back into the generator at every layer,” you might ask.

“Adaptive instance normalization,” I shall most graciously respond.

AdaIN (adaptive instance normalization; come on, did I seriously need to expand that?) is a technique that was initially used in style transfer and later made its way into StyleGAN.

AdaIN uses a linear layer (more accurately called a “learned affine transformation” in the original paper) that maps the latent vector onto two scalars for each feature map, which we’ll call $y_s$ and $y_b$.

The “s” stands for scale, and the “b” stands for bias.

Once you have those scalars, here’s how you perform AdaIN:

$$y = (y_s,y_b) = f(w)$$

$$\text{AdaIN}(x_i,y) = y_s\frac{x_i - \mu(x_i)}{\sigma(x_i)} + y_b$$

Here, $f(w)$ represents a learned affine transformation, $x_i$ is an instance that we are applying AdaIN to, and $y$ is a set of two scalars $(y_s,y_b)$ that control the “style“ of the generated image.

This might seem very familiar if you’ve used BatchNorm before, and that’s intentional. One key difference, however, is that the mean and variance are computed per-channel and per-sample, rather than for an entire mini-batch, as shown below:

Source

$$\mu_{nc}(x) = \frac1{HW}\sum_{h=1}^{H}\sum_{w=1}^{W}x_{nchw}$$

$$\sigma_{nc}(x) = \sqrt{\frac1{HW}\sum_{h=1}^{H}\sum_{w=1}^{W}(x_{nchw}-\mu_{nc})^2 + \epsilon}$$
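Here’s what AdaIN looks like as a minimal PyTorch module. This is my own sketch that follows the equations above (normalize each feature map per sample and per channel, then scale and shift it with values predicted from the latent vector), not the official implementation.

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Minimal AdaIN: normalize each feature map, then scale/shift it with
    (y_s, y_b) predicted from the latent vector w."""
    def __init__(self, latent_dim, channels):
        super().__init__()
        # The "learned affine transformation": one scale and one bias per channel.
        self.affine = nn.Linear(latent_dim, channels * 2)

    def forward(self, x, w, eps=1e-8):
        # x: (N, C, H, W) feature maps, w: (N, latent_dim) latent vectors
        y = self.affine(w)                      # (N, 2C)
        y_s, y_b = y.chunk(2, dim=1)            # scale and bias, (N, C) each
        y_s = y_s[:, :, None, None]
        y_b = y_b[:, :, None, None]

        mu = x.mean(dim=(2, 3), keepdim=True)   # per-sample, per-channel mean
        sigma = x.std(dim=(2, 3), keepdim=True) + eps
        return y_s * (x - mu) / sigma + y_b
```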

This way of infusing styles into the hidden layers of the generator might seem strange at first, but recent research has shown that controlling the gain and bias parameters (i.e., $y_s$ and $y_b$ respectively) in the hidden layer activations can drastically affect the quality of style transfer images. So roll with it.

By doing all of this normalization stuff, we’re able to inject style information into the generator in a much better way than just using an input latent vector.

The generator now has a sort of “description” of what kind of image it needs to construct (thanks to the mapping network), and it can also refer to this description whenever it wants (thanks to AdaIN).

But we can still do more.

Learned Constant Input

If you’ve ever tried one of those “draw a Disney character in just 5 simple steps” tutorials and undoubtedly failed, you know that they all start with that creepy slender-man outline thing.

Source

Notice that you can make a bunch of different character faces using the same baseline skeleton, and slowly add on finer details from there.

The same idea applies to the forensic artist. He/she likely has a reasonably good idea of what a human face roughly looks like, even without you specifying any details at all.

Recall that in the traditional generator network of the GAN, we feed in a latent vector as input, and use transposed convolutions to map that latent vector to an image.

The reason we needed that latent vector is so that we can provide variation in our generated images. By sampling different vectors, we get different images.

If we use a constant vector and map that to an image, we’ll get the same image every single time. That would be pretty boring.

Photo by Jonny Clow

However, in StyleGAN we already have another way of dropping in stylistic information into the generator — AdaIN.

So why do we even need a random vector as input when we can learn it? Turns out we don’t.

You see, in the regular GAN, the only source of variation and stylistic data was the input latent vector that we don’t ever touch again. But as we’ve seen in the previous section, this is pretty weird and inefficient, since the generator can’t “see” the latent vector ever again.

StyleGAN corrects this by “injecting” the latent vector into each layer through adaptive instance normalization, which solves many problems. It also has a useful side effect: we don’t need to start from a random vector at all. We can learn a constant one instead, since AdaIN supplies all the style information the generator needs.

To be more specific, StyleGAN chooses a learned constant input that is a $4\times 4 \times 512$ tensor, which you can think of as a $4\times4$ image with $512$ channels. Note again, that these dimensions are entirely arbitrary, and you can use whatever you want in practice.

Source

The rationale behind this is the same as the Disney princess drawing circles thing — the generator can learn some idea of a rough “skeleton” that is standard to all images so that it can start from a blueprint rather than from scratch.
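In code, the “blueprint” is nothing more exotic than a learned parameter tensor. Here’s a minimal sketch of mine, using the $4\times4\times512$ shape from the paper:

```python
import torch
import torch.nn as nn

class SynthesisInput(nn.Module):
    """The generator starts from a learned constant instead of a random vector."""
    def __init__(self, channels=512, size=4):
        super().__init__()
        # One learned 4x4x512 "blueprint", shared across all generated images.
        self.const = nn.Parameter(torch.ones(1, channels, size, size))

    def forward(self, batch_size):
        # Every image in the batch starts from the same learned constant;
        # the variation comes from AdaIN (and noise), not from this input.
        return self.const.expand(batch_size, -1, -1, -1)
```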

So for the most part, there you have it, folks. That’s StyleGAN. In practice, there are a few other neat tricks that you could employ to make your generated images look a tad more realistic.

If you’re not too concerned about these details, congratulations 🎉, you now understand the core of one of the most innovative takes on GANs in the entire universe (GANiverse? God, I should really stop coming up with these puns).

But if you want the absolute best images, and a look at what Jon and Daenerys’ child might look like, keep reading.

Style Mixing

Remember how I told you that we inject the latent into each layer individually?

Well, what if we didn’t just inject one latent vector, but two? 🤔

Think about it. We have a lot of transposed convolution and AdaIN layers in our generator (18 in Nvidia’s implementation, but that’s completely arbitrary). At each AdaIN layer, we independently inject a latent vector.

Source

So if the injection into each layer is independent, we could inject different latents into different layers. 💡

If you thought that was a good idea, well, Nvidia thought so too. Using their fancy GPUs, the team tried using different latents corresponding to different “people” at different layers.

The experiment was set up like this: take 3 different latent vectors, which, when used individually, would generate 3 realistic human faces.

Then, they injected these vectors into 3 different spots:

  1. At the “coarse” layers, where the hidden representation is spatially small — from $4 \times 4$ to $8 \times 8$.
  2. At the “medium” layers, where the hidden representation is medium-sized — from $16 \times 16$ to $32 \times 32$.
  3. At the “fine” layers, where the hidden representation is spatially large — from $64 \times 64$ to $1024 \times 1024$.

You might think “well, geez, those fine layers sure do occupy a vast majority of the layers. From 64 to 1024? That’s a lot. Shouldn’t the spacing be more even?”

Well, not really. If you’ve read the ProGAN paper, you know that the generator quickly picks up on information and the larger layers mostly refine and sharpen the outputs from the previous layers.

Then, they tried moving the three latent vectors around a bit from their initial spots to see how the resulting image changes qualitatively.

Here are the results:

The results feel very accurate, in a very creepy way. But hey, it works.
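Mechanically, style mixing boils down to deciding which intermediate latent each resolution block gets to see. Here’s a rough sketch of that logic (the `synthesis_blocks` and their `(x, w)` interface are hypothetical placeholders of mine, not the official API):

```python
def mix_styles(synthesis_blocks, const_input, w_coarse, w_medium, w_fine):
    """Hypothetical sketch of style mixing: feed different intermediate
    latents (w) to different resolution ranges of the synthesis network."""
    x = const_input
    for block in synthesis_blocks:
        resolution = x.shape[-1]          # current spatial size of the feature maps
        if resolution <= 8:
            w = w_coarse                  # coarse layers: pose, face shape, hair style
        elif resolution <= 32:
            w = w_medium                  # medium layers: finer facial features
        else:
            w = w_fine                    # fine layers: color scheme, skin texture
        x = block(x, w)                   # each block upsamples and applies AdaIN with w
    return x
```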

Stochastic Noise

After all the cool things that Nvidia did with StyleGAN, sorry to disappoint you, but I most certainly did not save the best for last.

After generating fake faces and even mixing them in novel ways, what if you found one that you like?

You could generate a hundred copies of the same image, but that would be pretty boring.

So instead, we would like to have a few variations of the same image. Perhaps variants with slightly different hairstyles, or more freckles. Minor changes like that.

images generated using GAN
Source

Of course, you could do it the normal GAN way, by introducing some noise into the latent vector like this:

$$\text{face that I like} = G(x)$$
$$\text{variant of face that I like} = G(x + \epsilon)$$

where $G$ is the generator, and $\epsilon$ is a vector whose components are small numbers that are sampled randomly.

But we have StyleGAN, and as the name implies, we have control over the image style.

So just like how we did layer-wise injections for latent vectors, we can do the same for noise. We can choose to add noise at the coarse layers, middle layers, fine layers, or any combination of the three.

In the paper, the StyleGAN noise was added pixel-wise, which makes sense, because this is the more common and natural way noise has historically been added to images, as opposed to perturbing the latent vector.
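Here’s a minimal sketch of that kind of pixel-wise noise injection: a single noise map per image, scaled by a learned per-channel weight and added to the feature maps. This is my own simplified reading of the idea, not the official implementation.

```python
import torch
import torch.nn as nn

class NoiseInjection(nn.Module):
    """Rough sketch of per-pixel noise: one Gaussian noise map per sample,
    scaled by a learned per-channel weight, added to the feature maps."""
    def __init__(self, channels):
        super().__init__()
        self.weight = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, x):
        n, _, h, w = x.shape
        noise = torch.randn(n, 1, h, w, device=x.device)  # one noise map per sample
        return x + self.weight * noise                    # broadcast across channels
```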

Source

There were some interesting stylistic results from playing around with the noise (God, does the awesomeness of this paper ever end?):

While generating Game of Thrones characters, I had no use for noise, since I was looking to generate just a few high-quality images. But it’s nice to see that the research team has thought about that.

StyleGAN-ing Your Favorite Game of Thrones Characters

Now that you understand how StyleGAN works, it's time for the thing you've all been waiting for: predicting what Jon and Daenerys' son/daughter will look like.

Without any further ado, I present to you Djonerys, first of their name:

stylegan application on GOT characters
In case you didn't know, he's the dude on the bottom right

The future protector of the realm you're looking at was generated using the style mixing technique, which was discussed above.

Finally, in celebration of 8 years with the hero of Westeros, let's pay homage to Jon Snow's growth over the years.

Jon Snow through the ages - generated by StyleGAN

There are a lot of other things you can do once you have latent representations of the characters, like creating a kid Khal Drogo or making Jaime a woman.

So have a go at it. The seven kingdoms are yours.

Try in a notebook

Lazy to code, don't want to spend on GPUs? Head over to Nanonets and build computer vision models for free!