Variational AutoEncoder: Explaining KL Divergence

gordonlim
Jun 4, 2021
Ahlad Kumar’s YouTube channel

If you were on YouTube trying to learn about variational autoencoders (VAEs) as I was, you might have come across Ahlad Kumar's series on the topic. In his second video (embedded above), he explained KL divergence, which we will later see is in fact a building block of the loss function in the VAE. This article aims to bridge ideas in probability theory as you may have learnt in school to those in the video. We will then revisit the proof for the KL divergence between 2 multivariate Gaussians (a.k.a. normal distributions).

Note: This topic requires knowledge of joint probability distributions as covered in MITOCW 18.05 and basic matrix operations (namely multiplication, transpose and trace) as covered in MITOCW 18.06.

Continuous vs Discrete

Recall that there are 2 types of random variables: continuous and discrete. While continuous random variables have a probability density function (PDF), discrete random variables have a probability mass function (PMF). It is important to keep this distinction in mind when we are finding expectation and variance.

The expectation of a discrete random variable is defined with a summation, while the expectation of a continuous random variable is defined with an integral. There was a small mix-up over this in the video which might have caused some confusion (although the calculations were ultimately not affected). We will revisit this later.
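
For reference, writing P(X = x) for the PMF of a discrete X and f(x) for the PDF of a continuous X, the two definitions are:

    E[X] = \sum_x x \, P(X = x)                      (discrete)
    E[X] = \int_{-\infty}^{\infty} x \, f(x) \, dx   (continuous)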

PMF & CDF

A typical test question in school might ask you to find P(X = 1).

In other words, find the probability that the discrete random variable X is 1. This understanding might be so intuitive that we just take P to stand for probability. But it isn't. Sure, the expression resolves to a probability, but the P is really just a label for the probability mass function and is used primarily by convention. Where there is another probability distribution in question, Q may be used instead, as we will see in the definition of KL divergence.

To further exemplify, when the cumulative distribution function (CDF) of a continuous random variable X is f(x), we label it P; where X has a different CDF g(x), the distinction is denoted by the use of a different letter, Q.
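
As a made-up illustration of the labelling, suppose a discrete X takes only the values 0 and 1, and two different distributions of X are in play:

    P(X = 1) = 0.3, \qquad Q(X = 1) = 0.7

P and Q are just names; the actual probabilities live in the functions they label.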

Law of the unconscious statistician (LOTUS)

This is a theorem defined as follows
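
In its standard form, LOTUS says that the expectation of g(X) can be computed directly from the distribution of X, without first working out the distribution of g(X):

    E[g(X)] = \sum_x g(x) \, P(X = x)                      (discrete, PMF P)
    E[g(X)] = \int_{-\infty}^{\infty} g(x) \, f(x) \, dx   (continuous, PDF f)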

This theorem gives us the following useful results
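
For instance (these are typical textbook consequences; the slide in the video may list different ones), taking g(x) = x^2 or g(x) = ax + b gives:

    E[X^2]    = \sum_x x^2 \, P(X = x)
    E[aX + b] = \sum_x (ax + b) \, P(X = x) = a \, E[X] + b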

It is likely that you would only ever need to use these results when solving test problems, hence you might not remember the theorem itself.

I recommend the following video if you wish to get some intuition behind LOTUS, otherwise you may just accept it as is.

Intuition behind LOTUS

Expectation Notation

We can rearrange the terms in LOTUS to get

Observe that the P(X = x) term appears regardless of the argument g(x) in the expectation.

Where there might be multiple probability mass functions in a problem, we can use a subscript in the expectation to denote which probability mass function the random variable has.

By putting P in the subscript of the expectation, it is instantly made clear what our expression should be.
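
In symbols, the subscript tells you which probability mass function gets plugged into the sum:

    E_P[g(X)] = \sum_x g(x) \, P(X = x)
    E_Q[g(X)] = \sum_x g(x) \, Q(X = x)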

Here’s another example. How would you resolve the following?

I hope that it is instantly clear that this resolves to

where, once again, Q in the subscript denotes that the X here has a different probability mass function.

One last example just so we are really clear. How would you resolve this?

The answer is

One last thing, I promise

For the sake of convenience, P(X = x) may be expressed simply as P(x).

Similarly, Q(X = x) may be expressed as Q(x).

Great! Now we can return to KL divergence.

KL divergence is an expectation

KL divergence comes from the field of information theory. It is cross entropy minus entropy. Both of these terms also come from information theory. To understand KL divergence from an information theory POV, I recommend this video

followed by this article by Naoki Shibuya, KL Divergence Demystified. Otherwise, the point that I want to bring across here is that KL divergence is an expectation.

Here, we see that KL divergence, as expressed in discrete summation form in Kumar's video, can also be expressed in its original expectation form. And as we saw at the start of this article, we need to take into account the distinction between discrete random variables and continuous random variables where expectation is involved.

In the discrete case, KL divergence is expressed with a summation (as above) but in the continuous case, it is expressed with an integral.

Credit: Naoki Shibuya, KL Divergence Demystified
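
Written out (these are the standard definitions, with X distributed according to P):

    D_{KL}(P \| Q) = E_P\left[ \ln \frac{P(X)}{Q(X)} \right]                    (expectation form)
    D_{KL}(P \| Q) = \sum_x P(x) \ln \frac{P(x)}{Q(x)}                          (discrete)
    D_{KL}(P \| Q) = \int_{-\infty}^{\infty} p(x) \ln \frac{p(x)}{q(x)} \, dx   (continuous, with densities p and q)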

In Kumar's video, he is trying to prove the KL divergence between 2 Gaussians, which are continuous distributions. Therefore, strictly speaking, we should be using the integral form and not the summation form. However, as we will see, this does not affect the result of his calculations. I can only guess that he used the summation form for convenience, so that he wouldn't have to carry the dx at the end of every expression.
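
If you would like a concrete sanity check that the integral form lines up with the closed-form answer, here is a minimal sketch for the univariate case. It assumes NumPy and SciPy are available and uses made-up parameter values; it numerically integrates p(x) ln(p(x)/q(x)) for two univariate Gaussians and compares the result against the well-known closed form ln(sigma2/sigma1) + (sigma1^2 + (mu1 - mu2)^2) / (2 sigma2^2) - 1/2.

    # A minimal numerical check of the integral form, assuming NumPy and SciPy are available.
    # The parameter values are made up purely for illustration.
    import numpy as np
    from scipy.stats import norm
    from scipy.integrate import quad

    mu1, sigma1 = 0.0, 1.0   # distribution P
    mu2, sigma2 = 1.0, 2.0   # distribution Q

    # Integrand of the continuous KL divergence: p(x) * ln(p(x) / q(x))
    def integrand(x):
        p = norm.pdf(x, mu1, sigma1)
        q = norm.pdf(x, mu2, sigma2)
        return p * np.log(p / q)

    # Integrate over [-30, 30]; for these parameters the tails beyond that are negligible.
    kl_integral, _ = quad(integrand, -30, 30)

    # Well-known closed form for two univariate Gaussians
    kl_closed = np.log(sigma2 / sigma1) + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2) - 0.5

    print(kl_integral, kl_closed)  # the two numbers should agree up to numerical error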

The logarithm in KL divergence

In Kumar's video, the natural log (ln on the calculator) is used in the KL divergence. This is because there is an exponential in the definition of a multivariate normal distribution, and using ln allows the use of a certain logarithmic property. However, there is no requirement on which base of logarithm has to be used; the base only affects the absolute scale. Source
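
Concretely, the base only rescales the answer because every logarithm is a constant multiple of every other:

    \log_b x = \frac{\ln x}{\ln b}

so computing the KL divergence with log base 2 instead of ln simply divides the whole quantity by ln 2, reporting it in bits rather than nats.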

Proving KL divergence between 2 Gaussians

I think Kumar has done a superb job in his proof so I’m not even going to try and do a better job over this medium (no pun intended). What I do want to do in this blog post is highlight some potential areas of confusion that hopefully will be cleared by the ideas introduced earlier.

1. Base of logarithm

Credit: Ahlad Kumar 6:08

Here the log uses a natural base. This allows us to use the following logarithmic property
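
The property in question is presumably the pair of identities that let the natural log cancel the exponential inside the Gaussian density:

    \ln e^x = x, \qquad \ln \frac{a}{b} = \ln a - \ln b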

2. Should use integral

Credit: Ahlad Kumar 7:09

We are dealing with continuous distributions so we should be using an integral in KL divergence.

3. Expectation notation

Credit: Ahlad Kumar 8:32

This should be clear by now. It is exactly what we have covered.
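
As a closing sanity check, here is a minimal sketch that puts the two big ideas of this post together: it evaluates the standard closed-form KL divergence between two multivariate Gaussians and compares it with a Monte Carlo estimate of the expectation E_P[ln p(X) - ln q(X)]. It assumes NumPy and SciPy are available, and the two 2-D Gaussians are made up purely for illustration.

    # A minimal end-to-end check, assuming NumPy and SciPy are available.
    # The two 2-D Gaussians below are made up purely for illustration.
    import numpy as np
    from scipy.stats import multivariate_normal

    mu1, cov1 = np.array([0.0, 0.0]), np.array([[1.0, 0.2], [0.2, 1.0]])   # distribution P
    mu2, cov2 = np.array([1.0, -1.0]), np.array([[2.0, 0.0], [0.0, 0.5]])  # distribution Q
    k = len(mu1)                                                           # dimension

    # Standard closed form:
    # 0.5 * ( tr(cov2^-1 cov1) + (mu2 - mu1)^T cov2^-1 (mu2 - mu1) - k + ln(det cov2 / det cov1) )
    cov2_inv = np.linalg.inv(cov2)
    diff = mu2 - mu1
    kl_closed = 0.5 * (
        np.trace(cov2_inv @ cov1)
        + diff @ cov2_inv @ diff
        - k
        + np.log(np.linalg.det(cov2) / np.linalg.det(cov1))
    )

    # KL divergence is an expectation under P, so averaging ln p(x) - ln q(x)
    # over samples x ~ P should approach the closed form.
    p = multivariate_normal(mu1, cov1)
    q = multivariate_normal(mu2, cov2)
    x = p.rvs(size=200_000, random_state=0)
    kl_mc = np.mean(p.logpdf(x) - q.logpdf(x))

    print(kl_closed, kl_mc)  # the two numbers should be close, up to Monte Carlo noise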

All right, that's all I have for you. Just also be careful not to mix up μ with the letter p; the handwriting was not very clear on some slides. If you have any other questions regarding the video or this blog post, be sure to drop a question in the comments. Otherwise, I hope the video, accompanied by this blog post, helped you fully understand the proof for the KL divergence between 2 Gaussians.

Thumbnail (Credit: Ahlad Kumar)
