Variational Autoencoders

A variational autoencoder (VAE) resembles a classical autoencoder: it is a neural network consisting of an encoder, a decoder and a loss function. VAEs let us design generative models of data and fit them to large datasets, and they can also be used for image generation and reinforcement learning. For example, a practical application may be to generate trees for a forest in a video game, which should all be similar but not identical. VAEs are a recent advancement in machine learning, having been introduced only in 2013. However, they address a long-standing problem in machine learning, and do so with weak assumptions and fast training via backpropagation, which explains their rapid rise in popularity [1].

The first part of the VAE is the encoder, a network which takes the input and converts it into a latent vector. It could be trained by minimizing the mean squared error between the input and the reconstructed output, as in a standard autoencoder. With images, for example, we can then represent something like a picture of a cat as the vector [1.9, 8.2, 2.1]. The vector is called latent because, given just an output from the model, we don’t necessarily know which settings of the latent variables generated it; they have to be inferred, for example using computer vision.
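As a rough illustration of the encode, decode and reconstruction-error steps, here is a minimal sketch in plain Python. The linear `encode` and `decode` functions and their hand-picked weights are assumptions made for the example; a real autoencoder would use trained, non-linear networks.

```python
def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def mean_squared_error(original, reconstruction):
    """Average squared difference between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(original, reconstruction)) / len(original)

# Hypothetical, untrained weights: 4 input values -> 2 latent values -> 4 outputs.
ENCODER_WEIGHTS = [[0.5, 0.5, 0.0, 0.0],
                   [0.0, 0.0, 0.5, 0.5]]
DECODER_WEIGHTS = [[1.0, 0.0],
                   [1.0, 0.0],
                   [0.0, 1.0],
                   [0.0, 1.0]]

def encode(x):
    """Compress the input into a shorter latent vector."""
    return matvec(ENCODER_WEIGHTS, x)

def decode(z):
    """Expand a latent vector back to the input size."""
    return matvec(DECODER_WEIGHTS, z)

x = [1.0, 3.0, 2.0, 2.0]
z = encode(x)                         # latent vector
x_hat = decode(z)                     # reconstruction of x
loss = mean_squared_error(x, x_hat)   # quantity that training would minimize
```

Here the encoder averages pairs of inputs, so the reconstruction loses the difference within each pair, and the mean squared error measures exactly that lost detail.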

However, to make the VAE a generative model, we must add a constraint on the encoding network that forces it to generate latent vectors that roughly follow a unit Gaussian distribution. This is the key feature of variational autoencoders: it allows the user to generate an output similar to the data the VAE was trained on by sampling a latent vector from the unit Gaussian and passing it straight to the decoder. The problem now is to make the network’s latent variables match the unit Gaussian distribution as closely as possible while still accurately reconstructing the input.
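The generation step can be sketched as follows: draw a latent vector from a unit Gaussian and pass it to the decoder. The `decode` function below is a hypothetical stand-in (a fixed linear map with made-up weights), used only so the example runs; in practice it would be the trained decoder network.

```python
import random

LATENT_DIM = 3

# Hypothetical stand-in for a trained decoder: a fixed linear map from
# 3 latent values to 4 output values.
DECODER_WEIGHTS = [
    [0.2, 0.0, 0.1],
    [0.0, 0.3, 0.0],
    [0.1, 0.1, 0.1],
    [0.0, 0.0, 0.4],
]

def sample_unit_gaussian(dim, rng):
    """Draw a latent vector whose entries are independent N(0, 1) samples."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def decode(z):
    """Map a latent vector to an output vector (placeholder decoder)."""
    return [sum(w * v for w, v in zip(row, z)) for row in DECODER_WEIGHTS]

rng = random.Random(0)  # seeded so the sketch is reproducible
samples = [decode(sample_unit_gaussian(LATENT_DIM, rng)) for _ in range(5)]
```

Each call produces a different output because each latent vector is a fresh random draw, which is what gives outputs that are similar but not identical.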

In simplified mathematical terms, this is done by changing the loss to be the sum of the mean squared error and a latent loss. The mean squared error, as usual, measures how accurately the network reconstructs images. The latent loss is the Kullback-Leibler divergence (KL divergence), which measures how closely the latent variables match a unit Gaussian distribution. To make it computable, the encoder is changed to generate a vector of means and a vector of standard deviations, rather than a vector of real values. The KL divergence can be calculated from these.
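A minimal sketch of this combined loss is given below. It assumes a diagonal Gaussian per latent dimension, for which the KL divergence to a unit Gaussian has the closed form 0.5 * sum(mean^2 + stddev^2 - log(stddev^2) - 1). The function names are hypothetical, and the `sample_latent` step (mean plus standard deviation times unit Gaussian noise) is the commonly used way to draw the latent vector rather than something stated in the text above.

```python
import math
import random

def kl_to_unit_gaussian(means, stddevs):
    """KL divergence from N(mean, stddev^2) to N(0, 1), summed over dimensions."""
    return 0.5 * sum(m * m + s * s - math.log(s * s) - 1.0
                     for m, s in zip(means, stddevs))

def sample_latent(means, stddevs, rng):
    """Assumed sampling step: z = mean + stddev * noise, with noise ~ N(0, 1)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(means, stddevs)]

def mean_squared_error(original, reconstruction):
    """Average squared difference between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(original, reconstruction)) / len(original)

def vae_loss(x, x_hat, means, stddevs):
    """Reconstruction error plus latent loss, as an unweighted sum."""
    return mean_squared_error(x, x_hat) + kl_to_unit_gaussian(means, stddevs)

# The latent loss is zero exactly when the encoder outputs a unit Gaussian.
zero_penalty = kl_to_unit_gaussian([0.0, 0.0], [1.0, 1.0])

# Shifting the mean away from zero is penalized.
shifted_penalty = kl_to_unit_gaussian([1.0], [1.0])

total = vae_loss([1.0, 3.0], [2.0, 2.0], [1.0], [1.0])
```

The two terms pull in opposite directions: the reconstruction term rewards latent vectors that carry a lot of information about the input, while the KL term pulls every encoding towards the same unit Gaussian.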


Figure 1: Simplification of layers in a variational autoencoder [2].


[1]   Doersch C. Tutorial on Variational Autoencoders. Available from:

[2]   Frans K. Variational Autoencoders Explained. Available from: