Understanding Autoencoders - An Unsupervised Learning approach

This article covers the concept of Autoencoders: what Autoencoders are, the architecture of an Autoencoder, and the intuition behind training one.
Tejas Khare
May 12, 2021

Neural Networks can learn the patterns in the data provided and predict a result for new data. For example, we can use them to predict whether a person will get a loan from a bank, given features of the data such as income, credit score, debt, assets, expenses, etc. In this example, a classifier such as Logistic Regression is a sensible choice because we need a prediction that tells us whether the person will get the loan or not.

But can we use neural networks to produce an output, rather than predict something about given data?

Fortunately, yes we can. This is where Autoencoders come in, with an architecture that is as elegant as it is useful for these sorts of problems. A generative Autoencoder can produce an image of a dog just from a compact description, or "prescription", of a dog. Generative modeling is a newer concept compared with simple neural networks or convolutional neural networks and is somewhat underrepresented.

Hence, in this article, I would like to shed some light on the generative model - Autoencoder.


Autoencoders are a type of unsupervised learning technique used primarily for learning a representation of the given input data. The input data can be an image, text, speech, or even a video, which is nothing but sequential images or frames. The Autoencoder tries to learn a compressed representation of the input from which the closest possible match to the input can be recovered.

You might have guessed already that autoencoders use the concept of encoding somewhere in the process. Well, that's not a hard thing to guess, and you are absolutely right. An Autoencoder takes in data and encodes it automatically, thanks to its architecture. But we cannot use the encoded output from the network directly. Hence, we also need a decoder to transform the encodings back into a useful format. The decoder output should be similar to the input; an autoencoder that achieves this is considered a good one.

In terms of shapes, if you give an input of shape (256, 256, 3), then the output of the autoencoder should also be (256, 256, 3). Autoencoders compress the input data into a lower dimension, try to learn its features, and finally try to reconstruct the closest representation of the input data from what they have learned.

Note: The compression will only be suitable for the type of data on which the autoencoder has been trained. For example, if the autoencoder is trained on images of cars, you cannot expect the autoencoder to generate a bicycle. ;) Autoencoders do not need labels to be passed, but that does not mean they train without a target. The input itself serves as the target, so the "labels" come from the data.

The Architecture of an Autoencoder

Now let's see the beautiful architecture of an autoencoder that I was talking about earlier. It contains a bottleneck type structure that compresses the input data and finds the correlations between the features.

There are three parts in an autoencoder -

1. Encoder

This part is responsible for reducing the dimensions of the input data to minimize the computation required to learn the patterns and features of the data. It can consist of a single neural network layer or multiple layers. For example, we pass an input image of size (256, 256, 3). The first convolutional layer reduces it to (128, 128, 32), for instance by using 32 filters with stride 2 and suitable padding. This layer will learn some representation. The next layer gives a dimension of (64, 64, 64), for instance by using 64 filters with stride 2. For input size n, kernel size k, stride s, and padding p, the output size along each spatial dimension is floor((n + 2p - k) / s) + 1. This layer will learn a combined, more abstract representation, and so on.
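The shape arithmetic for the encoder can be checked with a few lines of Python. This is only a sketch: the helper name `conv_output_size` and the choice of 3x3 kernels with padding 1 are assumptions made for illustration, not values taken from the article.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1


# Assumed settings: 3x3 kernel, stride 2, padding 1 halve each spatial dimension.
print(conv_output_size(256, kernel=3, stride=2, padding=1))  # 128
print(conv_output_size(128, kernel=3, stride=2, padding=1))  # 64

# With stride 1 and no padding, even a large kernel shrinks the input only slightly.
print(conv_output_size(256, kernel=16, stride=1))  # 241
```

Note that the number of channels in the output (32, 64, ...) is set by the number of filters in the layer, not by the kernel size or the stride.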

2. Latent Space

The latent space, also called the code, is the lower-dimensional representation produced by the encoder. It acts as the intermediate step between the encoder and the decoder.

3. Decoder

This part is responsible for transforming the encoding back to the input dimension and producing the closest possible reconstruction of the input. There is no separate target value, since the input itself is the target, and therefore it's an unsupervised technique. The decoder can also contain a single neural network layer or multiple layers, which work in the opposite direction to the encoder layers. It contains transpose convolutional layers that decode the encodings, that is, increase the dimensions. Taking the example from the encoder, the output of the decoder will be a (256, 256, 3) image. For input size n, kernel size k, stride s, and padding p, the output size of a transpose convolution along each spatial dimension is (n - 1) * s - 2p + k, plus any output padding.
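The decoder's upsampling can be checked the same way. The sketch below uses the standard output-size formula for a transpose convolution without dilation; the helper name `conv_transpose_output_size` and the settings (3x3 kernel, stride 2, padding 1, output padding 1) are illustrative assumptions, not values from the article.

```python
def conv_transpose_output_size(n, kernel, stride=1, padding=0, output_padding=0):
    """Spatial output size of a transpose convolution (no dilation):
    (n - 1) * s - 2p + k + output_padding."""
    return (n - 1) * stride - 2 * padding + kernel + output_padding


# Assumed settings that exactly double each spatial dimension.
size = 64
for _ in range(2):
    size = conv_transpose_output_size(
        size, kernel=3, stride=2, padding=1, output_padding=1
    )
    print(size)  # 128, then 256
```

Two such layers take a 64 x 64 feature map back to 256 x 256, mirroring the two downsampling steps of the encoder example.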

How to train an Autoencoder?

In the above image, you can see that the autoencoder has 6 neurons at the input of the encoder and at the output of the decoder, and the 6 features are represented with 4 neurons in the next layer on either side. Let's call our input Y and our generated output Y'.

The intuition behind training -

  1. We feed the input Y to the autoencoder
  2. We obtain the generated output from the autoencoder Y'
  3. Let's take L to be the loss for the generation, which is the mean absolute difference between Y and Y' over their n elements, represented as -

$$L = \frac{1}{n} \sum_{i=1}^{n} \lvert Y_i - Y'_i \rvert$$

In the scenario of images, L is the difference between the pixels of Y and Y'. But this does not always produce great results: the absolute error is not differentiable at zero, and its gradient has the same magnitude however small the error is, so the learning curve will not be smooth near the optimum. Hence, instead of using Mean Absolute Error (MAE) as our cost function, we often use Mean Squared Error (MSE) over the n pixels -

$$L' = \frac{1}{n} \sum_{i=1}^{n} (Y_i - Y'_i)^2$$

Here we can call the MSE our Reconstruction or Generation loss L'. Hence, the goal of the Autoencoder is to find the optimal values of the parameters - weights 'W' and biases 'b' so that it can minimize the Reconstruction loss.
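To make the training intuition concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: a purely linear autoencoder with 6 inputs and a 2-neuron latent code, toy data, plain stochastic gradient descent, and hypothetical helper names (`forward`, `mae`, `mse`, `train`). A real autoencoder would normally use a deep learning library and non-linear layers.

```python
import random


def forward(y, W_enc, b_enc, W_dec, b_dec):
    """Encode y into the latent code z, then decode z into the reconstruction y_hat."""
    z = [sum(w * v for w, v in zip(row, y)) + b for row, b in zip(W_enc, b_enc)]
    y_hat = [sum(w * v for w, v in zip(row, z)) + b for row, b in zip(W_dec, b_dec)]
    return z, y_hat


def mae(y, y_hat):
    """Mean Absolute Error between input and reconstruction."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


def mse(y, y_hat):
    """Mean Squared Error, used here as the reconstruction loss."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)


def train(data, n_latent=2, lr=0.05, epochs=200, seed=0):
    """Fit weights W and biases b by gradient descent on the MSE reconstruction loss."""
    rng = random.Random(seed)
    d = len(data[0])
    W_enc = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(n_latent)]
    b_enc = [0.0] * n_latent
    W_dec = [[rng.uniform(-0.5, 0.5) for _ in range(n_latent)] for _ in range(d)]
    b_dec = [0.0] * d
    params = (W_enc, b_enc, W_dec, b_dec)

    def mean_loss():
        return sum(mse(y, forward(y, *params)[1]) for y in data) / len(data)

    initial = mean_loss()
    for _ in range(epochs):
        for y in data:  # the input is also the target: no labels needed
            z, y_hat = forward(y, *params)
            # Gradient of the MSE with respect to each output element.
            g = [2.0 * (p - t) / d for p, t in zip(y_hat, y)]
            # Backpropagate to the latent code before the decoder is updated.
            dz = [sum(g[i] * W_dec[i][j] for i in range(d)) for j in range(n_latent)]
            for i in range(d):
                for j in range(n_latent):
                    W_dec[i][j] -= lr * g[i] * z[j]
                b_dec[i] -= lr * g[i]
            for j in range(n_latent):
                for i in range(d):
                    W_enc[j][i] -= lr * dz[j] * y[i]
                b_enc[j] -= lr * dz[j]
    return params, initial, mean_loss()


# Toy data: 6 features that really depend on only 2 underlying values.
data_rng = random.Random(1)
data = []
for _ in range(50):
    a, b = data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)
    data.append([a, b, a, b, a, b])

params, initial_loss, final_loss = train(data)
print("reconstruction loss before training:", initial_loss)
print("reconstruction loss after training: ", final_loss)
```

Because the toy data has only two underlying degrees of freedom, a 2-neuron bottleneck is enough to reconstruct it, so the reconstruction loss should drop noticeably during training. Swapping `mse` for `mae` would also require changing the gradient `g` accordingly.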

And that's it for this article. I hope you found this article helpful. Thank you and Cheers :)
