Normalization in Deep Learning
Deep learning is an exciting field of artificial intelligence, at the forefront of some of the most innovative areas of research such as computer vision, reinforcement learning, and natural language processing. Deep neural networks have complex architectures with many layers, which makes them difficult to train: they are sensitive to the learning algorithm's initial random weights and configuration.
In a deep neural network, there is a phenomenon called internal covariate shift, which is a change in the input distribution to the network's layers due to the ever-changing network parameters during training.
The input layer may also have features that dominate training simply because they take on large numerical values. This can bias the network, because only those features meaningfully contribute to the outcome of training. For example, imagine feature one taking values between 1 and 5, and feature two taking values between 100 and 10,000. Because of the difference in scale, feature two would dominate the network during training, and only that feature would contribute to the model's output.
For these reasons, a technique known as normalization was introduced to resolve these issues.
When the features in the data have different ranges, normalization is an approach used during data processing to adjust the values of numeric columns in a dataset to a similar scale. Normalization has several advantages, including:
- Reducing the internal covariate shift to improve training
- Scaling each feature to a similar range to prevent or reduce bias in the network
- Speeding up the optimization process by preventing weights from exploding all over the place and limiting them to a specific range
- Reducing overfitting in the network by aiding in regularization
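To make the scaling idea concrete, here is a minimal sketch (using NumPy; the feature ranges are the hypothetical ones from the example above) that rescales each feature to the [0, 1] range with min-max normalization:

```python
import numpy as np

# Two features on very different scales: feature 1 in [1, 5], feature 2 in [100, 10000]
X = np.array([[1.0,   100.0],
              [3.0,  5000.0],
              [5.0, 10000.0]])

# Min-max normalization: rescale every column (feature) to the [0, 1] range
X_min = X.min(axis=0)
X_max = X.max(axis=0)
X_norm = (X - X_min) / (X_max - X_min)

print(X_norm)  # both columns now lie in [0, 1]
```

After this step, both features span the same range, so neither can dominate the gradient updates purely because of its magnitude.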
From the figure above, we can see the benefits of normalization in action. Without normalization, the model's accuracy barely improves from epoch to epoch because of the large disparity in the numerical values of the features; once the features are normalized, accuracy climbs steadily with each epoch. Several flavors of normalization have been developed over the years, including:
- Batch normalization
- Layer normalization
- Instance normalization
- Group normalization
Each technique is examined in detail in the following sections.
Batch normalization is the most common form of normalization in deep learning. It standardizes the inputs to a layer for each mini-batch during the training of deep neural networks. This stabilizes the learning process and significantly reduces the number of epochs required to train deep networks, enabling the network to train faster.
Batch normalization works by calculating the mean and variance of every feature in the mini-batch, then subtracting the mean from each feature and dividing by the standard deviation of the mini-batch. This can be represented mathematically as:
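A standard formulation (following Ioffe and Szegedy's original batch-normalization paper), for a mini-batch of $m$ examples $x_1, \dots, x_m$, is:

```latex
\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad
\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2

\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad
y_i = \gamma \hat{x}_i + \beta
```

where $\epsilon$ is a small constant added for numerical stability, and $\gamma$ and $\beta$ are learnable scale and shift parameters.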
Above, we mentioned that we find the mean and the variance and then normalize, but there is an additional step: the scale and shift. Its purpose is to introduce new learnable parameters, in case the network performs better when the normalized activations are scaled back up in magnitude. Even though this is the most popular normalization technique in deep learning, it has some drawbacks: with a batch size of 1, the variance is 0, which defeats the purpose, and batch normalization cannot be applied. Moreover, if the mini-batch is small, the batch statistics are noisy estimates, and training is disturbed.
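The steps above can be sketched as a NumPy forward pass (a minimal illustration, not a full training-time implementation; in practice gamma and beta are learned, and running statistics are kept for inference):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature (column) over the mini-batch, then scale and shift."""
    mu = x.mean(axis=0)                    # per-feature mean over the batch
    var = x.var(axis=0)                    # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # standardize
    return gamma * x_hat + beta            # learnable scale and shift

batch = np.array([[1.0,   100.0],
                  [3.0,  5000.0],
                  [5.0, 10000.0]])
out = batch_norm(batch, gamma=np.ones(2), beta=np.zeros(2))
print(out.mean(axis=0))  # each feature now has (approximately) zero mean
```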
A different form of normalization is layer normalization, which was introduced to address the drawbacks of batch normalization, particularly its dependence on the mini-batch. This approach normalizes the summed inputs across the features, as opposed to normalizing each input feature across the mini-batch as batch normalization does. Mathematically, layer normalization can be represented as:
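Using $i$ to index the examples in the batch and $j$ to index the $K$ features of each example, the statistics are computed per example:

```latex
\mu_i = \frac{1}{K}\sum_{j=1}^{K} x_{ij}, \qquad
\sigma_i^2 = \frac{1}{K}\sum_{j=1}^{K} (x_{ij} - \mu_i)^2, \qquad
\hat{x}_{ij} = \frac{x_{ij} - \mu_i}{\sqrt{\sigma_i^2 + \epsilon}}
```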
Here i refers to the batch examples and j refers to the features, so, as can be seen from the equation, the statistics are computed over the features rather than over the batch. Layer normalization can also be applied to recurrent neural networks by normalizing separately at each time step.
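A minimal NumPy sketch of layer normalization (without the optional learnable scale and shift) makes the difference from batch normalization clear: the reduction runs over the feature axis, so it works even with a batch of one.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize across the features of each example (row), independent of batch size."""
    mu = x.mean(axis=1, keepdims=True)   # per-example mean over the features
    var = x.var(axis=1, keepdims=True)   # per-example variance over the features
    return (x - mu) / np.sqrt(var + eps)

# Works even with a batch size of 1, where batch normalization breaks down
single = np.array([[2.0, 4.0, 6.0]])
out = layer_norm(single)
print(out)  # zero-mean, unit-variance across the three features
```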
Instance normalization is similar to the previously discussed layer normalization technique, with the difference that instance normalization normalizes across each channel of each training example, while layer normalization normalizes across the input features.
In computer vision, instance normalization normalizes each image on its own rather than a whole batch of images, as in the batch normalization technique. Instance normalization can also be used on the test data because it doesn't depend on the mini-batch. This technique can be represented as:
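With $t$ indexing the image in the batch, $i$ the channel, and $k$ and $j$ the spatial positions over height $H$ and width $W$, the statistics are computed per image and per channel:

```latex
\mu_{ti} = \frac{1}{HW}\sum_{k=1}^{H}\sum_{j=1}^{W} x_{tikj}, \qquad
\sigma_{ti}^2 = \frac{1}{HW}\sum_{k=1}^{H}\sum_{j=1}^{W} (x_{tikj} - \mu_{ti})^2

y_{tikj} = \frac{x_{tikj} - \mu_{ti}}{\sqrt{\sigma_{ti}^2 + \epsilon}}
```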
In the equation, k and j index the image height and width, i indexes the channel (if the input is RGB, a color channel), and t indexes the image within the batch, all denoted by subscripts. The main takeaway from the equation is the introduction of the channel index, which is where this technique performs its normalization. Instance normalization plays a big role in style transfer with generative adversarial networks, and is sometimes called contrast normalization.
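As a sketch, instance normalization in NumPy reduces over only the spatial axes of an (N, C, H, W) tensor, leaving every (image, channel) pair normalized independently:

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Normalize each channel of each image separately over its spatial dimensions.

    x has shape (N, C, H, W): batch, channels, height, width.
    """
    mu = x.mean(axis=(2, 3), keepdims=True)   # per-image, per-channel mean
    var = x.var(axis=(2, 3), keepdims=True)   # per-image, per-channel variance
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
images = rng.normal(size=(2, 3, 4, 4))        # 2 RGB-like images, 4x4 pixels
out = instance_norm(images)
print(out.mean(axis=(2, 3)))                  # ~0 for every (image, channel) pair
```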
Group normalization divides the channels into groups and normalizes them within each training example. It reduces to instance normalization when every channel is placed in its own group, and to layer normalization when all channels are placed in a single group, so group normalization can be seen as a hybrid of the two. Group normalization also doesn't depend on the batch size, unlike batch normalization. Mathematically, group normalization can be represented as:
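Following Wu and He's formulation, with $i$ indexing a feature and $S_i$ the set of $m$ features in the same group as $i$:

```latex
\mu_i = \frac{1}{m}\sum_{k \in S_i} x_k, \qquad
\sigma_i = \sqrt{\frac{1}{m}\sum_{k \in S_i} (x_k - \mu_i)^2 + \epsilon}, \qquad
\hat{x}_i = \frac{x_i - \mu_i}{\sigma_i}

S_i = \left\{ k \;\middle|\; k_N = i_N,\ \left\lfloor \frac{k_C}{C/G} \right\rfloor = \left\lfloor \frac{i_C}{C/G} \right\rfloor \right\}
```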
Here x is the feature and i is its index. N is the batch axis, C the channel axis, and H and W, as usual, the height and width. G refers to the number of groups, with C/G indicating the number of channels in a group. Group normalization computes the mean and variance along the height and width axes and across each group of C/G channels. Group normalization has the capacity to replace batch normalization for a number of tasks.
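A minimal NumPy sketch (again without the learnable scale and shift): reshaping the channel axis into (G, C/G) lets the mean and variance be taken over each group's channels and spatial positions at once.

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    """Split the channels into groups and normalize each group within each example.

    x has shape (N, C, H, W); num_groups must divide C evenly.
    """
    n, c, h, w = x.shape
    x = x.reshape(n, num_groups, c // num_groups, h, w)
    mu = x.mean(axis=(2, 3, 4), keepdims=True)   # mean over a group's channels and pixels
    var = x.var(axis=(2, 3, 4), keepdims=True)
    x = (x - mu) / np.sqrt(var + eps)
    return x.reshape(n, c, h, w)

rng = np.random.default_rng(0)
feats = rng.normal(size=(2, 6, 4, 4))
out = group_norm(feats, num_groups=2)  # G=6 would give instance norm, G=1 layer norm
```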
The different types of normalization can be easy to confuse. Hopefully, the image below clears up some doubts and offers clarity on how each technique performs its normalization.
With N, C, H, W referring to the batch, channels, height, and width respectively: batch normalization normalizes in the N direction, while layer normalization and group normalization normalize in the C direction, with group normalization, as the name suggests, dividing the channels (C) into groups and normalizing each group separately.