Understanding Regularization in Neural Networks
Have you ever seen unusual behavior in your neural network's predictions? If you guessed that I am talking about overfitting and underfitting, you are absolutely right.
A neural network sometimes memorizes its training data and, as a result, performs poorly on different test data. As you can see in the above graph, the curve passes through almost every sample, which in simple terms means that it has memorized the data. In technical terms, overfitting occurs when a network fits the training data very closely but fails to generalize to additional data.
In contrast, an underfit network performs poorly even on the training data and also cannot fit additional data. This happens when the network is unable to learn the patterns in the training data, often because it is not complex enough or not suitable for that particular data. In professional terms, a poorly fitted model can cost a lot of money. Imagine a face attendance system that is expected to detect a face but instead detects something else and classifies it as a face.
The main focus of this article, however, is overcoming overfitting.
So how can we deal with the painstaking problem of overfitting, you might ask?
We can use a technique called regularization. It alters the training algorithm slightly so that the model fits the training data a little less tightly and, in exchange, fits additional data better.
As we saw, underfitting occurs when a model is not complex enough. Regularization uses the same idea in moderation: it makes the model a little simpler, which helps it generalize better.
Regularization techniques to overcome overfitting -
- Data Augmentation
- L1 Regularization
- L2 Regularization
- Dropout
- Early Stopping
1. Data Augmentation
When training your network, the data you feed it has a huge impact on how the model performs afterwards. If your training data is insufficient, or if it has similar patterns all over the corpus or dataset, the model is likely to overfit. Although data augmentation does not alter your network, it is a good technique to boost its performance.
Data augmentation simply means increasing the size of your dataset. For example, suppose you have images of dogs and cats: 1000 images of dogs and only 100 images of cats. Your model will likely be biased toward dog images and perform very poorly on cats. In this case, you can increase the number of cat images with augmentation techniques such as -
- Flipping (horizontal or vertical mirroring)
- Rotation (You can choose any relevant rotation degree you want)
- Zooming in
- Zooming out
- Adjusting the properties of the image like brightness, sharpness, etc
- Remove low probability words (in case of text data)
Below is an example of data augmentation of a dog image. As you can see one image is augmented to 12 different images.
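To make the idea concrete, here is a minimal sketch of three of the techniques above (flipping, rotation, and brightness adjustment) in plain Python. The tiny grid of grayscale values and the helper names are made up for illustration; in a real project you would normally rely on the augmentation utilities of your image or deep learning library instead.

```python
# A tiny "image" as a grid of grayscale pixel values (0-255); purely illustrative.

def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def rotate_90_clockwise(image):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def adjust_brightness(image, delta):
    """Add delta to every pixel, clamping to the valid 0-255 range."""
    return [[max(0, min(255, p + delta)) for p in row] for row in image]


def augment(image):
    """Return the original image plus a few augmented variants."""
    return [
        image,
        flip_horizontal(image),
        rotate_90_clockwise(image),
        adjust_brightness(image, 40),
    ]


image = [
    [10, 20, 30],
    [40, 50, 60],
]
variants = augment(image)  # one original image becomes four training samples
```

Each variant keeps the same label as the original image, which is how a small class (the cats in the example above) can be enlarged without collecting new data.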
2. L1 Regularization
L1 regularization, the penalty used in Lasso regression, adds the sum of the absolute values of the weights to the loss. Consider the loss term L(x, y) -
where the term inside the red box is the regularizing term. Its job is to keep making the weights smaller (some can become exactly zero), which simplifies the network and so helps avoid overfitting.
It is useful when the number of input features is high, because the absolute-value penalty drives many coefficients to exactly zero, and features with zero coefficients are ignored.
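For reference, an L1-regularized loss is commonly written in the form below. The symbols here are assumptions rather than notation taken from this article: \lambda is the regularization strength, w_i are the network weights, and \tilde{L} is the regularized loss. The second term on the right is the regularizing term.

```latex
\[
\tilde{L}(x, y) = L(x, y) + \lambda \sum_{i} \lvert w_i \rvert
\]
```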
    # Sample of adding L1
    import tensorflow as tf
    from tensorflow.keras import layers
    from tensorflow.keras import regularizers

    dense = tf.keras.layers.Dense(3, kernel_regularizer='l1')
3. L2 Regularization
L2 regularization, the penalty used in Ridge regression, uses the square of each coefficient, unlike L1 regularization, which uses the absolute value.
It is also known as weight decay because it pushes the weights towards zero (but not exactly to zero).
    # Sample of adding L2
    import tensorflow as tf
    from tensorflow.keras import layers
    from tensorflow.keras import regularizers

    dense = tf.keras.layers.Dense(3, kernel_regularizer='l2')
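To see how the two penalties differ, here is a small sketch in plain Python. The function names, the example weights, and the values of `lam` (regularization strength) and `lr` (learning rate) are all assumptions chosen for illustration. The last function applies one gradient step on the L2 penalty alone, which shows the "weight decay" effect: every weight shrinks by the same factor, so non-zero weights get smaller but do not reach zero.

```python
def l1_penalty(weights, lam):
    """L1 penalty: lam times the sum of absolute weight values."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """L2 penalty: lam times the sum of squared weight values."""
    return lam * sum(w * w for w in weights)


def l2_decay_step(weights, lam, lr):
    """One gradient step on the L2 penalty alone: w <- w - lr * 2 * lam * w."""
    return [w - lr * 2 * lam * w for w in weights]


weights = [0.5, -2.0, 0.0, 1.5]  # hypothetical layer weights
lam = 0.01  # regularization strength (assumed value)
lr = 0.1    # learning rate (assumed value)

print("L1 penalty:", l1_penalty(weights, lam))
print("L2 penalty:", l2_penalty(weights, lam))

decayed = l2_decay_step(weights, lam, lr)
print("Weights after one decay step:", decayed)
```

In practice the penalty is added to the data loss and both are minimized together; the step above isolates the penalty only to show its shrinking effect.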
4. Dropout
Dropout is a type of regularization that reduces the complexity of a network by randomly dropping (ignoring) neurons during training. A dropped neuron temporarily has no effect on the output, and its weights are not updated for that step. This is a very common regularization technique in deep learning models.
Note: It is used only while training, not when evaluating the model.
So how many nodes should I drop?
The number of nodes or neurons to drop is determined by a probability. This probability is a hyperparameter that you pass to the dropout layer.
That is why dropout is preferred for deep neural networks: with so many neurons, you can introduce a lot of randomness, which in turn benefits your network.
    from keras import Input, Sequential
    from keras.layers import Dense, Dropout

    # After a layer like Dense you can add a Dropout layer, which takes a float
    # argument: the probability of dropping a unit (the hyperparameter).
    # input_num_units and hidden1_num_units are assumed to be defined elsewhere.
    model = Sequential([
        Input(shape=(input_num_units,)),
        Dense(hidden1_num_units),
        Dropout(0.5),
        # ....... remaining layers
    ])
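To show what a dropout layer does internally, here is a minimal plain-Python sketch of "inverted" dropout, the variant Keras uses: during training each unit is zeroed with probability `rate` and the surviving units are scaled up by 1 / (1 - rate), while at evaluation time the input passes through unchanged. The function name, the `rng` parameter, and the example activations are assumptions made for illustration.

```python
import random


def dropout(activations, rate, training=True, rng=random):
    """Inverted dropout on a list of activations.

    During training, zero each unit with probability `rate` and scale the
    survivors by 1 / (1 - rate). Otherwise return the input unchanged.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(0)  # seeded only so the sketch is repeatable
activations = [1.0, 2.0, 3.0, 4.0]

train_out = dropout(activations, rate=0.5, rng=rng)        # some units zeroed
eval_out = dropout(activations, rate=0.5, training=False)  # unchanged
```

The scaling keeps the expected value of each activation the same in training and evaluation, which is why nothing needs to change when the layer is switched off for evaluation.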
5. Early Stopping
It is good to train your model for a long time. But hear me out, there's a catch: the training should not run so long that the model starts to overfit.
So if we treat the number of epochs as a hyperparameter, we can cross-validate to find the epoch at which the results are best. But isn't this a laborious task too?
Yes, it is. Just imagine collecting the results for many different epoch counts first, then selecting the best one, and then training again. Instead, a technique called early stopping is used.
1. Before starting the training, divide your data into training and validation sets. While training is in progress, also check the accuracy on the validation data; if the accuracy degrades at some point, treat the epoch before that as your best epoch. There is a simple callback for this technique in Keras -
    from keras.callbacks import EarlyStopping

    """
    args:
    monitor = The quantity to be monitored.
    patience = The number of epochs with no improvement in the monitored
               quantity after which the training is stopped.
    """
    # 'val_loss' is the validation loss Keras logs when validation data is given
    EarlyStopping(monitor='val_loss', patience=3)
2. Another simple technique that I use most often while training very complex networks is storing checkpoints. Checkpointing is closely related to early stopping, but you don't stop the training; you select the best result in a single run. The checkpoints are simply the weight files produced after an epoch of training. You can also store checkpoints conditionally, for example at every epoch that is a multiple of 10 when the maximum number of epochs is 100. Here is a simple way to store a checkpoint with TensorFlow Keras -
    import tensorflow as tf

    # Newer Keras versions require weights-only checkpoint paths
    # to end in '.weights.h5'
    checkpoint_filepath = '/tmp/checkpoint.weights.h5'
    model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
        filepath=checkpoint_filepath,
        save_weights_only=True,
        monitor='val_accuracy',
        mode='max',
        save_best_only=True)
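The logic behind both ideas, stopping after a number of epochs without improvement and remembering which epoch was best, can be sketched in a few lines of plain Python. This is only an illustration of the rule, not the Keras implementation; the function name and the validation losses below are made up.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 0-indexed.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs. best_epoch is the epoch whose weights
    a "save best only" checkpoint would have kept.
    """
    best_loss = float("inf")
    best_epoch = -1
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Never triggered: training ran for all available epochs
    return len(val_losses) - 1, best_epoch


# Hypothetical validation losses, one per epoch
val_losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.61, 0.70, 0.80]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch)  # prints "5 2"
```

Here the loss last improves at epoch 2, three epochs pass without a better value, and training stops at epoch 5 while the weights from epoch 2 are the ones worth keeping.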
There it is! Thank you for reading the article. I hope it helped you understand what regularization is, why it is important, and what some regularization techniques are. Cheers :)