Activation Functions for Neural Networks
In the previous article, Optimization Methods, we saw some approaches for finding the optimal values of the parameters W and b, that is, the weights and biases. But what about the layer-wise results? Which component is responsible for shaping the result after each hidden layer and after the output layer?
In this article, we will be talking about activation functions, which are applied to the output of each layer before it is passed on.
What is an Activation Function?
An activation function controls how strongly a neuron fires, or whether it fires at all. It defines how the weighted sum computed from the parameters W and b (weights and biases) is transformed before it is forwarded from one layer to the next. There are many activation functions to choose from, and most of them are nonlinear. Selecting an appropriate activation function is very important because it has a significant effect on the output of each layer.
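To make the description above concrete, here is a minimal sketch of a single neuron. The function name `neuron_output` and the example weights, inputs, and bias are made up for illustration; they are not from any particular library.

```python
import math

def neuron_output(weights, inputs, bias, activation):
    """Compute activation(w . x + b) for a single neuron."""
    # Weighted sum of the inputs plus the bias (the pre-activation value)
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # The activation function decides what is passed to the next layer
    return activation(z)

# Hypothetical example values, chosen only for illustration
weights = [0.5, -0.25]
inputs = [2.0, 4.0]
bias = 0.1

# Same neuron, two different activation functions
print(neuron_output(weights, inputs, bias, math.tanh))
print(neuron_output(weights, inputs, bias, lambda z: max(0.0, z)))
```

The weights and bias are identical in both calls; only the activation changes what the next layer would receive.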
Why is it so important to use the appropriate Activation Function?
Consider this scenario: you have one input layer, eight hidden layers, and a single output layer. Let's assume you select an activation function that is not the best one for your second hidden layer. What would happen?
If you guessed there would be a domino effect of errors, then you're correct. It's like solving a math problem step by step. If you make a calculation mistake in the first step, your final answer will almost certainly be wrong even if you follow the correct steps afterwards.
Activation functions are used to introduce nonlinearity into the network, so that the weights and biases learned by minimizing the error can represent more than a straight-line relationship. It is highly unlikely that your neural network will always get to learn from "perfect", simple data. Hence, for not-so-perfect tasks, nonlinear activation functions are used.
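One way to see why nonlinearity matters: without it, stacked layers collapse into a single linear layer. The sketch below uses scalar "layers" with made-up weights (`w1`, `b1`, `w2`, `b2` are hypothetical values) to keep the arithmetic easy to follow.

```python
def linear_layer(w, b, x):
    """A one-input, one-output layer with no activation."""
    return w * x + b

def relu(z):
    return max(0.0, z)

# Hypothetical scalar weights and biases, for illustration only
w1, b1 = 2.0, 1.0
w2, b2 = -3.0, 0.5

def two_linear_layers(x):
    # Two layers stacked without any activation in between
    return linear_layer(w2, b2, linear_layer(w1, b1, x))

def collapsed_layer(x):
    # w2 * (w1 * x + b1) + b2 == (w2 * w1) * x + (w2 * b1 + b2)
    return linear_layer(w2 * w1, w2 * b1 + b2, x)

def two_layers_with_relu(x):
    # Same two layers, but with a nonlinear activation in between
    return linear_layer(w2, b2, relu(linear_layer(w1, b1, x)))

for x in (-2.0, 0.0, 1.5):
    print(x, two_linear_layers(x), collapsed_layer(x), two_layers_with_relu(x))
```

The first two columns always agree, so the second layer adds no expressive power. With ReLU in between, the output differs for inputs where the first layer produces a negative value.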
Here are some activation functions for your perfect and not-so-perfect tasks ;)
Activation functions in this article: Linear, ELU, ReLU, Sigmoid, and Tanh.
The linear function is the simplest activation function: the output is directly proportional to the input.
    def linear_function(c, x):
        """
        args:
            c is a constant
            x is the input to the activation function
        returns:
            output proportional to input
        """
        return c * x
Because the function is linear, its gradient is the constant c no matter how the input "x" changes. Backpropagation therefore gets no input-dependent signal from it, which makes it unsuitable for hidden layers. With c = 1 it is also called the identity function, or no activation.
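The constant gradient can be checked numerically. This is a small sketch using a central-difference approximation; `numerical_gradient` is a helper written for this example, and `c = 3.0` is an arbitrary choice.

```python
def linear_function(c, x):
    return c * x

def numerical_gradient(f, x, h=1e-6):
    """Approximate df/dx at x with a central difference."""
    return (f(x + h) - f(x - h)) / (2 * h)

c = 3.0  # arbitrary constant, chosen for illustration

# The slope comes out (approximately) equal to c at every input
for x in (-10.0, 0.0, 0.5, 100.0):
    grad = numerical_gradient(lambda v: linear_function(c, v), x)
    print(x, round(grad, 6))
```

Whatever value of x is tried, the slope stays at c, so the gradient carries no information about the input.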
Exponential Linear Unit (ELU) is a nonlinear activation function that can give faster minimization of the cost function. One of its main advantages is that it helps with the vanishing gradient problem, which occurs when a very small gradient produces only insignificant updates to the weights and biases. When that happens, the cost function plateaus and we don't see any improvement in further iterations.
    from math import e

    def ELU_activation(alpha, x):
        """
        args:
            alpha is a positive constant
            x is an input
        returns:
            ELU output
        """
        if x > 0:
            return x
        else:
            return alpha * (e ** x - 1)
Though ELU is better than ReLU in some cases, for x > 0 it behaves like the linear activation and can fail in the cases where the linear function fails.
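To see what the gradient of ELU looks like on each side of zero, here is a rough sketch. The derivative is worked out by hand from the formula above (1 for x > 0, alpha * e^x otherwise), and `alpha = 1.0` is just an assumed example value.

```python
from math import exp

def elu(alpha, x):
    return x if x > 0 else alpha * (exp(x) - 1)

def elu_gradient(alpha, x):
    # d/dx of x is 1; d/dx of alpha * (e^x - 1) is alpha * e^x
    return 1.0 if x > 0 else alpha * exp(x)

alpha = 1.0  # assumed example value

for x in (-3.0, -1.0, 2.0):
    print(x, elu(alpha, x), elu_gradient(alpha, x))
```

For negative inputs the gradient is small but not exactly zero, although it does shrink toward zero as x becomes very negative.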
Rectified Linear Unit, or ReLU, is the most common activation function in deep learning because it is simple to implement and helps with the vanishing gradient problem discussed earlier. Its output equation is very simple and looks linear, but it actually isn't: it passes the max of 0 and the input on to the next layer. In deep networks it often performs better than the sigmoid function.
    def ReLU(x):
        """
        returns:
            max of x and 0
        """
        return max(0.0, x)
ReLU is a nonlinear activation function, so it can be used with backpropagation of errors and allows multiple neurons to be activated. It is also faster to compute than sigmoid because it needs only a comparison rather than an exponential.
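A quick way to convince yourself that ReLU only looks linear: a linear function f would satisfy f(a + b) = f(a) + f(b), and ReLU does not. The sketch below also includes the gradient; using 0 as the gradient at exactly x = 0 is a common convention that is assumed here, since the derivative is not defined at that point.

```python
def relu(x):
    return max(0.0, x)

def relu_gradient(x):
    # Slope is 1 for positive inputs and 0 for negative inputs.
    # At x == 0 the derivative is undefined; returning 0 is an assumed convention.
    return 1.0 if x > 0 else 0.0

a, b = 3.0, -5.0
print(relu(a + b))        # relu(-2.0) -> 0.0
print(relu(a) + relu(b))  # 3.0 + 0.0 -> 3.0

print(relu_gradient(a), relu_gradient(b))
```

The two printed sums differ, so ReLU breaks additivity and is therefore nonlinear, even though each half of its graph is a straight line.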
Sigmoid is the "perfect" activation function we are looking for, in the sense that it is both nonlinear and has a continuous derivative. But being perfect does not mean it is the best activation function. The input is any real value and the output lies between 0 and 1. The larger the positive input, the closer the output is to 1.0; the more negative the input, the closer the output is to 0.0.
    from math import e

    def sigmoid(x):
        """
        args:
            x as a real value
        returns:
            a value between 0.0 and 1.0
        """
        # You can also use math.exp for exponent
        return 1.0 / (1 + e ** (-x))
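Because the function involves an exponential, it is more expensive to compute than ReLU. The output is smooth and bounded in (0, 1), so activations never blow up. The gradient, however, is at most 0.25 and becomes very small for large positive or negative inputs, which can slow learning in deep networks. Sigmoid is best used for classification purposes, typically in the output layer.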
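The shape of the sigmoid gradient can be sketched with the standard identity sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)). The helper name `sigmoid_gradient` is made up for this example.

```python
from math import exp

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

def sigmoid_gradient(x):
    # Derivative expressed in terms of the output itself: s * (1 - s)
    s = sigmoid(x)
    return s * (1.0 - s)

# The gradient peaks at x = 0 and shrinks quickly on both sides
for x in (-10.0, 0.0, 10.0):
    print(x, sigmoid(x), sigmoid_gradient(x))
```

At x = 0 the gradient reaches its maximum, and at x = 10 or x = -10 it is already tiny, which is where the network learns very slowly.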
The hyperbolic tangent, or tanh, function has the same S shape as the sigmoid function and, like it, takes any real value as input. Of course, there would not be two different activation functions that do exactly the same thing. The difference is that sigmoid gives an output in (0, 1), whereas tanh produces an output in (-1, 1). A small-looking but important difference is that tanh passes through the origin and sigmoid does not.
    from math import exp

    def tanh(x):
        """
        returns:
            output between (-1, 1)
        """
        # Parentheses are required so the whole numerator is divided by the whole denominator
        return (exp(x) - exp(-x)) / (exp(x) + exp(-x))
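The similarity between tanh and sigmoid is more than visual: tanh is a rescaled and shifted sigmoid, tanh(x) = 2 * sigmoid(2x) - 1. Here is a small sketch that checks this against Python's built-in `math.tanh`; the name `tanh_from_sigmoid` is made up for this example.

```python
from math import exp, tanh

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

def tanh_from_sigmoid(x):
    # Identity: tanh(x) = 2 * sigmoid(2x) - 1
    return 2.0 * sigmoid(2.0 * x) - 1.0

# Both columns should agree up to floating-point rounding
for x in (-2.0, 0.0, 0.5, 2.0):
    print(x, tanh(x), tanh_from_sigmoid(x))
```

Stretching sigmoid's (0, 1) range to (-1, 1) is what moves the curve so that it passes through the origin.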
Which Activation function should I use?
In brief, the choice of activation function depends on the architecture of your network: whether it's a single-layer perceptron, a multilayer perceptron, or a CNN.
Here is a general trend for selecting the activation function.
- Multilayer Perceptron - ReLU or Leaky ReLU
- CNN (Convolutional Neural Networks) - ReLU
- RNN (Recurrent Neural Networks) - tanh or sigmoid
This trend does not guarantee the best results for your network. You have to experiment with different activation functions and ask whether you need one after every hidden layer or only after some. That experimentation is what will get you the best results for your network.
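Leaky ReLU appears in the list above but is not described in this article. As a rough sketch, assuming the common definition in which negative inputs are scaled by a small slope instead of being set to zero (the default of 0.01 is an assumption, though a widely used one):

```python
def leaky_relu(x, negative_slope=0.01):
    """
    args:
        x is the input
        negative_slope is a small positive constant (0.01 is an assumed default)
    returns:
        x for positive inputs, negative_slope * x otherwise
    """
    return x if x > 0 else negative_slope * x

for x in (-5.0, 0.0, 5.0):
    print(x, leaky_relu(x))
```

Unlike plain ReLU, negative inputs still produce a small nonzero output and gradient, which is one reason it is worth trying alongside ReLU when you experiment.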
That's it for this article. I hope everything is crystal clear: what an activation function is, why it is used, what the different activation functions are, and how to select one. Thank you so much for reading the article. Cheers :)