For a complete understanding of neural networks, check out the post: Basic architecture of Neural Network, Training Process, and Hyper-parameters tuning
Mathematical model of a Neuron
An activation function is a mathematical function applied to the output of a neuron (or node) in a neural network. When a neural network receives input data and processes it through its layers, each neuron computes a weighted sum of the inputs, adds a bias term, and then applies an activation function to produce the neuron’s output. This output is then passed to the next layer of neurons as input. Using a simple neural network as an example, the role of an activation function is illustrated in the figure below:
A simple neural network
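The per-neuron computation described above (weighted sum, plus bias, through an activation) can be sketched in a few lines of NumPy. The input, weight, and bias values here are arbitrary illustrations, not taken from the figure:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(x, w, b, activation):
    """Compute a single neuron's output: activation(w . x + b)."""
    z = np.dot(w, x) + b      # pre-activation: weighted sum of inputs plus bias
    return activation(z)      # post-activation output passed to the next layer

x = np.array([0.5, -1.0, 2.0])   # example inputs
w = np.array([0.4, 0.3, -0.2])   # example weights
b = 0.1                          # example bias
out = neuron_output(x, w, b, sigmoid)
```

The same pattern repeats for every neuron in every layer; only the weights, bias, and choice of activation change.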
Activation functions serve two main purposes:
- Introduce Non-linearity: Without non-linearity, a neural network would behave like a linear model, no matter how deep it is. Activation functions allow the network to learn and represent complex, non-linear mappings between inputs and outputs.
- Control Neuron Activation: Activation functions control the firing behavior of neurons. Depending on the activation function’s output, a neuron might become activated (output a non-zero value) or remain inactive (output zero).
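The first point above (a deep network without non-linearity collapses to a linear model) can be verified directly: two stacked linear layers are exactly equivalent to one linear layer, by matrix associativity. A minimal check with random weights:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # first "layer" weights
W2 = rng.normal(size=(2, 4))   # second "layer" weights
x = rng.normal(size=3)         # an input vector

# Two linear layers with no activation in between...
deep_linear = W2 @ (W1 @ x)

# ...are exactly one linear layer with combined weights W2 @ W1.
collapsed = (W2 @ W1) @ x

assert np.allclose(deep_linear, collapsed)
```

Inserting any non-linear activation between the two layers breaks this equivalence, which is what lets depth add representational power.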
Typically, the same activation function is applied to all the hidden layers, while the output layer uses a different activation function based on the type of prediction the model aims to make. How to choose an activation function is explained later in this article.
Different types of Activation Functions:
Activation functions popularly used in neural network models are shown in the figure below.
Other activation functions include Parametric ReLU (PReLU), a parametric version of Leaky ReLU; Scaled Exponential Linear Unit (SELU); Swish; Hard Swish; and the Gaussian Error Linear Unit (GELU).
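Several of the ReLU-family functions named above differ only in how they treat negative inputs. A minimal sketch of a few of them (the default slope and scale values shown are common conventions, not prescribed by this article):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)                       # zero for z < 0

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)            # small slope for z < 0

def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * (np.exp(z) - 1))  # smooth negative tail

def swish(z):
    return z / (1.0 + np.exp(-z))                   # x * sigmoid(x)
```

PReLU is the same shape as Leaky ReLU except that `alpha` is learned during training rather than fixed.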
Pros and Cons of different Activation Functions
Which activation function to choose?
Activation function for hidden layers:
In practice, the choice of activation functions is as follows in order of priority:
- ReLU is the top choice: it is simple and fast, has a much lower run time, converges better, and does not suffer from vanishing gradient issues
- Leaky ReLU, PReLU, Maxout and ELU
- Sigmoid (not preferred anymore)
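The priority ordering above is easy to motivate numerically: for large positive inputs, sigmoid's gradient collapses toward zero while ReLU's stays at 1, so gradient signal survives backpropagation. A small comparison:

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)          # derivative of sigmoid: s(1 - s)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0  # derivative of ReLU (0 at the kink, by convention)

# For a large positive pre-activation, sigmoid's gradient has vanished
# while ReLU's has not:
print(sigmoid_grad(10.0))   # ~4.5e-05
print(relu_grad(10.0))      # 1.0
```

The trade-off is that ReLU's gradient is exactly zero for negative inputs (the "dying ReLU" issue), which is what Leaky ReLU, PReLU, and ELU address.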
Activation function for output layers:
- Sigmoid for binary classification, multi-label classification
- Softmax for multi-class classification
- Identity / linear function for regression problems
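The output-layer choices above can be sketched concretely: sigmoid squashes a single logit into a probability for binary or multi-label tasks, while softmax turns a vector of logits into a probability distribution over mutually exclusive classes. The logit values here are arbitrary examples:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(logits):
    # Subtract the max logit for numerical stability before exponentiating.
    exps = np.exp(logits - np.max(logits))
    return exps / exps.sum()

# Binary classification: one logit -> one probability.
p_positive = sigmoid(1.2)

# Multi-class classification: logits over 3 classes -> probabilities summing to 1.
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)

# Regression: the identity function, i.e. the raw output is the prediction.
```

Note that for multi-label classification, sigmoid is applied to each output independently, since the labels are not mutually exclusive.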
Some key terms to know w.r.t Activation functions:
In the context of neural networks, saturation refers to a situation where the output of an activation function or neuron becomes very close to the function’s minimum or maximum value (its asymptotic ends), so that small changes in the input have little to no effect on the output. Saturation becomes a critical issue in neural network training because it leads to the vanishing gradient problem, limiting the model’s information capacity and its ability to learn complex patterns in the data. When a unit is saturated, small changes to its incoming weights will hardly impact the unit’s output. Consequently, the weight-optimization algorithm has difficulty determining whether a weight change positively or negatively affected the network’s performance, and training ultimately reaches a standstill, preventing any further learning from taking place.
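Saturation is easy to see with sigmoid's derivative: it peaks at 0.25 when the input is 0 and is effectively zero deep in either tail, exactly the regime where weight changes stop moving the output. A quick illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # derivative of sigmoid: s(1 - s)

# Near z = 0 the gradient is at its maximum (0.25). Deep in the tails,
# the unit is saturated and the gradient is effectively zero.
print(sigmoid_grad(0.0))    # 0.25
print(sigmoid_grad(10.0))   # ~4.5e-05
print(sigmoid_grad(-10.0))  # ~4.5e-05
```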
Related Question: What do you mean by vanishing gradient and why is that a problem?
Vanishing gradient refers to a problem that can occur during the training of deep neural networks, where the gradients of the loss function with respect to the model’s parameters become extremely small (close to zero) as they are backpropagated through the layers of the network. When the gradients become too small, the model’s weights are not updated effectively. As a result, training may stagnate or become extremely slow, making it difficult for the network to learn complex patterns in the data.
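Why depth makes this worse: the backpropagated gradient is, roughly, a product of one local derivative per layer. With sigmoid, each factor is at most 0.25, so even in the best case the product shrinks geometrically with depth. A simplified sketch (treating the per-layer factor as just the activation's derivative, ignoring weight terms):

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

# Chain rule through 20 layers: multiply one local derivative per layer.
# Even at sigmoid's *maximum* derivative (0.25, at z = 0), the gradient
# reaching the early layers has all but vanished.
grad = 1.0
for layer in range(20):
    grad *= sigmoid_grad(0.0)   # best case: 0.25 per layer

print(grad)   # 0.25**20, roughly 9e-13
```

This is why non-saturating activations like ReLU (derivative 1 for positive inputs) mitigate the problem in deep networks.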
Related Question: Why is Zero-centered output preferred for an activation function?
Optimization algorithms like gradient descent tend to work more efficiently when gradients are centered around zero. Zero-centered activations keep the mean activation value near zero, helping prevent gradients from becoming too small (vanishing gradients) or too large (exploding gradients). This contributes to smoother and faster convergence during training. In addition, zero-centered output helps with bias mitigation by ensuring that neurons start with balanced initial activations, which can lead to more stable and unbiased updates to the model’s weights. However, note that while the zero-centered property is desirable, it is not strictly necessary.
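The contrast is visible by comparing sigmoid and tanh on the same zero-mean inputs: sigmoid's outputs are all positive (mean near 0.5), while tanh's are zero-centered (mean near 0). A quick empirical check with random inputs:

```python
import numpy as np

rng = np.random.default_rng(42)
z = rng.normal(size=100_000)        # zero-mean pre-activations

sigmoid_out = 1.0 / (1.0 + np.exp(-z))
tanh_out = np.tanh(z)

# Sigmoid maps everything into (0, 1), so its mean output sits near 0.5;
# tanh maps into (-1, 1) symmetrically, so its mean output sits near 0.
print(sigmoid_out.mean())   # ~0.5
print(tanh_out.mean())      # ~0.0
```

When every activation is positive, the gradients of a layer's weights all share the same sign on a given example, which constrains the directions weight updates can take; zero-centered outputs avoid this.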