– What is an activation function? Discuss the different types and their pros and cons
– What is Rectified Linear Unit (ReLU) activation function? Discuss its advantages and disadvantages
A zero-centered output for an activation function is preferred for several reasons:
- Faster training: zero-centered outputs bring the gradient closer to the natural gradient, which speeds up convergence. Keeping the mean activation around zero also prevents gradients from becoming too small (vanishing gradients) or too large (exploding gradients), contributing to smoother and faster training. Optimization algorithms like gradient descent tend to work more efficiently when gradients are centered around zero.
- Bias mitigation: units with a non-zero mean activation introduce a bias shift into the next layer's inputs, which can distort how the network learns. A zero-centered output mitigates this shift, so neurons start with balanced activations, and balanced activations lead to more stable, less biased weight updates during training.
- Symmetry and weight initialization: zero-centered activation functions are symmetric around the origin (i.e., f(0) = 0). This simplifies weight initialization, because small random weights then produce small, balanced activations from the first forward pass, which supports more stable and efficient training.
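The first point can be checked numerically. The sketch below (pure Python, written for this answer) compares the mean output of sigmoid, which is not zero-centered, against tanh, which is, over a symmetric range of pre-activations:

```python
import math

def sigmoid(x):
    # Logistic sigmoid: outputs lie in (0, 1), so the mean output is positive.
    return 1.0 / (1.0 + math.exp(-x))

# Pre-activations spread symmetrically around zero, as you would expect
# after a typical zero-mean weight initialization.
xs = [i / 10.0 for i in range(-50, 51)]

mean_sigmoid = sum(sigmoid(x) for x in xs) / len(xs)
mean_tanh = sum(math.tanh(x) for x in xs) / len(xs)

print(f"mean sigmoid output: {mean_sigmoid:.3f}")  # ~0.5, not zero-centered
print(f"mean tanh output:    {mean_tanh:.3f}")     # ~0.0, zero-centered
```

Because sigmoid's outputs are all positive, every downstream weight gradient in a layer shares the same sign, which is one reason tanh often trains more smoothly than sigmoid.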
However, note that while the zero-centered property is desirable, it is not strictly necessary. Non-zero-centered activation functions (such as ReLU) are still widely used in neural network training. They are usually combined with normalization layers such as batch norm, layer norm, or weight norm, which re-center the data to zero mean before it passes through the activation function.
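As a minimal sketch of that normalize-then-activate pattern (a batch-norm-style step without the learnable scale and shift parameters), the code below zero-centers a batch of pre-activations before applying ReLU; the values and function names are illustrative, not from any particular framework:

```python
import math

def relu(x):
    return max(0.0, x)

def standardize(batch, eps=1e-5):
    # Zero-center and rescale a batch of pre-activations to unit variance,
    # as a batch-norm layer would (minus the learnable gamma/beta).
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [(x - mean) / math.sqrt(var + eps) for x in batch]

# Hypothetical pre-activations with a strong positive shift (non-zero mean).
pre_acts = [2.1, 3.4, 2.8, 4.0, 3.1]

normalized = standardize(pre_acts)
activations = [relu(x) for x in normalized]

batch_mean = sum(normalized) / len(normalized)
print(f"mean after normalization: {batch_mean:.6f}")  # ~0: zero-centered input to ReLU
```

This is why the non-zero-centered output of ReLU is rarely a problem in practice: each layer's input is re-centered before the activation is applied.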