Related Questions:
– Explain the basic architecture of a Neural Network, model training and key hyper-parameters
– Discuss the problems associated with saturation in neural network training
– What is the ReLU activation function? Discuss its advantages and disadvantages

Source: SuperAnnotate
Vanishing Gradient problem
The vanishing gradient problem can occur during the training of deep neural networks (DNNs), when the gradients of the loss function with respect to the model’s parameters become extremely small (close to zero) as they are backpropagated through the layers of the network. This impairs learning: when the gradients become too small, the model’s weights are no longer updated effectively. As a result, training may stagnate or become extremely slow, making it difficult for the network to learn complex patterns in the data.
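As a rough numerical illustration (a minimal sketch, not taken from the cited source), the gradient reaching an early layer is, by the chain rule, a product of one factor per layer; if each factor is well below 1, as with a saturating activation, the product shrinks roughly geometrically with depth:

```python
import numpy as np

def sigmoid_derivative(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)  # never exceeds 0.25

rng = np.random.default_rng(0)
depth = 30
grad = 1.0  # gradient arriving at the last hidden layer
for _ in range(depth):
    # Each layer contributes one chain-rule factor; with a saturating
    # activation this factor is small, so the product keeps shrinking.
    grad *= sigmoid_derivative(rng.normal())

# Prints a number many orders of magnitude below 1.
print(f"Gradient factor after {depth} layers: {grad:.3e}")
```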

Source: “Vanishing and Exploding Gradients in Neural Network Models” article by Katherine (Yi) Li
Activation functions like sigmoid and hyperbolic tangent (tanh) have saturated regions and are more prone to the vanishing gradient problem in DNN training. Activation functions like ReLU and its variants can alleviate the problem since they do not saturate for positive inputs: the derivative of ReLU is either 0 or 1. During backpropagation, when gradients are multiplied many times to obtain the gradients of the lower layers, these factors of 0 or 1 do not progressively shrink the product, leading to more effective and faster training.
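To make the contrast concrete, the short sketch below (illustrative only; the positive pre-activations are an assumption chosen to show the "active" side of ReLU) compares the per-layer derivative factors of sigmoid and ReLU over 30 layers:

```python
import numpy as np

def sigmoid_derivative(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)          # at most 0.25, so deep products shrink fast

def relu_derivative(x):
    return (x > 0).astype(float)  # exactly 1 for positive inputs, 0 otherwise

rng = np.random.default_rng(1)
# Positive pre-activations, i.e. units on the "active" side of ReLU
# (a negative pre-activation would instead zero out that path).
pre_activations = rng.uniform(0.1, 2.0, size=30)

print("sigmoid factor:", np.prod(sigmoid_derivative(pre_activations)))  # tiny
print("ReLU factor:   ", np.prod(relu_derivative(pre_activations)))     # 1.0
```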

Source: “Advantages of ReLU over Sigmoid” thread, Stackexchange
Other techniques used to alleviate the vanishing gradient issue are: (a) smart weight initialization schemes such as Xavier initialization and He initialization, (b) batch normalization, and (c) skip connections, as used in residual networks
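A brief PyTorch sketch combining these three ideas is shown below (the layer width, module names, and the choice of PyTorch itself are illustrative assumptions, not taken from the cited sources):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Small block combining He initialization, batch normalization,
    and a skip (residual) connection."""
    def __init__(self, dim: int = 128):   # width 128 is an arbitrary choice
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)
        self.bn1 = nn.BatchNorm1d(dim)     # (b) batch normalization
        self.bn2 = nn.BatchNorm1d(dim)
        # (a) He (Kaiming) initialization, suited to ReLU activations;
        # nn.init.xavier_uniform_ would be the analogue for tanh/sigmoid.
        nn.init.kaiming_normal_(self.fc1.weight, nonlinearity="relu")
        nn.init.kaiming_normal_(self.fc2.weight, nonlinearity="relu")

    def forward(self, x):
        out = torch.relu(self.bn1(self.fc1(x)))
        out = self.bn2(self.fc2(out))
        # (c) skip connection: gradients can flow through the identity path.
        return torch.relu(out + x)

block = ResidualBlock()
y = block(torch.randn(32, 128))            # batch of 32 illustrative inputs
print(y.shape)                             # torch.Size([32, 128])
```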
Exploding Gradient problem
In this problem, the gradients of the network’s cost function grow exponentially during training. When the gradient values become excessively large, they cause very large updates to the weights; the weights can overflow to NaN (not a number) or infinity, leading to numerical instability.
Similar to the vanishing gradient problem, exploding gradients occur more often when the tanh or sigmoid activation function is used in the hidden layers, since the output of these activations tends to be concentrated towards the extreme ends of the curve (0 or 1 for the sigmoid, -1 or 1 for the tanh). The exploding gradient problem is particularly pronounced in deep networks with many layers, where the gradients are computed using the chain rule and can accumulate multiplicatively.
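As a minimal illustration of this multiplicative accumulation (a sketch of the mechanism, not drawn from the sources above), per-layer chain-rule factors that are consistently larger than 1 make the gradient grow geometrically with depth:

```python
depth = 50
grad = 1.0                # gradient arriving at the output layer
for _ in range(depth):
    grad *= 1.5           # stand-in chain-rule factor (e.g., a large weight)

print(f"Gradient factor after {depth} layers: {grad:.3e}")  # ~6.4e+08
# In much deeper stacks, or with larger factors, this overflows to inf/NaN.
```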
The following techniques are commonly used to prevent the exploding gradient problem (a brief code sketch follows the list):
- Gradient clipping: This technique involves clipping the gradients during backpropagation to ensure that they do not exceed a specified threshold
- Weight regularization: Adding a regularization term to the loss function can help to prevent the weights from becoming too large
- Proper weight initialization: Choosing the appropriate strategy for initializing the weights can help prevent gradients from exploding at the start of training
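The PyTorch sketch below combines these three remedies (the model architecture, the clipping threshold of 1.0, and the hyper-parameter values are illustrative placeholders, not recommendations from the sources):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))

# Proper weight initialization (here Xavier; He init is the usual choice for ReLU).
for module in model.modules():
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
        nn.init.zeros_(module.bias)

# Weight regularization via an L2 penalty (weight_decay) in the optimizer.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

inputs, targets = torch.randn(32, 64), torch.randn(32, 1)  # dummy batch
loss = nn.functional.mse_loss(model(inputs), targets)

optimizer.zero_grad()
loss.backward()
# Gradient clipping: rescale gradients so their global norm stays <= 1.0.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```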
Video Explanation
- In the two-part video series by DeepLearning.AI, Professor Andrew Ng explains the issues of vanishing and exploding gradients using an illustrative example. This approach not only builds an intuitive understanding of the problem but also highlights the significance of such issues within neural networks. The second video suggests a solution for dealing with this problem (Runtime: 6 min each)
- In the two-part video series by DeepLizard, the presenter explains the issues of vanishing and exploding gradients and then suggests a solution through the use of proper weight initialization techniques (Runtime: 7 + 10 = 17 mins)