
Machine Learning Resources

What are the vanishing and exploding gradient problems, and how are they typically addressed?


Related Questions:
Explain the basic architecture of a Neural Network, model training and key hyper-parameters
Discuss the problems associated with saturation in neural network training
What is ReLU activation function? Discuss its advantages and disadvantages

Vanishing and Exploding Gradient
Source: SuperAnnotate

Vanishing Gradient problem

The vanishing gradient problem can occur when training deep neural networks (DNNs): the gradients of the loss function with respect to the model’s parameters become extremely small (close to zero) as they are backpropagated through the layers of the network. Because each weight update is proportional to its gradient, the weights of the early layers are then barely updated. As a result, training may stagnate or become extremely slow, making it difficult for the network to learn complex patterns in the data.
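The shrinking effect can be illustrated with a toy calculation. The sketch below assumes a chain of scalar layers that all share the same weight and pre-activation value, which is a simplification and not how a real network behaves; the function names are illustrative. It multiplies the per-layer factor (weight times sigmoid derivative) once per layer, as the chain rule does.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    # Derivative of the sigmoid; its maximum value is 0.25, reached at x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)


def backprop_factor(depth: int, pre_activation: float = 0.0, weight: float = 1.0) -> float:
    """Product of the per-layer factors weight * sigmoid'(z) over `depth` layers."""
    factor = 1.0
    for _ in range(depth):
        factor *= weight * sigmoid_grad(pre_activation)
    return factor


# Even in the best case for the sigmoid (z = 0), the factor shrinks by 4x per layer.
for depth in (1, 5, 10, 20):
    print(f"depth={depth:2d}  gradient factor={backprop_factor(depth):.3e}")
```

Under these assumptions the factor falls below one millionth after ten layers, which is why the earliest layers receive almost no learning signal.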

Title: Illustrating saturation region and vanishing gradient problem (derivative close to 0) for a Sigmoid activation function
Source: “Vanishing and Exploding Gradients in Neural Network Models” article by Katherine (Yi) Li

Activation functions such as sigmoid and hyperbolic tangent (tanh) have saturating regions where the derivative is close to zero, which makes them more prone to the vanishing gradient problem in DNN training. ReLU and its variants can alleviate the problem because they do not saturate for positive inputs: the derivative of ReLU is 1 for positive inputs and 0 for negative inputs. During backpropagation, the gradients of the lower layers are obtained by multiplying many such derivatives together. A factor of 1 passes the gradient through unchanged instead of shrinking it, which leads to faster and more effective training. A unit whose input stays negative still receives a zero gradient, which is the case that ReLU variants are designed to handle.

Title: Comparing gradients of Sigmoid, tanH and ReLU
Source: “Advantages of ReLU over Sigmoid” thread, Stackexchange
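The comparison of the three derivatives can also be checked numerically. The following is a minimal sketch using only the Python standard library; the helper names and the sample inputs are chosen for illustration.

```python
import math


def sigmoid_grad(x: float) -> float:
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


def tanh_grad(x: float) -> float:
    return 1.0 - math.tanh(x) ** 2


def relu_grad(x: float) -> float:
    # Derivative of max(0, x); the value at exactly x = 0 is set to 0 by convention here.
    return 1.0 if x > 0 else 0.0


# Sigmoid and tanh derivatives collapse towards 0 for large |x| (saturation),
# while the ReLU derivative stays at 1 for every positive input.
for x in (-5.0, -1.0, 0.5, 1.0, 5.0):
    print(
        f"x={x:+.1f}  sigmoid'={sigmoid_grad(x):.4f}  "
        f"tanh'={tanh_grad(x):.4f}  relu'={relu_grad(x):.1f}"
    )
```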

Other techniques used to alleviate the vanishing gradient problem include:

  • Careful weight initialization, such as Xavier initialization and He initialization
  • Batch normalization
  • Skip connections and residual connections
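As a rough illustration of the initialization schemes mentioned above, the sketch below computes the standard deviations used by the normal-distribution forms of Xavier (Glorot) and He initialization and draws a weight matrix from them. The helper names, layer sizes, and seed are assumptions made for this example; deep learning frameworks ship their own initializers.

```python
import math
import random


def xavier_std(fan_in: int, fan_out: int) -> float:
    # Xavier/Glorot normal: variance 2 / (fan_in + fan_out), often paired with tanh or sigmoid.
    return math.sqrt(2.0 / (fan_in + fan_out))


def he_std(fan_in: int) -> float:
    # He normal: variance 2 / fan_in, often paired with ReLU.
    return math.sqrt(2.0 / fan_in)


def init_weights(fan_in: int, fan_out: int, std: float, seed: int = 0) -> list[list[float]]:
    """Return a fan_out x fan_in matrix sampled from N(0, std^2)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


weights = init_weights(fan_in=256, fan_out=128, std=he_std(256))
print(f"Xavier std: {xavier_std(256, 128):.4f}, He std: {he_std(256):.4f}")
```

Both schemes scale the initial weights by the layer width so that the signal variance stays roughly constant from layer to layer, instead of shrinking or growing with depth.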

Exploding Gradient problem

In the exploding gradient problem, the gradients of the network’s cost function grow exponentially as they are backpropagated through the layers. Excessively large gradients cause excessively large updates to the weights; the weights can overflow to infinity or become NaN (not a number), leading to numerical instability.

Like the vanishing gradient, the exploding gradient is a consequence of the chain rule: the gradient of an early layer is a product of per-layer factors (weights multiplied by activation derivatives). If these factors are consistently larger than 1 in magnitude, for example because the weights are large, the product grows exponentially with depth. Saturating activations such as sigmoid and tanh are not the main cause, since their derivatives are at most 0.25 and 1 respectively and shrink further towards the extreme ends of the curve. The exploding gradient problem is therefore particularly pronounced in deep networks with many layers, where the gradients accumulate multiplicatively.
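The multiplicative accumulation can be sketched with a toy chain in which every layer contributes the same factor to the gradient. The factor of 1.5 and the depths below are arbitrary values chosen for illustration, and `chain_gradient` is a hypothetical helper.

```python
import math


def chain_gradient(depth: int, per_layer_factor: float) -> float:
    """Multiply the same per-layer factor `depth` times, as the chain rule would."""
    grad = 1.0
    for _ in range(depth):
        grad *= per_layer_factor
    return grad


# A factor only slightly above 1 already gives a very large gradient after 50 layers.
print(f"50 layers:   {chain_gradient(50, 1.5):.3e}")

# With enough layers the product exceeds the float range and overflows to infinity.
huge = chain_gradient(2000, 1.5)
print(f"2000 layers: {huge}  (is infinite: {math.isinf(huge)})")

# Arithmetic on infinities then produces NaN, which spreads through later updates.
print(f"inf - inf = {huge - huge}")
```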

The following techniques are commonly used to prevent the exploding gradient problem:

  • Gradient clipping: This technique involves clipping the gradients during backpropagation to ensure that they do not exceed a specified threshold
  • Weight regularization: Adding a regularization term to the loss function can help to prevent the weights from becoming too large
  • Proper weight initialization: Choosing the appropriate strategy for initializing the weights can help prevent gradients from exploding at the start of training
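Gradient clipping, the first technique in the list above, can be sketched as follows. This version rescales the whole gradient vector when its L2 norm exceeds a threshold, which is one common variant (clipping each value independently is another). The function name and the threshold are assumptions for this example.

```python
import math


def clip_by_norm(grads: list[float], max_norm: float) -> list[float]:
    """Rescale `grads` so that its L2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        # Already within the threshold (this also covers an all-zero gradient).
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


# A gradient with norm 50 is scaled down to norm 5; its direction is unchanged.
clipped = clip_by_norm([30.0, 40.0], max_norm=5.0)
print(clipped)

# A gradient that is already small passes through untouched.
print(clip_by_norm([0.3, 0.4], max_norm=5.0))
```

Clipping by norm keeps the direction of the update and only limits its size, so a single unusually large gradient cannot throw the weights far off course.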

Video Explanation

  • In a two-part video series by DeepLearning.AI, Professor Andrew Ng explains vanishing and exploding gradients using an illustrative example. This approach helps build an intuitive understanding of the problem and highlights its significance in neural networks. The second video suggests a solution to the problem (Runtime: 6 min each)
  • In a two-part video series by DeepLizard, the presenter explains vanishing and exploding gradients and then suggests a solution through the use of proper weight initialization techniques (Runtime: 7 + 10 = 17 min)