Vanishing and exploding gradients are problems often encountered when training deep neural networks. During backpropagation, the derivatives of the loss with respect to the parameters are propagated backwards from the output layer through the hidden layers. If the gradient becomes very small along the way, the parameters of the earlier layers are updated barely or not at all, and training effectively comes to a halt before a viable solution is reached. This is referred to as the vanishing gradient.
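The shrinking effect can be illustrated with a deliberately simplified sketch. It assumes a chain of scalar sigmoid layers, each computing a = sigmoid(w * a_prev), with every weight set to 1.0. The function name `input_gradient` and these values are illustrative assumptions, not part of any library. By the chain rule, the gradient of the output with respect to the input is a product of one factor per layer, and with w = 1 each factor is at most 0.25.

```python
import math


def sigmoid(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))


def input_gradient(x, weights):
    """Return d(output)/d(input) for a chain of scalar sigmoid layers.

    Each layer computes a = sigmoid(w * a_prev). By the chain rule, the
    gradient is the product of w * sigmoid'(z) over all layers.
    (Hypothetical helper, for illustration only.)
    """
    a = x
    grad = 1.0
    for w in weights:
        z = w * a
        a = sigmoid(z)
        grad *= w * a * (1.0 - a)  # sigmoid'(z) = a * (1 - a)
    return grad


# The gradient shrinks roughly geometrically as layers are added.
for depth in (2, 5, 10, 20):
    g = input_gradient(0.5, [1.0] * depth)
    print(f"depth={depth:2d}  gradient={g:.3e}")
```

Real networks use weight matrices rather than scalars, so this is only a toy model, but the same repeated multiplication of small factors is what starves the early layers of gradient.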
On the other hand, if the gradient grows very large, the parameters are updated by an excessive amount at each training step, which makes training unstable and can prevent it from converging. This phenomenon is referred to as the exploding gradient. The vanishing gradient in particular occurs more often when the tanh or sigmoid activation function is used in the hidden layers, since the outputs of these activations tend to be concentrated towards the extreme ends of the curve (0 and 1 for the sigmoid, -1 and 1 for tanh), which are the saturated regions where the derivative is smallest and approaches zero. Even at its peak, the derivative of the sigmoid is only 0.25, so every sigmoid layer scales the backpropagated signal down.
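To make the saturation argument concrete, the following minimal sketch evaluates the derivatives of the sigmoid and tanh at a few pre-activation values. The helper names `sigmoid_prime` and `tanh_prime` and the sample inputs are chosen here for illustration. It relies only on the standard identities sigmoid'(z) = sigmoid(z)(1 - sigmoid(z)) and tanh'(z) = 1 - tanh(z)^2.

```python
import math


def sigmoid(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    """Derivative of the sigmoid: s * (1 - s), with a maximum of 0.25 at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)


def tanh_prime(z):
    """Derivative of tanh: 1 - tanh(z)^2, with a maximum of 1 at z = 0."""
    return 1.0 - math.tanh(z) ** 2


# Both derivatives peak at z = 0 and decay towards zero as |z| grows,
# i.e. as the activation output approaches its extreme values.
for z in (0.0, 2.0, 5.0, 10.0):
    print(f"z={z:5.1f}  sigmoid'={sigmoid_prime(z):.2e}  tanh'={tanh_prime(z):.2e}")
```

Both functions are symmetric in z, so negative inputs of the same magnitude give the same derivatives. Once a unit's output sits near 0 or 1 (sigmoid) or near -1 or 1 (tanh), almost no gradient passes through it.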