The vanishing gradient problem can occur when training deep neural networks (DNNs): the gradients of the loss function with respect to the model’s parameters become extremely small (close to zero) as they are backpropagated through the layers of the network. When the gradients are this small, the weights, particularly those in the layers closest to the input, are barely updated. As a result, training may stagnate or become extremely slow, making it difficult for the network to learn complex patterns in the data.
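To make the mechanism concrete, the chain rule can be written out for a simple feed-forward network. The notation here is an assumption introduced for illustration and does not come from the text above: layer l computes h_l = φ(z_l) with z_l = W_l h_{l-1} + b_l, the network has n layers, and L is the loss.

```latex
% Gradient reaching the first layer: a product of n-1 layer Jacobians
\frac{\partial L}{\partial h_1}
  = \frac{\partial L}{\partial h_n}
    \prod_{l=2}^{n} \frac{\partial h_l}{\partial h_{l-1}},
\qquad
\frac{\partial h_l}{\partial h_{l-1}}
  = \operatorname{diag}\!\big(\varphi'(z_l)\big)\, W_l .

% If every factor has norm at most \gamma < 1, the bound decays geometrically in depth
\left\lVert \frac{\partial L}{\partial h_1} \right\rVert
  \le \gamma^{\,n-1}
      \left\lVert \frac{\partial L}{\partial h_n} \right\rVert .
```

Under that assumption the gradient shrinks at least geometrically with the number of layers, which is why the layers closest to the input are affected most.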

Activation functions such as sigmoid and hyperbolic tangent (tanh) saturate for inputs of large magnitude, where their derivatives are close to zero, which makes them prone to vanishing gradients in DNN training. Even at its maximum, the sigmoid derivative is only 0.25. Activation functions such as ReLU and its variants can alleviate the problem because they do not saturate for positive inputs. The derivative of ReLU is either 0 (for negative inputs) or 1 (for positive inputs). During backpropagation, the gradients of the lower layers are obtained by multiplying many such derivatives together. Along a path of active units each ReLU factor is exactly 1, so the activation function does not shrink the gradient, which leads to more effective and faster training.
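The effect of multiplying many activation derivatives can be checked numerically. The following is a minimal sketch, not a full backpropagation: it assumes every layer sees the same pre-activation value and that all weights equal 1, so only the activation derivatives contribute. The helper names (`sigmoid_grad`, `relu_grad`, `chained_factor`) are hypothetical and chosen for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid; its maximum value is 0.25, reached at x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise.
    return np.where(x > 0, 1.0, 0.0)

def chained_factor(local_grad, pre_activation, depth):
    # Product of `depth` identical activation derivatives.
    # Assumption: weights are 1, so they do not change the product.
    return float(local_grad(np.asarray(pre_activation)) ** depth)

depth = 20
sig_best = chained_factor(sigmoid_grad, 0.0, depth)  # best case for sigmoid
relu_active = chained_factor(relu_grad, 1.0, depth)  # units that are active
relu_dead = chained_factor(relu_grad, -1.0, depth)   # units that are inactive

print(sig_best)     # about 9.1e-13, even in the best case
print(relu_active)  # 1.0
print(relu_dead)    # 0.0
```

Even in the sigmoid's best case the factor after 20 layers is negligible, while an active ReLU path passes the gradient through unchanged. The last value shows the other side of the 0-or-1 property: a path through an inactive ReLU unit carries no gradient at all.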

Other techniques used to alleviate the vanishing gradient problem are: (a) careful weight initialization schemes such as Xavier (Glorot) initialization and He initialization, (b) batch normalization, and (c) skip connections, including residual connections.
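Two of these techniques, (a) and (c), are short enough to sketch in NumPy. This is an illustration under stated assumptions, not a reference implementation: the function names are hypothetical, the Xavier variant shown is the uniform one, the He variant is the normal one, and the residual block assumes a square weight matrix so that input and output shapes match. Batch normalization is omitted for brevity.

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng):
    # Xavier/Glorot uniform: weights drawn from U(-limit, limit),
    # giving variance 2 / (fan_in + fan_out).
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out, rng):
    # He initialization: zero-mean normal with variance 2 / fan_in,
    # commonly paired with ReLU activations.
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def residual_block(x, w):
    # Skip connection: the input is added back to the transformed signal,
    # so the gradient always has an identity path around the layer.
    return x + np.maximum(0.0, x @ w)

rng = np.random.default_rng(0)
w_xavier = xavier_uniform(256, 128, rng)
w_he = he_normal(512, 512, rng)
out = residual_block(np.ones((4, 512)), w_he)
print(w_xavier.shape, w_he.shape, out.shape)
```

Both initializers scale the weights by the layer width so that signal and gradient magnitudes stay roughly constant from layer to layer, and the residual block adds an identity term to the layer's Jacobian.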