What are some ways to address the vanishing/exploding gradient issue?

While the issue can occur even in well-tuned networks, the following options have been shown to reduce the risk of vanishing or exploding gradients.

  • Use the ReLU activation rather than tanh or sigmoid in the hidden layers, since ReLU's derivative does not saturate the way those functions do (see the sketch after this list)
  • Use something more principled than plain random initialization for the weights and biases (e.g., Xavier/Glorot or He initialization)
  • Reduce the number of hidden layers (i.e., simplify the network architecture), since each additional layer contributes another multiplicative factor to the backpropagated gradient
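
As a minimal sketch of the first two points (assuming PyTorch; the layer sizes and class name here are arbitrary illustrations), the network below uses ReLU in its hidden layers and He (Kaiming) initialization, which scales the initial weights to keep activation and gradient magnitudes roughly stable across layers:

```python
import torch.nn as nn

class SimpleNet(nn.Module):
    """Small feed-forward net illustrating ReLU + He initialization."""

    def __init__(self, in_features=32, hidden=64, out_features=10):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(in_features, hidden),
            nn.ReLU(),                       # ReLU instead of tanh/sigmoid
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, out_features)  # kept shallow: two hidden layers
        )
        # He/Kaiming initialization, designed for ReLU units, instead of
        # plain random values; biases start at zero.
        for layer in self.layers:
            if isinstance(layer, nn.Linear):
                nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
                nn.init.zeros_(layer.bias)

    def forward(self, x):
        return self.layers(x)
```

If the hidden activations were tanh or sigmoid instead, Xavier/Glorot initialization (e.g., `nn.init.xavier_uniform_`) would be the more common pairing.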