Vanishing and exploding gradients can occur even in well-tuned networks, but the following options have been shown to reduce the risk of encountering them.
- Use the ReLU activation rather than tanh or sigmoid in the hidden layers
- Use a principled initialization scheme (e.g. Xavier/Glorot or He initialization) rather than naive random initialization for weights and biases
- Reduce the number of hidden layers (i.e., simplify the network architecture) so gradients pass through fewer multiplicative steps
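The first two points can be illustrated with a minimal NumPy sketch (the layer width, depth, and initialization constants below are illustrative choices, not values from the text): with He initialization and ReLU, the scale of the activations stays roughly constant through a deep stack of layers, while a naive small random initialization paired with a sigmoid-like activation shrinks the signal toward zero layer by layer, which is the forward-pass symptom of a vanishing gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def final_std(init, act, depth=30, width=256):
    """Push a random input through `depth` dense layers and
    return the standard deviation of the final activations."""
    x = rng.normal(size=width)
    for _ in range(depth):
        W = init(width)        # fresh weight matrix per layer
        x = act(W @ x)
    return x.std()

# He initialization: std = sqrt(2 / fan_in), designed to pair with ReLU.
he_init = lambda n: rng.normal(0.0, np.sqrt(2.0 / n), size=(n, n))

# Naive small random initialization, paired with a (centered) sigmoid.
naive_init = lambda n: rng.normal(0.0, 0.01, size=(n, n))
centered_sigmoid = lambda x: sigmoid(x) - 0.5

stable = final_std(he_init, relu)
shrunk = final_std(naive_init, centered_sigmoid)

print(f"He + ReLU final std:        {stable:.3g}")   # stays near the input scale
print(f"naive + sigmoid final std:  {shrunk:.3g}")   # collapses toward zero
```

The same multiplicative shrinking (or growing) acts on gradients in the backward pass, which is why the choice of activation and initialization matters before reaching for architectural changes.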