In a deep network with many hidden layers, there may be hundreds or thousands of weights and bias terms, and computing the derivative of the loss with respect to every one of them on every update can be computationally intensive. It is therefore often desirable to make the update and learning process more efficient, and the following are some effective options.
- Adaptive Gradient Descent (AdaGrad): Instead of using a single constant learning rate for every parameter throughout training, this approach adapts the learning rate of each parameter individually over the course of training. All parameters start from the same base learning rate, but each parameter's effective step size is divided by the square root of the accumulated sum of its squared gradients, so parameters that have consistently received large gradients take progressively smaller steps. In other words, it tries to take an appropriately sized step for each parameter on each iteration so that the descent reaches the minimum along a more efficient route than plain gradient descent (see the first sketch after this list).
- RMSProp: This algorithm keeps an Exponentially Weighted Moving Average of the squared gradient for each parameter and divides the learning rate by its square root. Unlike AdaGrad, which accumulates squared gradients over the entire run and can shrink the step size too aggressively, the moving average forgets old gradients, so the step size remains useful throughout training. The per-parameter scaling also smooths out the oscillations of the gradient so that the algorithm takes a more direct path toward the minimum (see the second sketch after this list).
- Adam: The Adam optimizer builds on both Adaptive Gradient Descent and RMSProp by maintaining exponentially weighted estimates of the first and second moments of the gradient. The first moment smooths the direction of the update so the optimization moves along a more direct path, while the second moment scales the step size for each parameter, making the method robust to noisy or sparse gradients. It is widely regarded as a strong default optimizer for training neural networks (see the final sketch after this list).
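
The first sketch below illustrates the AdaGrad update described above. It is a minimal NumPy implementation on a toy quadratic loss; the learning rate, epsilon, step count, and the loss itself are illustrative assumptions rather than values from the text.

```python
import numpy as np

def adagrad(grad_fn, theta, lr=0.1, eps=1e-8, steps=100):
    """Minimal AdaGrad sketch: per-parameter scaling by accumulated squared gradients."""
    accum = np.zeros_like(theta)                      # running sum of squared gradients
    for _ in range(steps):
        g = grad_fn(theta)
        accum = accum + g ** 2                        # grows fastest for large-gradient parameters
        theta = theta - lr * g / (np.sqrt(accum) + eps)  # so their effective step shrinks fastest
    return theta

# Toy quadratic loss f(theta) = theta_0^2 + 10 * theta_1^2 with analytic gradient
grad = lambda th: np.array([2.0 * th[0], 20.0 * th[1]])
print(adagrad(grad, np.array([5.0, 5.0])))
```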
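
The second sketch shows the RMSProp update on the same toy problem. The decay rate of 0.9 and the other hyperparameters are common illustrative defaults, assumed here rather than taken from the text.

```python
import numpy as np

def rmsprop(grad_fn, theta, lr=0.01, beta=0.9, eps=1e-8, steps=200):
    """Minimal RMSProp sketch: EWMA of squared gradients instead of a full sum."""
    avg_sq = np.zeros_like(theta)                     # exponentially weighted average of g^2
    for _ in range(steps):
        g = grad_fn(theta)
        avg_sq = beta * avg_sq + (1 - beta) * g ** 2  # old gradients decay away
        theta = theta - lr * g / (np.sqrt(avg_sq) + eps)
    return theta

grad = lambda th: np.array([2.0 * th[0], 20.0 * th[1]])
print(rmsprop(grad, np.array([5.0, 5.0])))
```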
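
The final sketch is a minimal Adam update combining the two ideas: a first-moment average of the gradient (direction smoothing) and a second-moment average of the squared gradient (per-parameter scaling), with bias correction for the zero-initialized moments. The hyperparameters follow commonly cited defaults and are assumptions for illustration.

```python
import numpy as np

def adam(grad_fn, theta, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimal Adam sketch: first and second moment estimates with bias correction."""
    m = np.zeros_like(theta)                          # first moment (EWMA of gradients)
    v = np.zeros_like(theta)                          # second moment (EWMA of squared gradients)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)                  # correct the bias toward zero
        v_hat = v / (1 - beta2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

grad = lambda th: np.array([2.0 * th[0], 20.0 * th[1]])
print(adam(grad, np.array([5.0, 5.0])))
```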