There are three main variants of gradient descent. In each approach, the first step is to randomly initialize the parameter values to kick off the optimization procedure.
- Batch Gradient Descent: In the batch implementation, the parameters are updated only after an entire pass, or epoch, through the data set. With a suitably small learning rate it converges to at least a local minimum, but it is slow on large data sets, since the parameters cannot be updated until every observation has been accounted for. Many passes through the data may therefore be required to reach convergence.
- Stochastic Gradient Descent: In the stochastic case, the parameters are updated after each instance in the data set is processed. Since an update happens after every observation, learning progresses faster than in batch gradient descent. However, because the gradient is computed on a single training example, it cannot exploit vectorization and the computational efficiency it provides. The path is also prone to oscillation as it approaches the minimum, so the algorithm does not truly converge but instead wanders around the region near the minimum. One option to address this deficiency is a dynamic learning rate that decays as the descent process approaches the minimum; a simple schedule is sketched after this list.
- Mini-Batch Gradient Descent: Mini-batch is a middle ground between full batch and stochastic gradient descent in which the parameters are updated after a subset of the data set is processed. If the subset size is 1, this reduces to stochastic gradient descent, and if it is the size of the full data set, it is equivalent to batch gradient descent. Because mini-batch can update the parameters before reaching the end of the data set and can still make use of vectorization, it runs faster than batch gradient descent. The batch size is a hyperparameter that usually needs to be tuned: larger batch sizes make the optimization behave more like batch gradient descent, while smaller sizes make it behave more like stochastic gradient descent. All three variants are illustrated in the sketch after this list.
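To make the relationship between the three variants concrete, here is a minimal NumPy sketch of gradient descent on a least-squares linear regression objective. The `gradient_descent` function and its parameters are illustrative rather than taken from any particular library: setting `batch_size` to 1 recovers stochastic gradient descent, leaving it at the full data set size recovers batch gradient descent, and anything in between is mini-batch gradient descent.

```python
import numpy as np

def gradient_descent(X, y, lr=0.01, epochs=100, batch_size=None, rng=None):
    """Minimize mean squared error for the linear model y ~ X @ w + b.

    batch_size=None uses the full data set (batch gradient descent),
    batch_size=1 gives stochastic gradient descent, and anything in
    between is mini-batch gradient descent.
    """
    rng = np.random.default_rng() if rng is None else rng
    m, n = X.shape
    w = rng.normal(size=n)   # random initialization of the parameters
    b = 0.0
    batch_size = m if batch_size is None else batch_size

    for epoch in range(epochs):
        # Shuffle once per epoch so batches are drawn in a random order.
        idx = rng.permutation(m)
        for start in range(0, m, batch_size):
            batch = idx[start:start + batch_size]
            X_b, y_b = X[batch], y[batch]
            # Gradient of the mean squared error on this batch.
            error = X_b @ w + b - y_b
            grad_w = 2 * X_b.T @ error / len(batch)
            grad_b = 2 * error.mean()
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b
```

In practice the only change between the variants is the `batch_size` argument, e.g. `gradient_descent(X, y)` for batch, `gradient_descent(X, y, batch_size=1)` for stochastic, and `gradient_descent(X, y, batch_size=32)` for mini-batch.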
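The dynamic learning rate mentioned for stochastic gradient descent can be as simple as an inverse-time decay. The schedule below is one common choice, with `lr0` and `decay` as assumed hyperparameters rather than values prescribed by the text.

```python
def inverse_time_decay(lr0, decay, step):
    """Shrink the learning rate as training progresses so updates near the
    minimum become smaller and the oscillation of SGD is damped."""
    return lr0 / (1 + decay * step)

# Example: starting at 0.1 with decay=0.01, the rate is halved by step 100.
lr = inverse_time_decay(lr0=0.1, decay=0.01, step=100)
```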