What is Overfitting?

Overfitting occurs when a model fails to perform at a similar level of accuracy on data that was not used in the training process compared to data that it was explicitly built on. This means that it does not generalize well to unseen data, which is a problem in situations where a model is used to produce predictions on future data. Overfitting usually occurs when a model is too complex for the data at hand, such as using settings of hyper-parameters that cause the model to fit to noise within the training data. 

Overfitting can be mitigated by doing the following:

  • Collect more training data: A larger data set allows a model to have access to a greater variety of data so that it will be able to have lower variance on future observations. However, if the additional data is not providing additional information to the model, more data alone will not necessarily improve performance. In some applications, it can be very time consuming to collect data, so simply seeking more observations might not be the most practical approach. The most important aspect regarding data size is to have enough data to sufficiently explore the feature space. 
  • Use regularization: Regularization reduces the complexity of the model by shrinking its coefficients closer to 0, especially for variables that are least predictive of the target. 
  • Simplify model hyperparameters (if applicable to algorithm): Similarly to regularization, simplifying model hyperparameters prevents a model from fitting to the training data so well that it does not generalize to a new sample of data. 
  • Implement early stopping: Early stopping reduces training time by terminating when the performance of the model does not improve by a substantial amount over several iterations. Using early stopping both improves efficiency and mitigates against overfitting.