What is Overfitting?

Related Questions:
– How to mitigate Overfitting?
– What is Underfitting?
– What is the Bias-Variance Tradeoff?

Overfitting occurs when a machine learning model becomes too complex and starts fitting the training data too closely. This causes the model to learn the noise and random fluctuations in the training data instead of the underlying patterns and relationships that are relevant to the problem being solved. As a result, the model may perform very well on the training data but poorly on new, unseen data.

The best way to identify whether a model is overfitting is to compare the training and test errors. Generally speaking, the training error is lower than the test error, and the goal of any machine learning model is to minimize both: a) the training error, and b) the gap between the training and test errors. If the training error is low but the gap between training and test error is large, the model is likely in the overfitting zone. The following figure illustrates how training and test error can help identify whether a model is underfitted, optimally fitted, or overfitted:

Comparing Training and Test error can help determine if a model is underfitted, optimally fitted, or overfitted (Source: Al-Behadili et al., Rule pruning techniques in the ant-miner classification algorithm and its variants: A review)
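To make this diagnostic concrete, here is a minimal sketch (assuming scikit-learn and a synthetic noisy sine dataset, neither taken from this article) that compares training and test error as polynomial model complexity grows:

```python
# Sketch: compare training vs. test error as model complexity increases.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=100)  # noisy target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 3, 15):  # underfit, reasonable fit, likely overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    # A low training error paired with a much larger test error signals overfitting.
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```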

One can think of overfitting as a form of “memorization” of the training data: the model becomes so specialized to the training examples that it loses its ability to generalize to new data.

Causes of Overfitting and How It Can Be Mitigated

There are several common causes of overfitting. One is using a model that is too complex for the given dataset. For example, a decision tree with too many levels or a neural network with too many layers and hidden units may be more complex than is necessary to solve the problem at hand. Similarly, having too many features can also make a model complex.
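As a brief illustration, the sketch below (scikit-learn on a synthetic classification task; the dataset and depth values are illustrative assumptions) shows how an unconstrained decision tree can fit the training set almost perfectly while scoring worse on held-out data than a shallower tree:

```python
# Sketch: a fully grown decision tree vs. a depth-limited one.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (None, 3):  # None = grow until leaves are pure (prone to overfit)
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    print(f"max_depth={depth}: train acc={tree.score(X_train, y_train):.2f}, "
          f"test acc={tree.score(X_test, y_test):.2f}")
```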

Another cause of overfitting is using too few examples to train the model, as this makes it harder to find the underlying patterns in the data. In that case, getting more training data can help.

In addition to these causes, overfitting can also occur when the model is not regularized properly. Regularization techniques, such as L1 or L2 regularization, add a penalty term to the model’s loss function that discourages it from fitting the training data too closely. Dropout, another regularization technique, randomly drops out some of the neurons in a neural network during training, which can help prevent the network from memorizing the training data.
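As a minimal sketch of how such a penalty changes the picture (using scikit-learn's Ridge for L2 regularization; the degree-12 polynomial and the alpha value are illustrative assumptions, not from this article):

```python
# Sketch: L2 (Ridge) regularization vs. no penalty on a noisy polynomial fit.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(1)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=60)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

for name, reg in [("no penalty", LinearRegression()),
                  ("L2 penalty", Ridge(alpha=1.0))]:  # alpha is illustrative
    model = make_pipeline(PolynomialFeatures(degree=12), reg)
    model.fit(X_train, y_train)
    # The penalized model typically generalizes better at high complexity.
    print(f"{name}: test MSE="
          f"{mean_squared_error(y_test, model.predict(X_test)):.3f}")
```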

Finally, overfitting can be mitigated by using cross-validation, which simulates evaluating the model’s performance on new, unseen data. Cross-validation involves splitting the dataset into multiple parts, training the model on some parts and evaluating its performance on the remaining parts. This helps ensure that the model can generalize well to new data and is not simply memorizing the training data.
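As a short sketch of this idea (using scikit-learn's cross_val_score on a synthetic dataset; the estimator and fold count are illustrative assumptions):

```python
# Sketch: 5-fold cross-validation averages performance over held-out folds.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(max_depth=3, random_state=0),
                         X, y, cv=5)  # 5 train/validation splits
print("fold accuracies:", scores.round(2), "mean:", scores.mean().round(2))
```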

| Causes | Mitigation |
| --- | --- |
| Complex model: too many features; too many parameters and layers (in the case of neural networks); hyperparameters such as tree depth in decision trees, the order of the polynomial used for regression, or SVM kernels | Reduce model complexity by: using fewer features (feature selection, dimensionality reduction); reducing the number of layers and dimensions in neural networks; using optimal hyperparameters for model complexity |
| Too few training examples | Get more training data |
| No regularization | Regularization techniques: L1 or L2 regularization; dropout; early stopping |
| No cross-validation | Use cross-validation to pick a model that generalizes well |

Overfitting Causes and Mitigation strategies (Source: AIML.com Research)

Miscellaneous: Would collecting more training data necessarily help with overfitting?

A larger dataset gives a model access to a greater variety of data, so it can achieve lower variance on future observations. However, if the additional data does not provide new information to the model, more data alone will not necessarily improve performance. In some applications, collecting data can be very time consuming, so simply seeking more observations might not be the most practical approach. The most important aspect of data size is having enough data to sufficiently explore the feature space.
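One hedged way to check whether more data would actually help is a learning curve; the sketch below uses scikit-learn's learning_curve on an illustrative synthetic dataset (the estimator and training-size grid are assumptions for demonstration):

```python
# Sketch: a learning curve shows whether adding data narrows the gap
# between training and validation scores.
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    train_sizes=[0.1, 0.3, 0.5, 0.7, 1.0], cv=5)

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    # If the validation score has plateaued, more data alone is unlikely to help.
    print(f"n={n:4d}  train acc={tr:.2f}  val acc={va:.2f}")
```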

Visual Explanation

The following infographic explains what the decision boundaries of overfitted classification and regression models look like in comparison to an optimally fitted model. It also summarizes the key characteristics of overfitting and how to mitigate them.

Decision boundaries for overfitted regression and classification models (Source: Wonseok Shin)

Video Explanations

For a quick introduction to what overfitting is and how to identify whether a model is overfitted, please see this explanation from IntuitiveML [Runtime: 1:40 mins]

What is overfitting?
