The assumptions of the linear regression model include independence, normality, constant variance, linearity, and little to no multicollinearity, as described below:
- Independence: the residuals are assumed to be independent draws from a normal distribution centered at 0 with an unknown but constant variance. Independence implies that there is no inherent autocorrelation in the data; if there were, a different variance structure would need to be specified. The only way to guarantee that the observations are sampled independently is to have knowledge of and control over the study design. As a proxy, a simple plot of the residuals against an index value, such as a counter from 1 to the total number of observations, can be examined for any noticeable trends that might indicate autocorrelation.
- Normality and constant variance of residuals: normality can be checked by plotting a histogram or QQ-plot of the residuals. The histogram should follow a mound shape with one peak around 0, and on the QQ-plot, the points should fall close to a line passing through the origin. Constant variance can be checked with a residuals vs. fitted values plot: regions of the plot that show noticeably more or less spread than others might be a sign of non-constant variance. A violation of normality might indicate that a transformation of the response variable is needed, while a violation of constant variance might indicate that weights need to be added to the variance structure. As a point of clarification regarding this and the previous assumption, the suggested diagnostics can only be created after the model is fit, because they rely on the residuals as input. Thus, if a more manual model selection procedure is being used, it is necessary to store the residuals and diagnostics of each fitted model so they can be compared against those of the others.
- Linearity: linear regression assumes that each predictor is linearly related to the target. However, linearity is only required in the coefficients, not in the data itself, so it is often necessary to transform the original values of a predictor in order to achieve linearity. This assumption is best checked by creating a scatterplot of each predictor against the target and then applying any needed transformations, such as the log or square root. Violations of linearity can result in a poorly specified model with high bias.
- Little to no multicollinearity: multicollinearity occurs when two or more predictors are highly correlated with one another. This makes it hard for the model to estimate the coefficients precisely, as it has difficulty isolating the exact effect of each variable. Signs of multicollinearity include a significant overall model with no individually significant coefficients, and unexpected coefficient signs, such as a negative sign on a variable that clearly should have a positive association with the target. Multicollinearity can be detected by examining a correlation matrix of the predictors and identifying pairs of predictors with high correlation, such as 0.7 or above in absolute value. It can also be measured through the Variance Inflation Factors (VIFs), where high values (especially 10 or above) indicate variables that are contributing to multicollinearity.
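
The multicollinearity checks in the last item can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a definitive implementation: the predictor names and values are made up, the helper functions (`pearson`, `correlation_matrix`, `invert`, `vifs`) are hypothetical names introduced here, and the VIFs are obtained from the identity that the VIF of predictor j equals the j-th diagonal entry of the inverse of the predictors' correlation matrix. The 0.7 and 10 cutoffs are the rules of thumb mentioned above.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(columns):
    """Pairwise correlations between predictor columns."""
    return [[pearson(a, b) for b in columns] for a in columns]

def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: perfect multicollinearity")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def vifs(columns):
    """VIF of each predictor: diagonal of the inverse correlation matrix."""
    inv = invert(correlation_matrix(columns))
    return [inv[j][j] for j in range(len(columns))]

# Made-up predictors: x2 is roughly 2 * x1, x3 is unrelated to both.
names = ["x1", "x2", "x3"]
predictors = [
    [1, 2, 3, 4, 5, 6, 7, 8],
    [3, 3, 7, 7, 11, 11, 15, 15],
    [1, -1, -1, 1, 1, -1, -1, 1],
]

corr = correlation_matrix(predictors)
# Pairs whose absolute correlation reaches the 0.7 rule of thumb.
high_pairs = [(names[i], names[j])
              for i in range(len(names))
              for j in range(i + 1, len(names))
              if abs(corr[i][j]) >= 0.7]
vif_values = vifs(predictors)
# Predictors whose VIF reaches the rule-of-thumb cutoff of 10.
flagged = [name for name, v in zip(names, vif_values) if v >= 10]

print("highly correlated pairs:", high_pairs)
print("VIFs:", dict(zip(names, vif_values)))
print("flagged by VIF:", flagged)
```

For this made-up data the correlation between `x1` and `x2` is about 0.98, so the pair is reported, and the VIFs come out to about 21 for `x1` and `x2` and 1 for `x3`. In practice a statistics library would normally be used for this; the sketch only makes the arithmetic visible.
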
This dependence on several assumptions is one of the perceived drawbacks of linear regression compared with more modern machine learning techniques such as decision trees: the results can only be relied on when the assumptions hold.
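
The residual-based checks for independence and constant variance described above are plots, but rough numeric stand-ins can be sketched without a plotting library. The example below is a sketch under stated assumptions: the data are made up, the model is a one-predictor regression fit in closed form, and the helper names (`fit_simple_ols`, `durbin_watson`, `spread_ratio`) are hypothetical. The Durbin-Watson statistic is substituted here for the residuals vs. index plot (values near 2 suggest little lag-1 autocorrelation), and a ratio of mean squared residuals between the upper and lower halves of the fitted values is substituted for the residuals vs. fitted plot (values far from 1 suggest non-constant variance). Neither replaces looking at the plots.

```python
def fit_simple_ols(x, y):
    """Closed-form least squares fit of y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope

def durbin_watson(residuals):
    """Sum of squared successive differences over sum of squared residuals."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2
              for i in range(1, len(residuals)))
    den = sum(r * r for r in residuals)
    return num / den

def spread_ratio(fitted, residuals):
    """Mean squared residual in the upper half of the fitted values
    divided by that in the lower half."""
    ordered = [r for _, r in sorted(zip(fitted, residuals))]
    half = len(ordered) // 2
    low, high = ordered[:half], ordered[-half:]
    mean_sq = lambda values: sum(r * r for r in values) / len(values)
    return mean_sq(high) / mean_sq(low)

# Made-up data, listed in the order the observations were collected.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [3.5, 4.5, 6.5, 9.5, 11.5, 12.5, 14.5, 17.5]

intercept, slope = fit_simple_ols(x, y)
fitted = [intercept + slope * a for a in x]
# Residuals only exist once the model has been fit.
residuals = [b - f for b, f in zip(y, fitted)]

dw = durbin_watson(residuals)
ratio = spread_ratio(fitted, residuals)

print("intercept, slope:", intercept, slope)
print("Durbin-Watson:", dw)
print("spread ratio (upper half / lower half):", ratio)
```

Storing `residuals`, `dw`, and `ratio` for each candidate model makes it possible to compare diagnostics across fits during a manual model selection procedure.
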