*Related articles:*
*– What is Linear Regression?*
*– What is Supervised Learning?*

### Estimation of Regression Coefficients

The two most common approaches for deriving the coefficients of a linear regression model are Ordinary Least Squares (OLS) and Maximum Likelihood Estimation (MLE). In the standard setup of linear regression, where the assumptions of independent, identically distributed Normal residuals with constant variance are satisfied, the OLS and MLE estimates are the same.

### Ordinary Least Squares (OLS):

Theoretically, there are infinitely many potential regression lines that could be fit to model the relationship between the independent variable X and the target variable Y. The concept behind OLS is to find the "best" line for the data, which by this standard is the one that minimizes the sum of squared residuals (SSR), i.e., the sum of the squared differences between the actual values of the dependent variable and the model's predictions.

There are two main reasons that the objective uses the sum of the squared residuals rather than the sum of the absolute residuals. The first is that squaring makes each term non-negative, so positive residuals do not cancel out negative residuals when summed. The second is that the square function magnifies the contribution, or penalty, of exceedingly large residuals compared to taking the absolute difference, thus theoretically resulting in a better-fit model overall. The following graphic illustrates how a regression line would be fit using the OLS algorithm.
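The "pick the line with the smallest SSR" idea can be sketched in a few lines of code. The data and the candidate lines below are invented purely for illustration:

```python
import numpy as np

# Toy data (hypothetical): y is roughly 2 + 3x plus a little noise
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 5.2, 7.9, 11.1, 13.8])

def ssr(intercept, slope):
    """Sum of squared residuals for the line y_hat = intercept + slope * x."""
    residuals = y - (intercept + slope * x)
    return np.sum(residuals ** 2)

# Evaluate a few candidate lines; OLS is the line with the smallest SSR
candidates = [(2.0, 3.0), (0.0, 3.5), (5.0, 2.0)]
for b0, b1 in candidates:
    print(f"intercept={b0}, slope={b1}, SSR={ssr(b0, b1):.3f}")
```

Rather than checking a handful of candidates, OLS finds the exact minimizer in closed form, as derived below.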

The mathematical setup for OLS begins with the standard regression equation, which in the case of simple linear regression (one X variable) is:

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$

or, in matrix form,

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$

From this model, the sum of squared residuals (SSR) is written as a function of the coefficients, and the estimates of the model coefficients are found by taking the partial derivatives of SSR with respect to each coefficient and setting them to 0, solving what are called the normal equations. In standard matrix form, the OLS solution is written as:

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{y}$$

Since this is a closed-form solution, the formula above holds even when X contains many features, i.e., for a higher-dimensional feature space.

### Maximum Likelihood Estimation (MLE):

Maximum likelihood is a common statistical estimation technique that chooses the coefficient values that maximize the likelihood function. Conceptually, MLE derives the parameter estimates under which the data observed by the model would have been most likely, given its underlying statistical assumptions; in the case of linear regression, the assumption is that the residuals follow a normal distribution with some unknown but constant variance.

Derivation of the MLE estimates for a simple linear regression begins with the likelihood function *L* for *n* independent and identically distributed (*iid*) observations from a normal distribution, which is mathematically represented by the product of the probability density functions (PDFs) of said distribution:

$$L(\beta_0, \beta_1, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}\, \exp\!\left(-\frac{(y_i - \beta_0 - \beta_1 x_i)^2}{2\sigma^2}\right)$$

In order to simplify the computation, the logarithm of the likelihood function is taken. Because the log is a monotonic transformation, it preserves the location of the maximum while improving computational stability:

$$\ell(\beta_0, \beta_1, \sigma^2) = -\frac{n}{2}\log\!\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - \beta_0 - \beta_1 x_i\right)^2$$

Using the log-likelihood, the partial derivatives with respect to $\beta_0$ and $\beta_1$ are taken, and the equations are set equal to 0 in order to solve for the estimates of the respective betas.

For $\beta_0$:

$$\frac{\partial \ell}{\partial \beta_0} = \frac{1}{\sigma^2}\sum_{i=1}^{n}\left(y_i - \beta_0 - \beta_1 x_i\right) = 0$$

And for $\beta_1$:

$$\frac{\partial \ell}{\partial \beta_1} = \frac{1}{\sigma^2}\sum_{i=1}^{n} x_i\left(y_i - \beta_0 - \beta_1 x_i\right) = 0$$

After some algebra and simplification, the MLE estimates for $\beta_0$ and $\beta_1$ can be written as:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
$$
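These closed-form estimates are easy to verify in code. In the sketch below (toy data invented for illustration), the hand-computed estimates are compared against `np.polyfit`, which fits the same model by least squares:

```python
import numpy as np

# Hypothetical data, invented for illustration
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 5.2, 7.9, 11.1, 13.8])

# Closed-form estimates: beta1 = S_xy / S_xx, beta0 = y_bar - beta1 * x_bar
x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

# Reference fit from a library least-squares routine
slope_ref, intercept_ref = np.polyfit(x, y, deg=1)

print(beta0_hat, beta1_hat)      # hand-computed estimates
print(intercept_ref, slope_ref)  # library fit; should agree
```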

Maximum likelihood estimation has some desirable statistical properties. Its estimates are consistent, which means that as the sample size of the data increases, the estimates produced by MLE converge to the true values of their respective parameters. They are also efficient, meaning that MLE produces estimates with standard errors at least as small as those of any other unbiased coefficient estimator.

While it is hard to summarize how MLE works in one visualization, the following curves provide some insight. The histogram represents observed data that is assumed to be generated from some type of distribution, such as normal with an unknown mean and variance. Each curve represents a different estimate for the underlying mechanism that produced the data shown in the histogram. The blue curve appears to be a poor model for this data, meaning the parameter estimates it chose (mean 30.1, sd 3.8) do not seem to be the most likely values of the actual mean and variance. Of the four curves, the brown curve (mean 30.5, sd 4.3) seems to provide the best fit, so if there were only four possible models that could have generated this data, that one would probably be considered the most likely.

### Conclusion

As stated, under the assumptions of linear regression, the OLS estimates and the MLE estimates for the coefficients are equivalent. This equivalence is contingent on the assumption of normally distributed residuals: in the log-likelihood, the only term that depends on the coefficients is proportional to the negative of the sum of squared residuals, so maximizing the likelihood is the same as minimizing the SSR, thus showing the connection to OLS estimation.
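This equivalence can be checked numerically. The sketch below (with hypothetical data and a fixed, arbitrary sigma) shows that a line with a smaller SSR always has a higher Gaussian log-likelihood, since the two differ only by a constant and a negative scale factor:

```python
import numpy as np

# Hypothetical data and residual standard deviation, for illustration only
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 5.2, 7.9, 11.1, 13.8])
sigma = 1.0

def ssr(b0, b1):
    """Sum of squared residuals for the line y_hat = b0 + b1 * x."""
    return np.sum((y - (b0 + b1 * x)) ** 2)

def log_likelihood(b0, b1):
    """Gaussian log-likelihood: a constant minus SSR / (2 * sigma^2)."""
    n = len(y)
    return -n / 2 * np.log(2 * np.pi * sigma**2) - ssr(b0, b1) / (2 * sigma**2)

# A line with smaller SSR has strictly higher log-likelihood
print(ssr(2.0, 3.0), log_likelihood(2.0, 3.0))
print(ssr(5.0, 2.0), log_likelihood(5.0, 2.0))
```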

### Video Explanation

- Through clear visualizations and graph-based explanations, the video titled 'Fitting a line to data' from StatQuest delves into how the coefficients of linear regression are derived via the least squares method. (*Runtime: 9 mins*)

Deriving linear regression coefficients using Ordinary Least Squares method by Statquest

- In the following two videos from the NPTEL MOOC series on Linear Regression, Prof. Shalabh of IIT Kanpur explains the complete derivation of how linear regression coefficients are estimated, using both the Least Squares method and Maximum Likelihood Estimation.

- Derivation of coefficients via the Ordinary Least Squares (OLS) approach

- Derivation of coefficients via the Maximum Likelihood Estimation (MLE) approach