How are coefficients of linear regression estimated?

Linear regression estimates its coefficients by minimizing the sum of squared differences between the actual values of the response and the values predicted by the model.

In the case of simple linear regression, the model takes the following theoretical form:

Yi = β0 + β1Xi + εi, where i = 1, …, n, and εi ~ N(0, σ²)

where Yi refers to the ith observation of the target variable,

β0 is the intercept term, or the value of the target when the predictor variable is set to 0,

β1 is the coefficient of the predictor, or the average change in the response for a one-unit change in the value of the predictor,

Xi  is the ith value of the predictor variable,

And εi is the error term (whose estimate is the residual), or the unexplained effect of everything else not accounted for in the behavior of the response through the predictor X.
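As a concrete illustration, here is a minimal Python sketch that simulates data from exactly this model. The parameter values (β0 = 2, β1 = 0.5, σ = 1) are arbitrary choices for the example, not anything implied by the model itself:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 100
beta0, beta1, sigma = 2.0, 0.5, 1.0   # illustrative values, chosen arbitrarily

x = rng.uniform(0, 10, size=n)        # predictor values Xi
eps = rng.normal(0, sigma, size=n)    # error terms eps_i ~ N(0, sigma^2)
y = beta0 + beta1 * x + eps           # responses Yi = beta0 + beta1*Xi + eps_i
```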

The process of obtaining estimates for the β coefficients is referred to as least squares estimation. In essence, the procedure searches for the line (by determining the values of the intercept and slope) that provides the best fit through all of the observations. With more than one predictor, everything about linear regression still holds, but the optimization is carried out in a multivariate, higher-dimensional space. While the equation above illustrates the form of a linear regression model for a single predictor, for generalization to multiple predictors the model is usually written in matrix form:

Y = Xβ + ε

where Y is an n×1 vector of response values,

X is an n×p feature matrix, or design matrix, whose first column is typically all ones so that the intercept is estimated along with the slopes,

And ε is an n×1 vector of errors.
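Continuing the sketch above, the design matrix can be assembled by prepending a column of ones to the predictor, which is what lets the intercept be estimated alongside the slopes:

```python
# Design matrix with a leading column of ones for the intercept:
# X is n x p (here p = 2: intercept + one predictor), Y is n x 1.
X = np.column_stack([np.ones_like(x), x])
Y = y.reshape(-1, 1)
```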

The process of solving the least squares normal equations produces the estimates of the slopes and intercept, often referred to as the “beta-hats,” which, in matrix form, are given by

β̂ = (X′X)⁻¹X′Y, which is a p×1 vector of estimates
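This formula translates almost verbatim into NumPy. A sketch, continuing from the arrays built above; in practice np.linalg.lstsq (an SVD-based solver) is preferred to forming the inverse explicitly, since the two are algebraically equivalent but lstsq is more numerically stable:

```python
# Direct translation of the normal-equations formula:
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ Y
print(beta_hat.ravel())               # close to the true (2.0, 0.5) used above

# Numerically preferred equivalent (avoids forming the explicit inverse):
beta_hat_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
```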

Because this formula combines information from both X and Y, even in a high-dimensional feature space the coefficient estimates are the weights that best capture the linear relationship between the features and the response, in the sense of minimizing the sum of squared vertical distances from the observed values to the fitted values. The parameter estimates can also be found using maximum likelihood estimation, which seeks the parameter values that maximize the likelihood of generating the observed data; under the normal-error assumption above, maximum likelihood yields exactly the same coefficient estimates as least squares.
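To illustrate the maximum likelihood route, here is a hedged sketch that numerically minimizes the negative Gaussian log-likelihood with SciPy; under the normal-error assumption, the recovered coefficients should agree with the least squares β̂ computed above:

```python
from scipy.optimize import minimize

def neg_log_likelihood(params, X, y):
    """Negative Gaussian log-likelihood (additive constants dropped)."""
    *beta, log_sigma = params            # log-parameterize sigma to keep it positive
    sigma = np.exp(log_sigma)
    resid = y - X @ np.asarray(beta)
    return len(y) * np.log(sigma) + np.sum(resid ** 2) / (2 * sigma ** 2)

res = minimize(neg_log_likelihood, x0=np.zeros(X.shape[1] + 1),
               args=(X, y), method="BFGS")
beta_mle = res.x[:-1]                    # matches beta_hat from least squares
```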