What problems would arise from using a regular linear regression to model a binary outcome?

  • Predicted values would not be constrained to the range of [0,1], resulting in predictions that are not valid probabilities. 
  • The outcome only takes on values of 0 and 1, and thus it is not a continuous, normal distribution. Thus, a regression line would seek to fit something that connects the observations with a label of 0 to those having a label of 1, which is a poor representation for such a setup.
  • Since the theoretical variance of a Bernoulli distribution is p(1-p), it is a function of the mean, and thus it does not exhibit constant variance (one of the assumptions for linear regression model to hold true)