Among the common machine learning algorithms, which require feature scaling, and which do not?

As a general rule of thumb, if any component of the algorithm's objective function involves a distance measure, either between observations or from observations to a central location, the data should be scaled before training. If the algorithm is rule-based, such as a decision tree, scaling is not necessary. Even when there is no explicit need, scaling the data is not wrong in itself, but the scale must be kept in mind when interpreting the results. Using this heuristic, the following is a non-exhaustive mapping of where some of the most common algorithms fit.

Scaling is Necessary

  • Neural Networks (mainly to help the gradient descent optimizer converge)
  • Regularized Regression (Ridge, LASSO, Elastic Net, etc.)
  • Support Vector Machine
  • K-Nearest Neighbors
  • K-Means
  • Dimensionality Reduction (PCA, Factor Analysis)
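
To illustrate why distance-based methods such as K-Nearest Neighbors are sensitive to scale, here is a minimal sketch using only the Python standard library. The data (income in dollars, age in years), the labels, and the helper names `nearest` and `standardizer` are all made up for illustration. On the raw data the income column dominates the Euclidean distance; after z-scoring both columns, age gets a comparable say.

```python
import math
import statistics

# Hypothetical toy data: (annual income in dollars, age in years).
points = {
    "A": (50500.0, 60.0),
    "B": (52000.0, 26.0),
    "C": (30000.0, 40.0),
    "D": (80000.0, 30.0),
    "E": (65000.0, 50.0),
}
query = (50000.0, 25.0)


def nearest(query, points):
    """Return the label of the point closest to query in Euclidean distance."""
    return min(points, key=lambda name: math.dist(query, points[name]))


def standardizer(points):
    """Build a function that z-scores a row using the points' per-column mean and std."""
    columns = list(zip(*points.values()))
    means = [statistics.mean(c) for c in columns]
    stds = [statistics.pstdev(c) for c in columns]
    return lambda row: tuple((v - m) / s for v, m, s in zip(row, means, stds))


scale = standardizer(points)
scaled_points = {name: scale(row) for name, row in points.items()}

nearest_raw = nearest(query, points)                    # income dominates
nearest_scaled = nearest(scale(query), scaled_points)   # both features count
print(nearest_raw, nearest_scaled)
```

With these numbers the raw nearest neighbor is A (close in income, 35 years apart in age), while the scaled nearest neighbor is B (close on both features), so the answer a K-NN model gives depends on whether the data was scaled.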

Scaling is Not Necessary

  • Ordinary Regression (linear regression and GLMs without regularization)
    • However, if the model is fit with gradient descent, scaling the data helps convergence.
  • Decision Tree Methods (CART, Random Forest, GBM, etc.)
  • Naive Bayes
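
As a rough sketch of why tree-based methods do not need scaling, the following pure-Python example fits a single decision stump (one split chosen by weighted Gini impurity) on a raw feature and on a min-max-scaled copy. The data and the helpers `gini` and `best_split` are hypothetical and simplified for illustration; real CART implementations are more involved. Because a split depends only on the ordering of the values, the threshold changes with the scale but the resulting partition of the samples does not.

```python
def gini(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def best_split(x, y):
    """Return (threshold, left_indices) for the split with the lowest weighted Gini."""
    values = sorted(set(x))
    best = None
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2.0
        left = [i for i, v in enumerate(x) if v <= t]
        right = [i for i, v in enumerate(x) if v > t]
        score = (
            len(left) * gini([y[i] for i in left])
            + len(right) * gini([y[i] for i in right])
        ) / len(x)
        if best is None or score < best[0]:
            best = (score, t, left)
    return best[1], best[2]


# Hypothetical one-feature dataset with binary labels.
x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [0, 0, 0, 1, 1, 1]

# Min-max scale the feature to the range [0, 1].
x_min, x_max = min(x), max(x)
x_scaled = [(v - x_min) / (x_max - x_min) for v in x]

t_raw, left_raw = best_split(x, y)
t_scaled, left_scaled = best_split(x_scaled, y)
print(t_raw, left_raw)
print(t_scaled, left_scaled)
```

Both calls send the first three samples to the left branch; only the numeric threshold differs, which is why scaling leaves the fitted tree's predictions unchanged.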