What is Feature Scaling? Explain the different feature scaling techniques

Feature scaling is a data preprocessing technique that transforms the numerical values of features onto a common scale to make them more suitable for modeling. It is needed because features can have vastly different scales and ranges, which can hurt the performance of machine learning algorithms that are sensitive to the scale of their inputs.

Why is feature scaling done?

  1. Improve model performance: Scaling is a critical preprocessing step for distance-based algorithms such as KNN and for variance-based methods such as PCA. A model fitted to unscaled features can differ substantially from one fitted to scaled features, because features with larger value ranges dominate the distance calculations and so have more influence during training. Scaling brings all features to a comparable range so that each contributes on a similar footing and no single feature dominates. For such algorithms this often yields better-defined decision boundaries and improved performance. The following picture illustrates this concept.
Figure: KNN with and without scaling. KNN fitted to the non-scaled (left) and scaled (right) Wine Recognition dataset from UCI. Feature scaling leads to a completely different model with a more defined decision boundary. (Source: scikit-learn website)

2. Faster convergence: Scaling speeds up convergence in models trained with gradient descent, such as neural networks. The parameters θ descend quickly along features with small ranges and slowly along features with large ranges, so when the feature ranges are very uneven the updates oscillate inefficiently on the way to the optimum. With features on a comparable scale, a single learning rate works well for all parameters.

3. Other reasons for scaling:

  • In regression, centering the predictors so that each has a mean of 0 is often recommended. The intercept can then be interpreted as the expected value of Y when the predictors are at their means.
  • When using Lasso or Ridge regression. Both methods penalize the size of the coefficients, and the size of each coefficient depends on the scale of its variable. The variables should therefore be standardized (centered and scaled to unit variance) so that the penalty treats them comparably.
  • When you're trying to sum or average variables that are on different scales, perhaps to create a composite variable.
  • When creating power terms. Say you have a variable 𝑋 and you want to create an 𝑋² term. If you don't center 𝑋 first, the squared term will be highly correlated with 𝑋, which could muddy the estimation of the beta coefficients.
  • When creating interaction terms. If an interaction term is created from two variables that are not centered on 0, some amount of collinearity will be induced.
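
The point about power terms can be checked numerically. The sketch below uses only the Python standard library; the data values and the `pearson` helper are made up for illustration. It compares the correlation between 𝑋 and 𝑋² before and after centering.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical predictor whose values all lie far from zero.
x = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

# Correlation between X and X^2 on the raw values.
raw_corr = pearson(x, [v ** 2 for v in x])

# Center X on its mean, then build the squared term from the centered values.
mean_x = sum(x) / len(x)
x_centered = [v - mean_x for v in x]
centered_corr = pearson(x_centered, [v ** 2 for v in x_centered])

print(f"corr(X, X^2) without centering: {raw_corr:.3f}")
print(f"corr(X, X^2) after centering:   {centered_corr:.3f}")
```

The example data is symmetric around its mean, which is why centering removes the correlation entirely here. With skewed data the correlation is typically reduced rather than eliminated.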

How to do feature Scaling?

Feature scaling commonly uses two techniques: Normalization and Standardization. There are several methods under each of these categories, but two of the most common ones are:

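  1. Min-max scaling (normalization): This technique rescales each feature to a fixed range, usually [0, 1], although other ranges such as [-1, 1] or [0, 5] can be used. Because it is a linear transformation, it preserves the shape of the original distribution, but it is sensitive to outliers: a single extreme value sets the minimum or maximum and compresses the remaining values into a narrow band. The formula for min-max scaling to [0, 1] is given below:
    $$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
  2. Standardization (z-score scaling): This technique rescales each feature to have zero mean and unit variance. It is also a linear transformation, so it does not change the shape of the distribution or make the data normal. It has no fixed output range and is usually less distorted by outliers than min-max scaling, but it is not truly robust to them, because the mean and standard deviation are themselves affected by extreme values. The standardization formula is given below, where μ is the feature's mean and σ its standard deviation:
    $$z = \frac{x - \mu}{\sigma}$$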
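
As a minimal sketch, the two techniques above can be written in plain Python. The function names `min_max_scale` and `standardize` and the sample `ages` values are illustrative assumptions, not part of any library. In practice a library scaler would normally be fitted on the training data only and then reused on the test data.

```python
import statistics

def min_max_scale(values, lo=0.0, hi=1.0):
    """Rescale values linearly so the minimum maps to lo and the maximum to hi."""
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        raise ValueError("min-max scaling is undefined for a constant feature")
    return [lo + (v - v_min) * (hi - lo) / span for v in values]

def standardize(values):
    """Shift values to zero mean and rescale to unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("standardization is undefined for a constant feature")
    return [(v - mean) / std for v in values]

# Hypothetical feature values.
ages = [20, 30, 40, 50, 60]
print("min-max:     ", min_max_scale(ages))
print("standardized:", [round(z, 3) for z in standardize(ages)])

# A single extreme value squeezes the other min-max scaled values toward 0.
with_outlier = ages + [1000]
print("min-max with outlier:", [round(v, 3) for v in min_max_scale(with_outlier)])
```

The last line shows the outlier sensitivity of min-max scaling: once the value 1000 is added, the original five values all land in a narrow band near 0.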

The following image illustrates the difference between standardization and normalization.

Which Feature Scaling method to choose when?

The following infographic can serve as a useful reference point in deciding which scaling method to use under different circumstances.

Feature Scaling techniques (Source: research)

As with other machine learning techniques, it may be necessary to experiment with different methods and select the one that yields the best performance.

Common Machine Learning Algorithms using Feature Scaling

Essential for algorithms that rely on distances, variances, or gradient descent:

  • K Nearest Neighbor
  • Support Vector Machines
  • K-means Clustering
  • Neural Networks (ANN, CNN, RNN, LSTM)
  • PCA

Recommended for:

  • Logistic Regression
  • Linear Regression

Less important for:

  • Tree Based algorithms such as Decision Trees, Random Forest, Bagging
  • Gradient Boosted Decision Tree
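
To illustrate why distance-based methods sit in the "essential" group above, here is a small sketch using only the standard library. The data (age in years, annual income in dollars) and the helper names are hypothetical. It finds the nearest neighbor of the first person by Euclidean distance, before and after standardizing each column.

```python
import math
import statistics

def standardize_columns(rows):
    """Standardize each column to zero mean and unit (population) standard deviation."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in rows]

def nearest_index(rows, query_index):
    """Index of the row closest (Euclidean distance) to rows[query_index]."""
    query = rows[query_index]
    candidates = [i for i in range(len(rows)) if i != query_index]
    return min(candidates, key=lambda i: math.dist(rows[i], query))

# Hypothetical data: (age in years, annual income in dollars).
people = [
    (25, 50_000),  # 0: the query point
    (26, 52_000),  # 1: almost the same age, slightly different income
    (60, 50_500),  # 2: very different age, almost the same income
    (40, 90_000),  # 3
    (35, 30_000),  # 4
]

raw_neighbor = nearest_index(people, 0)
scaled_neighbor = nearest_index(standardize_columns(people), 0)

print("nearest neighbor on raw features:   ", raw_neighbor)
print("nearest neighbor on scaled features:", scaled_neighbor)
```

On the raw features the income column dominates the distance, so the 60-year-old with a nearly identical income is picked. After standardization both columns contribute, and the 26-year-old becomes the nearest neighbor.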

Feature Scaling requirements for different Machine Learning Algorithms

As a rule of thumb, it’s advisable to scale your features if your algorithm involves computing distance or uses gradient descent as the learning algorithm. Scaling can significantly impact the performance of a machine learning model. It’s worth experimenting with various scaling techniques to achieve the desired level of performance.
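
The gradient descent half of this rule of thumb can be sketched with a toy linear regression in plain Python. Everything here is an illustrative assumption: the synthetic data, the helper names, and the two learning rates. The raw features need a very small learning rate, since noticeably larger ones diverge in this setup, while the standardized features tolerate a much larger one.

```python
import statistics

def mse(X, y, w, b):
    """Mean squared error of the linear model b + w.x on (X, y)."""
    return sum((b + sum(wj * xj for wj, xj in zip(w, row)) - t) ** 2
               for row, t in zip(X, y)) / len(X)

def gradient_descent(X, y, lr, steps):
    """Batch gradient descent on MSE starting from zero weights; returns the final loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        errors = [b + sum(wj * xj for wj, xj in zip(w, row)) - t
                  for row, t in zip(X, y)]
        grad_b = 2 * sum(errors) / n
        grad_w = [2 * sum(e * row[j] for e, row in zip(errors, X)) / n
                  for j in range(d)]
        b -= lr * grad_b
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
    return mse(X, y, w, b)

def standardize_columns(rows):
    """Standardize each column to zero mean and unit (population) standard deviation."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in rows]

# Synthetic, noise-free data with two features on very different scales.
n = 20
x1 = [i / (n - 1) for i in range(n)]                     # range [0, 1]
x2 = [1000 * ((7 * i) % n) / (n - 1) for i in range(n)]  # range [0, 1000]
X_raw = list(zip(x1, x2))
y = [1 + 3 * a + 0.002 * b for a, b in X_raw]

# Same number of steps for both runs; only the scaling and the learning rate differ.
loss_raw = gradient_descent(X_raw, y, lr=1e-6, steps=2000)
loss_scaled = gradient_descent(standardize_columns(X_raw), y, lr=0.1, steps=2000)

print(f"final loss on raw features:          {loss_raw:.4f}")
print(f"final loss on standardized features: {loss_scaled:.2e}")
```

With the same step budget, the run on standardized features reaches a loss close to zero, while the run on raw features is still far from the optimum because its learning rate is limited by the large-scale feature.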
