Feature scaling is a data preprocessing technique that involves transforming the numerical values of features to a standardized scale to make it more suitable for modeling. This process is necessary because different features can have vastly different scales and ranges of values, which can negatively impact the performance of machine learning algorithms that are sensitive to the scale of the input data.
Why is feature scaling done?
- Improve model performance: Scaling is a critical data-processing step in distance-based machine learning algorithms like KNN, PCA, etc. Failure to scale features can result in a model fit that is substantially different from the one obtained with unscaled data. This is because unscaled data can lead to the domination of features with higher value ranges during distance calculations, making them more influential during model training. Therefore, feature scaling is necessary to bring all features in a dataset to a comparable scale, ensuring that each feature contributes equally to the model and that no one feature dominates it. This, in turn, creates more defined decision boundaries and improves model performance. The following picture illustrates this concept.
2. Faster convergence: Scaling is essential for achieving faster convergence, particularly in gradient descent-based models such as neural networks. By scaling, we can accelerate the gradient descent algorithm’s speed because θ descends rapidly on small ranges and slowly on large ranges, leading to inefficient oscillations when variables are significantly uneven.
3. Other reasons for scaling:
- In regression, scaling is often recommended so that the predictors have a mean of 0. This makes it easier to interpret the intercept term as the expected value of Y when the predictor values are set to their means.
- when using Lasso or Ridge regression. Lasso puts constraints on the size of the coefficients associated to each variable. However, this value will depend on the magnitude of each variable. It is therefore necessary to center and reduce, or standardize, the variables. This is true for Ridge regression also
- when you’re trying to sum or average variables that are on different scales, perhaps to create a composite variable
- when creating power terms. Let’s say you have a variable, 𝑋, and you want to create an 𝑋2 term. If you don’t center 𝑋 first, your squared term will be highly correlated with 𝑋, which could muddy the estimation of the beta
- when creating interaction terms. If an interaction term is created from two variables that are not centered on 0, some amount of collinearity will be induced
How to do feature Scaling?
Feature scaling commonly uses two techniques: Normalization and Standardization. There are several methods under each of these categories, but two of the most common ones are:
- Min-max scaling: This technique scales the data to a specific range (usually between 0 and 1) or [-1,1] or [0,5] etc. Min-max scaling preserves the original distribution of the data but may be sensitive to outliers. The formula for min-max scaling is given below:
- Standardization: This technique scales the data to have zero mean and unit variance. Standardization is robust to outliers but may change the distribution of the data. The standardization formula is given below:
Following image illustrates the difference between Standardization and Normalization
Which Feature Scaling method to choose when?
The following infographic can serve as a useful reference point in deciding which scaling method to use under different circumstances.
Like with other machine learning techniques, it may be necessary to experiment with different methods and select the one that yields the best performance.
Common Machine Learning Algorithms using Feature Scaling
Essential for algorithms using Distance methods or Gradient Descent:
- K Nearest Neighbor
- Support Vector Machines
- K-means Clustering
- Neural Networks (ANN, CNN, RNN, LSTM)
- Logistic Regression
- Linear Regression
Less important for:
- Tree Based algorithms such as Decision Trees, Random Forest, Bagging
- Gradient Boosted Decision Tree
As a rule of thumb, it’s advisable to scale your features if your algorithm involves computing distance or uses gradient descent as the learning algorithm. Scaling can significantly impact the performance of a machine learning model. It’s worth experimenting with various scaling techniques to achieve the desired level of performance.