Discretization is the process of binning a continuous variable into a finite number of buckets. Some machine learning algorithms perform better with this kind of representation, especially when outliers on the original scale of the variable skew its distribution. Several techniques can be used to discretize a continuous variable; some of the most common are listed below:
- Equal frequency bins: This technique creates buckets that each contain roughly the same number of observations, which yields a balanced distribution across the new categories. However, it does not guarantee that the bin endpoints are evenly spaced.
- Equal width bins: This technique works like equal frequency binning, except that the spacing between the endpoints of each bin is roughly equal. However, the number of observations falling into each bucket is no longer constrained, so some buckets may contain very few data points. The first sketch after this list contrasts these two approaches.
- Decision tree discretization: This method fits a decision tree using the candidate variable as the input and the target as the output. The splits created while growing the tree become the endpoints of the discretized variable, and the derived values are usually the average of the target within each leaf. This technique inherits the advantages that decision trees offer in a supervised learning setting, but unlike the previous two methods, it requires the target variable during discretization. The second sketch after this list illustrates this approach.
- Subject matter knowledge: Depending on the context of the problem, domain knowledge may be more useful than any quantitative technique for choosing bucket boundaries. Ultimately, most data science projects are only valuable when they are applied in a context that benefits the stakeholders.
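
As a minimal sketch of the first two techniques, the example below uses pandas' `qcut` (equal frequency) and `cut` (equal width) on a simulated right-skewed variable; the variable name, bin count, and data are illustrative assumptions rather than anything from a specific project.

```python
import numpy as np
import pandas as pd

# Simulated right-skewed variable (e.g. incomes); purely illustrative
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=10, sigma=1, size=1_000), name="income")

# Equal frequency: each of the 4 bins holds ~250 observations,
# but the bin widths vary widely on the original scale
eq_freq = pd.qcut(income, q=4, labels=["q1", "q2", "q3", "q4"])
print(eq_freq.value_counts())

# Equal width: every bin spans the same range on the original scale,
# so for skewed data most observations pile into the lowest bins
eq_width = pd.cut(income, bins=4)
print(eq_width.value_counts())
```

The `value_counts` output makes the trade-off visible: `qcut` balances the counts while `cut` balances the widths.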
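
For decision tree discretization, the sketch below fits a shallow `DecisionTreeRegressor` from scikit-learn and uses its predictions, which are the per-leaf target means, as the discretized values; the cap of four leaves and the synthetic target are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = rng.lognormal(mean=10, sigma=1, size=1_000)
y = np.log(x) + rng.normal(scale=0.5, size=1_000)  # synthetic target

# Capping the leaf count caps the number of buckets created
tree = DecisionTreeRegressor(max_leaf_nodes=4, random_state=0)
tree.fit(x.reshape(-1, 1), y)

# Each prediction is the mean target in the observation's leaf,
# i.e. the discretized value of the variable
x_discrete = tree.predict(x.reshape(-1, 1))
print(np.unique(x_discrete))  # at most 4 distinct values

# Internal split thresholds are the bin endpoints; leaf nodes are
# marked with the sentinel value -2 in scikit-learn's tree arrays
endpoints = tree.tree_.threshold[tree.tree_.threshold != -2]
print(np.sort(endpoints))
```

Because the tree is fit against the target, the resulting buckets tend to be predictive by construction, which is exactly the supervised advantage (and the data-leakage risk) noted above.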