
Machine Learning Resources

How are categorical features or qualitative predictors represented in a machine learning model?


To incorporate categorical features, also known as qualitative predictors, into a machine learning model, they must first be converted into numeric form, since most statistical and machine learning algorithms can only process numeric data. This matters because many datasets include categorical variables, such as loan types (e.g., New Auto, Used Auto, Signature), gender (Male, Female), t-shirt sizes (Small, Medium, Large), or geographic locations (states or countries), that are inherently non-numeric. Transforming categorical data into numbers allows these algorithms to perform calculations and make predictions.

In this article, we begin with a concise overview of various types of categorical features. Following that, we will delve into different encoding techniques used to handle categorical values in a machine learning model.

Types of Categorical Features

There are two types of categorical features:

  1. Nominal features: features with no inherent ordering between values, for example gender or color. These can be represented using one-hot encoding or dummy encoding. Their primary disadvantage is the explosion of the feature set, especially when the number of unique values is large.
  2. Ordinal features: categorical features that have an inherent ordering, for example the size of a t-shirt (Small/Medium/Large). These can be represented using ordinal encoding. Their disadvantage is that the difference between two representative integers might not faithfully reflect the ordinal difference between the raw categories.
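The distinction between the two types can be made explicit in pandas, whose `Categorical` type records whether the categories are ordered. A minimal sketch with made-up values:

```python
import pandas as pd

# Nominal: no ordering among the categories
color = pd.Categorical(["Red", "Blue", "Red"])

# Ordinal: ordered=True records Small < Medium < Large
size = pd.Categorical(
    ["Small", "Large", "Medium"],
    categories=["Small", "Medium", "Large"],
    ordered=True,
)

print(color.ordered)  # False
print(size.min())     # "Small" -- comparisons are defined only for ordered categoricals
```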

How to handle categorical features in a machine learning model?

There are three common approaches to deal with categorical or qualitative predictors: (a) Dummy encoding, (b) One-hot encoding, and (c) Ordinal encoding.

Dummy Encoding

The classical approach to dealing with nominal categorical (qualitative) predictors is dummy encoding, which numerically represents the different levels of a feature using binary 1s and 0s. If a predictor has k categories, only k-1 dummy variables are needed to uniquely represent it in the model: one level is implied when all of the dummy variables are 0, so using one fewer dummy variable than the number of unique categories avoids redundancy.

The table below depicts dummy encoding using an Auto Loan example with three loan types: New Auto, Used Auto, and Signature. Dummy encoding creates two new dummy variables; the third category ('Signature' in this example) is inferred when both dummy variables are 0. This omitted category is also called the reference level.

| LoanID | Loan Type (Original) | New Auto (Dummy variable) | Used Auto (Dummy variable) | Explanation |
|--------|----------------------|---------------------------|---------------------------|-------------|
| L1 | New Auto | 1 | 0 | The value of dummy variable New Auto is 1 and that of Used Auto is 0 |
| L2 | Used Auto | 0 | 1 | The value of dummy variable New Auto is 0 and that of Used Auto is 1 |
| L3 | Used Auto | 0 | 1 | Same as L2 |
| L4 | Signature | 0 | 0 | The value of both dummy variables New Auto and Used Auto is 0 |
| L5 | New Auto | 1 | 0 | Same as L1 |

An example of Dummy encoding
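As a sketch of the table above using pandas (the data frame here is hypothetical, mirroring the five loans shown), encoding all three categories and then dropping the 'Signature' column leaves the k-1 = 2 dummy variables, with 'Signature' as the reference level:

```python
import pandas as pd

# Hypothetical data frame mirroring the five loans in the table above
loans = pd.DataFrame({
    "LoanID": ["L1", "L2", "L3", "L4", "L5"],
    "LoanType": ["New Auto", "Used Auto", "Used Auto", "Signature", "New Auto"],
})

# Encode all three categories, then drop "Signature" so it becomes the
# reference level: it is implied when both remaining columns are 0
dummies = pd.get_dummies(loans["LoanType"], dtype=int).drop(columns=["Signature"])
encoded = pd.concat([loans["LoanID"], dummies], axis=1)
print(encoded)
```

The L4 row (Signature) comes out as 0 in both remaining columns, exactly as in the table.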

One-hot Encoding

One-hot encoding is a similar but slightly different technique from dummy encoding, also used for nominal categorical features. In this approach, assuming the same scenario of a predictor with k categories, k different binary columns are created, where each takes on the value of 1 if the original value of that observation belongs to that particular category and 0 otherwise. For the same dataset, one-hot encoding would transform the loan type variable as follows:

| LoanID | Loan Type (Original) | New Auto (One-hot variable) | Used Auto (One-hot variable) | Signature (One-hot variable) | Explanation |
|--------|----------------------|-----------------------------|------------------------------|------------------------------|-------------|
| L1 | New Auto | 1 | 0 | 0 | The value of one-hot variable New Auto is 1 and others are 0 |
| L2 | Used Auto | 0 | 1 | 0 | The value of one-hot variable Used Auto is 1 and others are 0 |
| L3 | Used Auto | 0 | 1 | 0 | Same as L2 |
| L4 | Signature | 0 | 0 | 1 | The value of one-hot variable Signature is 1 and others are 0. In Dummy encoding, 'Signature' does not appear as a separate variable |
| L5 | New Auto | 1 | 0 | 0 | Same as L1 |

An example of One-hot encoding
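The same hypothetical data frame can be one-hot encoded by simply not dropping any column: every category gets its own binary variable, and exactly one column is 1 in each row.

```python
import pandas as pd

# Same hypothetical loans as in the dummy-encoding example
loans = pd.DataFrame({
    "LoanID": ["L1", "L2", "L3", "L4", "L5"],
    "LoanType": ["New Auto", "Used Auto", "Used Auto", "Signature", "New Auto"],
})

# Without dropping a reference level, all k categories get a column,
# and exactly one column is 1 in each row
onehot = pd.get_dummies(loans["LoanType"], dtype=int)
encoded = pd.concat([loans["LoanID"], onehot], axis=1)
print(encoded)
```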

In one-hot encoding, each category has a separate regression coefficient in the model. This is in contrast to dummy encoding, where only k-1 levels have coefficients representing the effect of that level. In either dummy or one-hot encoding, if an observation contains missing values for the original variable, it would be represented with a value of 0 for all columns created as a result of using either transformation technique, since a value of 1 only appears to indicate the presence of the original value.
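The all-zeros representation of missing values can be seen directly in pandas' default handling of NaN, sketched here with a small made-up series:

```python
import numpy as np
import pandas as pd

# A hypothetical series with one missing loan type
loans = pd.Series(["New Auto", np.nan, "Signature"], name="LoanType")

# By default, get_dummies ignores NaN: the missing observation
# ends up with 0 in every encoded column
encoded = pd.get_dummies(loans, dtype=int)
print(encoded)
```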

Ordinal Encoding

When dealing with ordinal categorical features, where there is a natural and consistently spaced ordering to the levels of a categorical variable, such as temperature recorded on a scale of low, medium, or high, it might make sense to map its values to the integers 1, 2, and 3, respectively. The transformation of such a variable would look like the following:

| ObservationID | Temperature (Original) | Temperature_coded (Ordinal variable) |
|---------------|------------------------|--------------------------------------|
| 1 | Low | 1 |
| 2 | Medium | 2 |
| 3 | High | 3 |

An example of Ordinal encoding

It should be clearly noted that if there is no intrinsic order to the variable's categories, this is not a viable approach. It can also be questionable when the original categories are spaced at uneven intervals. For example, if Low represented 0 degrees, Medium 10 degrees, and High 50 degrees, it might be difficult to preserve the practical meaning of that spacing after an ordinal transformation, possibly leading to information loss. If it is decided not to use ordinal encoding, dummy encoding would probably be more suitable than one-hot encoding, as the Low category naturally lends itself to being the reference level, and the order would still be preserved among the three levels.
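A minimal ordinal-encoding sketch, assuming the Low < Medium < High mapping described above (the data frame and values are made up for illustration):

```python
import pandas as pd

# Hypothetical temperature readings
temps = pd.DataFrame({"Temperature": ["Low", "High", "Medium", "Low"]})

# An explicit mapping makes the assumed order Low < Medium < High visible
order = {"Low": 1, "Medium": 2, "High": 3}
temps["Temperature_coded"] = temps["Temperature"].map(order)
print(temps)
```

Spelling the mapping out as a dictionary, rather than relying on an encoder's default category order, keeps the intended ordering under the modeler's control.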
