What are some options for clustering on categorical data? What if the dataset contains a combination of numeric and categorical features?

  • K-Modes: K-Modes is a modification of K-Means suitable for datasets with all categorical features that clusters based on matches/mismatches across the features of the observations rather than numerical distance. The algorithm performs cluster assignment and iterates in the same way as k-means, just utilizing a different measure of similarity. 
  • K-Medoids (PAM Clustering): This approach, which stands for Partitioning Around Medoids, accounts for mixed data types by using a different similarity measure for numeric versus categorical features. It uses a measure called the Gower Distance to compute the partial similarities based on data type. PAM clustering is more robust to outliers compared to K-Means but can be computationally expensive on large datasets.