- K-Means starts by selecting initial centroids for the k-clusters by randomly choosing k observations as the centroids. Different initialization approaches can result in a different clustering assignment, so the algorithm is often run several times, where the iteration that produces the most compact clusters is ultimately chosen. The researcher must decide on what value to use for k ahead of time, but similar to supervised machine learning algorithms, it can be thought of as a hyperparameter to be tuned.
- It then assigns each observation to the cluster of its nearest centroid based on a multivariate distance measure, such as Mahalanobis Distance. Because the algorithm is distance-based, all of the features should be scaled to a similar range as part of the data preprocessing.
- Next, it recalculates the centroids of each cluster by moving it to the centroid of all observations currently belonging to that cluster.
- After all of the assignments are made, it re-assigns any observations to a different cluster that would result in more similar clusters than on the previous step.
- It continues in this iterative manner until no more reassignments result in further improvement of the clustering.