What are some common distance metrics that can be used in clustering?

  • Euclidean Distance: The Euclidean Distance, or L2 norm, is the most common distance metric used in clustering. It measures the straight-line, or shortest, distance between observations in p-dimensional space. In two dimensions, the formula reduces to the familiar Pythagorean theorem. For two observations x1 and x2, the Euclidean Distance is computed as follows, where the superscript denotes the dimension from 1 to p: $d(x_1, x_2) = \sqrt{\sum_{j=1}^{p} \left(x_1^{(j)} - x_2^{(j)}\right)^2}$
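  • Mahalanobis Distance: The Mahalanobis Distance is a generalization of the Euclidean Distance that accounts for the variance of each dimension and the correlation between dimensions. If the features are uncorrelated and each has unit variance, so that the covariance matrix is the identity, the Mahalanobis Distance is equivalent to the Euclidean Distance; if the features are only uncorrelated, it equals the Euclidean Distance computed on variance-standardized features. The Mahalanobis Distance is frequently used in multivariate outlier detection, as it measures the distance of each observation from the center of the distribution using its mean and covariance structure. For two vectors x and x’ and covariance matrix Σ, the Mahalanobis Distance is given by $d(x, x') = \sqrt{(x - x')^{T} \Sigma^{-1} (x - x')}$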
  • Manhattan Distance: The Manhattan Distance, or L1 norm, is the sum of the absolute differences between the coordinates of two vectors. It measures distance along a grid-like path rather than as the crow flies. It is often argued that the Manhattan Distance is preferable to the Euclidean Distance as the dimension of the data increases, since the latter suffers more from the curse of dimensionality.
  • Minkowski Distance: The Minkowski Distance is a general form for computing distances using an Lp norm, and it is a true distance metric for p ≥ 1. If p=1, it reduces to the Manhattan Distance, and if p=2, it becomes the Euclidean Distance.
  • Cosine Similarity: Cosine Similarity measures similarity using the cosine of the angle between two vectors in p-dimensional space. For two vectors x1 and x2, it is the dot product of the two vectors divided by the product of their magnitudes. A common application of Cosine Similarity is measuring document similarity in natural language processing.
  • Jaccard Index/Distance: The Jaccard Index measures the similarity of two sets by computing the ratio of the number of items present in both sets (the intersection) to the total number of distinct items present in either set (the union). Because a larger Jaccard Index indicates more similar sets, it can be converted to a distance metric by subtracting the index from 1. It is also commonly used to measure text similarity, as documents can be decomposed into sets based on the words they contain. For two sets X1 and X2, the Jaccard Distance is given by $d_J(X_1, X_2) = 1 - \frac{|X_1 \cap X_2|}{|X_1 \cup X_2|}$
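
To make the definitions above concrete, here is a minimal sketch of the metrics in plain Python. The function names (minkowski, cosine_similarity, jaccard_distance, mahalanobis) are illustrative choices, not taken from any library, and mahalanobis assumes the caller passes an already inverted covariance matrix as a nested list. Zero-length vectors are not handled in cosine_similarity.

```python
import math


def minkowski(x1, x2, p):
    """L_p distance between two equal-length vectors.

    p=1 gives the Manhattan Distance, p=2 the Euclidean Distance.
    """
    return sum(abs(a - b) ** p for a, b in zip(x1, x2)) ** (1.0 / p)


def cosine_similarity(x1, x2):
    """Dot product divided by the product of the vector magnitudes."""
    dot = sum(a * b for a, b in zip(x1, x2))
    norm1 = math.sqrt(sum(a * a for a in x1))
    norm2 = math.sqrt(sum(b * b for b in x2))
    return dot / (norm1 * norm2)


def jaccard_distance(s1, s2):
    """1 minus the ratio of intersection size to union size."""
    s1, s2 = set(s1), set(s2)
    union = s1 | s2
    if not union:
        # Assumption: two empty sets are treated as identical.
        return 0.0
    return 1.0 - len(s1 & s2) / len(union)


def mahalanobis(x, x_prime, inv_cov):
    """sqrt((x - x')^T * inv_cov * (x - x')).

    inv_cov is assumed to be the inverse covariance matrix,
    supplied as a list of rows.
    """
    d = [a - b for a, b in zip(x, x_prime)]
    n = len(d)
    quad = sum(d[i] * inv_cov[i][j] * d[j] for i in range(n) for j in range(n))
    return math.sqrt(quad)


a, b = [0.0, 0.0], [3.0, 4.0]
euclidean = minkowski(a, b, 2)   # 5.0, the 3-4-5 right triangle
manhattan = minkowski(a, b, 1)   # 7.0, the grid path 3 + 4

# With an identity covariance matrix, Mahalanobis matches Euclidean.
identity = [[1.0, 0.0], [0.0, 1.0]]
maha_identity = mahalanobis(a, b, identity)

# Documents reduced to word sets: 2 shared words out of 4 distinct words.
doc1 = {"the", "cat", "sat"}
doc2 = {"the", "cat", "ran"}
jaccard = jaccard_distance(doc1, doc2)  # 0.5
```

In practice, a numerical library would normally be used for these computations; the pure-Python version is only meant to mirror the formulas term by term.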