What are some automatic outlier detection mechanisms?
Isolation Forest: Isolation Forest is an anomaly detection method that, like Random Forest, builds an ensemble of trees. Unlike Random Forest, each tree splits on a randomly chosen feature at a randomly chosen value instead of optimizing a split criterion. It assigns an anomaly score between 0 and 1 to each observation, where values close to 1 indicate the point is likely an outlier and values well below 0.5 indicate it is unlikely to be an anomaly. At a high level, the intuition is that outliers are few and lie far from the bulk of the data, so random splits tend to isolate them after only a few partitions. They therefore end up in nodes with a shorter path from the root node, and the score is derived from the average path length across all trees.
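As a minimal sketch of the idea above, the snippet below uses scikit-learn's IsolationForest on synthetic data. The data, the planted point at (8, 8), and the parameter values are assumptions chosen for illustration. Note that scikit-learn's score_samples returns the negative of the 0-to-1 score described above, so the sketch flips the sign.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data (an assumption for illustration):
# 200 ordinary points plus one planted point far from the rest.
rng = np.random.default_rng(0)
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X = np.vstack([inliers, [[8.0, 8.0]]])

forest = IsolationForest(n_estimators=100, random_state=0)
labels = forest.fit_predict(X)  # -1 = flagged as outlier, 1 = inlier

# score_samples is the negated anomaly score, so negating it
# gives a score in (0, 1] where higher means more anomalous.
scores = -forest.score_samples(X)

print("label of the planted point:", labels[-1])
print("score of the planted point:", round(float(scores[-1]), 3))
print("median score of the rest:  ", round(float(np.median(scores[:-1])), 3))
```

With the default contamination setting, a few ordinary points near the edge of the cloud may also be flagged, so in practice the threshold is usually tuned to the expected share of outliers.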
Local Outlier Factor: This algorithm uses the K nearest neighbors of an observation to quantify how isolated it is compared to points in the same local region of data. At a high level, it compares the local density of an observation to that of its neighbors: a score near 1 means the point sits in a region about as dense as its neighbors do, while a score well above 1 means it sits in a noticeably sparser region. As with Isolation Forest, lower values indicate the observation is less likely to be an outlier. It requires that the number of neighbors be provided as a hyperparameter. An advantage of this method is that it identifies outliers from a local perspective of the data: an observation that does not look like an outlier from a global view can still be detected if it is far enough from the dense region its neighbors occupy. Whether it is of interest to detect outliers in this fashion depends on the context.
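The local perspective can be sketched with scikit-learn's LocalOutlierFactor. The two clusters, the planted point at (1, 1), and n_neighbors=20 are assumptions made for illustration. The planted point is only about 1.4 units from the tight cluster, a gap that would be unremarkable at the scale of the wide cluster, yet it is many times the spacing of its nearest neighbors.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Synthetic data (an assumption for illustration):
# a tight cluster, a wide cluster, and one point just outside the tight one.
rng = np.random.default_rng(0)
dense = rng.normal(loc=0.0, scale=0.1, size=(100, 2))
sparse = rng.normal(loc=10.0, scale=2.0, size=(100, 2))
local_outlier = np.array([[1.0, 1.0]])
X = np.vstack([dense, sparse, local_outlier])

# n_neighbors is the hyperparameter mentioned above.
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)  # -1 = flagged as outlier, 1 = inlier

# scikit-learn stores the negated factor; flipping the sign gives
# the usual LOF, where values near 1 are typical and larger is more outlying.
lof_scores = -lof.negative_outlier_factor_

print("label of the planted point:", labels[-1])
print("LOF of the planted point:  ", round(float(lof_scores[-1]), 2))
print("median LOF of all points:  ", round(float(np.median(lof_scores)), 2))
```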
Mahalanobis Distance: Mahalanobis distance measures how far each observation lies from the center (mean) of the data distribution, scaled by the covariance of the features, so it accounts for differing variances and for correlations between variables. Observations with large distances are candidates to be flagged as outliers.
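A minimal NumPy sketch of this calculation follows. The correlated synthetic data, the planted point at (1, -1), and the 0.999 cutoff are assumptions for illustration. The cutoff relies on the fact that, for roughly multivariate normal data, the squared Mahalanobis distance approximately follows a chi-square distribution with as many degrees of freedom as there are features. With two features that quantile has the closed form -2 ln(1 - p); for other dimensions a chi-square quantile function such as scipy.stats.chi2.ppf would be needed.

```python
import math
import numpy as np

# Synthetic data (an assumption for illustration): two strongly correlated
# features, plus one point that is ordinary on each axis but breaks the correlation.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = x + 0.1 * rng.normal(size=500)
X = np.vstack([np.column_stack([x, y]), [[1.0, -1.0]]])

# Estimate the center and covariance, then invert the covariance.
mean = X.mean(axis=0)
cov = np.cov(X, rowvar=False)
inv_cov = np.linalg.inv(cov)

# d_i = sqrt((x_i - mean)^T  cov^{-1}  (x_i - mean)) for every row i.
diff = X - mean
d = np.sqrt(np.einsum("ij,jk,ik->i", diff, inv_cov, diff))

# Cutoff for 2 features only: chi-square(2) quantile is -2 * ln(1 - p).
p = 0.999
cutoff = math.sqrt(-2.0 * math.log(1.0 - p))
flagged = np.where(d > cutoff)[0]

print("distance of the planted point:", round(float(d[-1]), 2))
print("cutoff:", round(cutoff, 3))
print("indices flagged:", flagged.tolist())
```

The planted point would not stand out on either feature alone, which is why a covariance-aware distance is useful here. Because the mean and covariance are estimated from data that includes the outliers, heavily contaminated data can distort the distances.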