What are the subtypes of Cross Validation?

  • Validation data set (holdout): The full data set is randomly partitioned into a training set and a validation set; the former is used to fit the model, and the latter is held back for evaluation after the model has been fit. This method is most appropriate when the data set is large enough that a fully representative validation set can be carved out of it.
  • K-fold cross validation: The full data set is randomly partitioned into k sub-data-sets, or folds. The model is trained on k-1 folds and evaluated on the remaining fold, and the evaluation metric computed on that held-out fold is stored. The process repeats until each of the k folds has been reserved for evaluation exactly once; equivalently, each observation is used exactly once for evaluation and k-1 times for training. The overall model score is then the average of the evaluation metric across all k folds. An illustration of k-fold cross validation when k=5 (5 and 10 are common choices for k) is shown below.

    M = (M1 + M2 + M3 + M4 + M5) / 5, where Mi is the stored evaluation metric computed over the validation fold on the ith partition of the data. Depending on the size and complexity of the data set, a single iteration of k-fold cross validation may be enough to assess the model's performance. An extension called repeated k-fold cross validation, however, repeats the above process over multiple random shuffles of the data, which further reduces the variance of the resulting performance estimate.
  • Leave one out cross validation: This is a special case of k-fold cross validation in which each fold contains exactly one observation, so in a data set of m observations the model is trained on m-1 observations and evaluated on the remaining mth record. Just as in k-fold cross validation, the process is repeated until each observation has been used for evaluation exactly once, and the average error metric is then computed across all m values. Because this requires training as many models as there are observations, it can be quite computationally expensive on large data sets.
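The three schemes above can be sketched in a few lines of plain Python. The helper names here (holdout_split, k_fold_indices, cross_val_score) and the toy mean-predicting "model" are illustrative assumptions rather than any standard API; in practice a library such as scikit-learn provides ready-made equivalents.

```python
import random

def holdout_split(data, validation_fraction=0.2, seed=0):
    """Randomly partition data into a training set and a validation set."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - validation_fraction))
    return shuffled[:cut], shuffled[cut:]

def k_fold_indices(m, k, seed=0):
    """Yield (train_indices, val_indices) pairs for k-fold cross validation.

    Each of the m observations lands in exactly one validation fold, so it is
    evaluated once and trained on k-1 times.
    """
    idx = list(range(m))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, val

def cross_val_score(model_score, data, k, seed=0):
    """Average the per-fold metric Mi over the k folds: M = sum(Mi) / k."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(data), k, seed):
        train = [data[j] for j in train_idx]
        val = [data[j] for j in val_idx]
        scores.append(model_score(train, val))
    return sum(scores) / k

# Toy "model": predict the mean of the training targets, score by MSE on
# the validation fold. Any real fit-and-evaluate routine slots in here.
def mean_model_mse(train, val):
    mean_y = sum(y for _, y in train) / len(train)
    return sum((y - mean_y) ** 2 for _, y in val) / len(val)

data = [(x, 2 * x + 1) for x in range(20)]        # toy (x, y) pairs
score_5fold = cross_val_score(mean_model_mse, data, k=5)
score_loo = cross_val_score(mean_model_mse, data, k=len(data))  # LOOCV: k = m
```

Note that leave-one-out needs no separate implementation: setting k equal to the number of observations makes every fold a single record, which is exactly the special case described above.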