Binary occurrence features work well when the vocabulary size is small and the presence or absence of given words is a strong predictive signal
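As a minimal sketch of the idea, a binary representation records only whether each vocabulary word occurs in a document, not how often (the vocabulary and tokenizer here are illustrative assumptions):

```python
def binary_vector(document, vocabulary):
    """Mark each vocabulary word 1 if it appears in the document, else 0."""
    tokens = set(document.lower().split())  # deliberately simple tokenizer
    return {word: int(word in tokens) for word in vocabulary}

vocab = ["free", "winner", "meeting"]  # hypothetical small vocabulary
binary_vector("You are a WINNER", vocab)
# → {'free': 0, 'winner': 1, 'meeting': 0}
```

Repeated occurrences of a word contribute nothing extra here, which is exactly why this form suits tasks where mere presence is the signal.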
If the documents in the corpus vary in size, larger documents are likely to have higher raw word counts, which can dominate the representation unless the counts are normalized
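One common remedy, sketched below with the standard library only, is to divide each count by the document length so documents of different sizes become comparable (a simple term-frequency normalization; TF-IDF weighting is a further refinement not shown here):

```python
from collections import Counter

def term_frequencies(document):
    # Relative frequencies remove the advantage longer documents
    # get from raw counts: each vector sums to 1 regardless of length.
    tokens = document.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}

term_frequencies("a a b c")
# → {'a': 0.5, 'b': 0.25, 'c': 0.25}
```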
The generic list of English stop words may not be appropriate if the set of documents is all related to a specific domain; a word that carries little information in that domain (for example, a term that appears in nearly every document) may not be on the generic list.
Stop words are common words that appear often throughout a set of documents but add little information
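A minimal sketch of stop-word filtering, assuming a tiny hand-picked stop list (real lists, such as the ones shipped with NLTK or scikit-learn, are much longer):

```python
STOP_WORDS = {"the", "a", "an", "is", "of", "and"}  # tiny illustrative list

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    # Drop any token found in the stop list, comparing case-insensitively.
    return [t for t in tokens if t.lower() not in stop_words]

remove_stop_words(["The", "model", "is", "accurate"])
# → ['model', 'accurate']
```

For a domain-specific corpus, the same function works unchanged once the stop list is extended with the domain's own uninformative terms.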
Lemmatization is the process of reducing tokens in a document to their root form based on the context in which they appear.
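The toy sketch below uses a hand-written lookup table purely to illustrate the input/output behavior; real lemmatizers (e.g. NLTK's WordNetLemmatizer or spaCy) rely on dictionaries and part-of-speech context rather than a fixed mapping:

```python
# Hypothetical lookup table standing in for a real lemmatizer's dictionary.
LEMMAS = {"running": "run", "ran": "run", "better": "good", "documents": "document"}

def lemmatize(tokens):
    # Map each token to its root form when known; pass it through otherwise.
    return [LEMMAS.get(t.lower(), t.lower()) for t in tokens]

lemmatize(["Running", "documents"])
# → ['run', 'document']
```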
Similar to other preprocessing techniques, it is considered best practice to fit the vectorizer on the training dataset only, then use the fitted vectorizer to transform both the training and test sets
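The fit/transform split can be sketched without any library: the vocabulary is learned from the training documents alone, and transforming a new document silently drops words that were never seen during fitting (function names here are illustrative, not a specific library's API):

```python
def fit_vocabulary(train_documents):
    # Learn the vocabulary from the training set only.
    vocab = sorted({t for doc in train_documents for t in doc.lower().split()})
    return {word: index for index, word in enumerate(vocab)}

def transform(document, vocabulary):
    # Words unseen during fitting are ignored, mirroring how a
    # fitted vectorizer handles out-of-vocabulary tokens at test time.
    vector = [0] * len(vocabulary)
    for token in document.lower().split():
        if token in vocabulary:
            vector[vocabulary[token]] += 1
    return vector

vocab = fit_vocabulary(["cats sit", "dogs run"])   # vocabulary: cats, dogs, run, sit
transform("cats run fast", vocab)                  # "fast" is out-of-vocabulary
# → [1, 0, 1, 0]
```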
Advantages: Can provide useful information beyond just considering individual tokens
An n-gram model builds upon the bag-of-words approach by considering sequences of n consecutive tokens rather than single tokens
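Extracting n-grams amounts to sliding a window of length n over the token sequence, as in this short sketch:

```python
def ngrams(tokens, n):
    # Slide a window of length n across the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

ngrams(["new", "york", "city"], 2)
# → [('new', 'york'), ('york', 'city')]
```

Counting these tuples instead of single tokens preserves local word order ("new york" vs. "york new"), which is the information a plain bag-of-words model discards.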
Advantages: Often performs with a high level of accuracy on tasks where the frequency or occurrence of words is a predictive feature
A Bag-of-Words model is a class of models used for text mining tasks that represents a document by the frequency of its words, discarding grammar and word order
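A minimal bag-of-words representation can be built with the standard library alone (the whitespace tokenizer here is a deliberate simplification):

```python
from collections import Counter

def bag_of_words(document):
    # Lowercase and split on whitespace: a deliberately simple tokenizer.
    tokens = document.lower().split()
    # Only token frequencies are kept; word order is discarded.
    return Counter(tokens)

bag_of_words("the cat sat on the mat")
# → Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
```

Library implementations such as scikit-learn's CountVectorizer follow the same idea, producing one such count vector per document over a shared vocabulary.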