What are the advantages and disadvantages of the Bag-of-Words model?

Advantages:

  • Often achieves high accuracy on tasks where the frequency or occurrence of words is a predictive feature (e.g. spam detection, topic classification)
  • Easy to implement (scikit-learn provides APIs for count vectorization and TF-IDF: CountVectorizer and TfidfVectorizer)
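
As a rough illustration of the ease-of-implementation point, here is a minimal sketch using scikit-learn's CountVectorizer and TfidfVectorizer. The two-sentence corpus is made up for the example, and default settings are assumed (lowercasing, tokens of two or more characters).

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (hypothetical, for illustration only).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Raw term counts: one row per document, one column per vocabulary term.
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(docs)  # sparse document-term matrix

# vocabulary_ maps each term to its column index in the matrix.
the_col = count_vectorizer.vocabulary_["the"]
the_counts = counts[:, the_col].toarray().ravel().tolist()

# TF-IDF starts from the same counts but down-weights terms that
# appear in many documents relative to rarer ones.
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(docs)

print(counts.shape)  # (documents, vocabulary size)
print(the_counts)    # occurrences of "the" in each document
print(tfidf.shape)
```

Both vectorizers return sparse matrices, so the result can be passed directly to most scikit-learn classifiers.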

Disadvantages:

  • If word order matters to the task, the Bag-of-Words approach will likely not be suitable (e.g. text generation, chatbots)
  • With a large vocabulary (high-dimensional data), computation becomes expensive and the resulting sparse vectors become harder to tell apart
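
Both disadvantages can be seen in a small pure-Python sketch. The helper `bag_of_words`, the example sentences, and the padded vocabulary are hypothetical and chosen only for illustration; whitespace tokenization is assumed.

```python
from collections import Counter


def bag_of_words(text, vocabulary):
    """Return a count vector for `text` over a fixed, ordered vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


# Two sentences with opposite meanings but identical words.
a = "the dog bit the man"
b = "the man bit the dog"
vocabulary = sorted(set(a.split()) | set(b.split()))

vec_a = bag_of_words(a, vocabulary)
vec_b = bag_of_words(b, vocabulary)
print(vec_a == vec_b)  # True: word order is lost

# Each vector has one entry per vocabulary term, so a large vocabulary
# produces long vectors that are almost entirely zeros.
large_vocabulary = vocabulary + [f"word{i}" for i in range(10_000)]
vec_large = bag_of_words(a, large_vocabulary)
nonzero = sum(1 for count in vec_large if count)
print(len(vec_large), nonzero)
```

Because every short document shares the same mostly-zero pattern, distances between such vectors carry less information as the vocabulary grows, which is why sparse storage and vocabulary pruning are commonly used alongside this model.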