The key advantages and disadvantages of the bag of words model are summarized below. For an introduction to the model, see the “What is Bag of Words model” question.
Advantages of Bag of Words Model
- Simplicity and explainability: The bag of words model is a simple representation of text data that is easy to understand and interpret.
- Ease of implementation: It requires only minimal preprocessing (text cleaning and tokenization), so it is quick and easy to implement.
- Sparsity: The bag of words model is sparse, meaning that most of the entries in the feature vector are zero. This makes it efficient to store and process large amounts of text data.
- Scalability: Because the representation is sparse and easy to compute, it scales well to a large number of documents in terms of both storage and compute.
- Generalizability: The bag of words model can be applied to a wide range of NLP tasks, including text classification, information retrieval, clustering, and document similarity. It is often used as the input representation for more complex NLP models across different applications.
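As a rough illustration of the simplicity and sparsity points above, here is a minimal sketch in plain Python. The function names (`tokenize`, `build_vocabulary`, `to_sparse_bow`) and the toy corpus are assumptions made for this example, not part of any particular library. Each document is stored as a dictionary holding only its non-zero counts.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and extract word tokens (punctuation is dropped)."""
    return re.findall(r"[a-z']+", text.lower())

def build_vocabulary(documents):
    """Map each unique word in the corpus to a column index."""
    words = sorted({w for doc in documents for w in tokenize(doc)})
    return {word: index for index, word in enumerate(words)}

def to_sparse_bow(text, vocabulary):
    """Return {column index: count}, keeping only the non-zero entries."""
    counts = Counter(tokenize(text))
    return {vocabulary[w]: c for w, c in counts.items() if w in vocabulary}

# Toy corpus, invented for illustration
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "birds sing in the morning",
]

vocab = build_vocabulary(corpus)
vectors = [to_sparse_bow(doc, vocab) for doc in corpus]

for doc, vec in zip(corpus, vectors):
    print(f"{doc!r}: {len(vec)} non-zero entries out of {len(vocab)}")
```

Storing only the non-zero entries is one common way to exploit sparsity; libraries typically use dedicated sparse matrix formats for the same purpose.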
Disadvantages of Bag of Words Model
- Insensitivity to word order: The bag of words model treats all occurrences of a word as equivalent, regardless of the order in which they appear in a sentence. This means that it cannot capture the relationships between words in a sentence and the meaning they convey.
For example, the bag of words (BoW) representation is the same for two sentences with completely different meanings, such as “the dog bit the man” and “the man bit the dog”.
- Insensitivity to grammatical structure: The bag of words model ignores punctuation and grammatical structure, so sentences with different meanings can be represented by the same BoW vector.
For example, “Let's eat, Grandma” and “Let's eat Grandma” have very different meanings, yet both are represented by the same BoW vector, because commas are dropped during text preparation.
- Limited semantic information: The bag of words model captures only whether, or how often, a word occurs in a document, not the meaning or context in which it appears.
For example, in a sentence such as “The pitcher drank from a pitcher of water”, the word “pitcher” means two very different things, yet both occurrences are represented by the same feature in the BoW representation.
- High dimensionality: If a corpus contains a large number of unique words, the bag of words representation has a correspondingly high dimensionality, which can lead to overfitting because of the curse of dimensionality.
- Ignoring new words: Because the length of a BoW vector equals the number of unique words in the corpus, introducing new words is difficult, as it requires recomputing the vectors for all documents. To avoid this, new words are either ignored or mapped to a special token called UNK.
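Several of the disadvantages above can be reproduced in a few lines. The sketch below uses plain Python, with sentences chosen here purely for illustration. The `bow` helper and the `<UNK>` spelling of the unknown-word token are assumptions for this example, not a standard API.

```python
import re
from collections import Counter

UNK = "<UNK>"  # assumed spelling of the unknown-word token

def tokenize(text):
    """Lowercase the text and extract word tokens; commas and periods are dropped."""
    return re.findall(r"[a-z']+", text.lower())

def bow(text, vocabulary=None):
    """Count words; if a fixed vocabulary is given, map unseen words to UNK."""
    tokens = tokenize(text)
    if vocabulary is not None:
        tokens = [t if t in vocabulary else UNK for t in tokens]
    return Counter(tokens)

# Word order: the same words in a different order give identical counts.
same_despite_order = bow("the dog bit the man") == bow("the man bit the dog")

# Grammatical structure: the comma changes the meaning but not the counts.
same_despite_comma = bow("Let's eat, Grandma") == bow("Let's eat Grandma")

# Semantics: both senses of "pitcher" are counted under one feature.
pitcher_counts = bow("The pitcher drank from a pitcher of water")

# New words: with a fixed vocabulary, an unseen word collapses into UNK.
vocabulary = {"the", "dog", "bit", "man"}
unseen = bow("the dog bit the postman", vocabulary)

print(same_despite_order, same_despite_comma)
print(pitcher_counts["pitcher"], unseen[UNK])
```

In each case the information that distinguishes the sentences (order, punctuation, word sense, or the identity of the new word) is lost before any downstream model sees the vector.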
Is ‘Sparsity’ an advantage or a disadvantage of the Bag of Words model?
People often say that ‘Sparsity’ is a disadvantage of the Bag of Words model. Strictly speaking, sparsity itself is not the disadvantage; the ‘High Dimensionality’ that goes hand in hand with it is. Sparsity is actually advantageous because it allows document vectors to be stored and processed efficiently.
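To make the distinction concrete, the following sketch (plain Python, with a toy corpus invented for this example) measures both quantities separately: the dimensionality is the vocabulary size, while the sparsity is the fraction of zero cells in the document-term matrix.

```python
import re

def tokenize(text):
    """Lowercase the text and extract word tokens."""
    return re.findall(r"[a-z']+", text.lower())

# Toy corpus, invented for illustration
corpus = [
    "stocks fell sharply on monday",
    "the team won the final match",
    "heavy rain flooded the old town",
]

vocabulary = sorted({w for doc in corpus for w in tokenize(doc)})
dimensionality = len(vocabulary)  # length of every BoW vector

# Cells a dense document-term matrix would have to store
dense_cells = len(corpus) * dimensionality

# Cells a sparse format stores: one per distinct word per document
nonzero_cells = sum(len(set(tokenize(doc))) for doc in corpus)

sparsity = 1 - nonzero_cells / dense_cells

print(f"dimensionality: {dimensionality}")
print(f"dense cells: {dense_cells}, non-zero cells: {nonzero_cells}")
print(f"sparsity: {sparsity:.2f}")
```

Adding documents on new topics grows the vocabulary, and with it the dimensionality, while each document still fills only a handful of cells. The cost for a learning algorithm comes from the number of features, whereas the storage cost with a sparse format tracks only the non-zero cells.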