Bag of Words (BoW) is a common Natural Language Processing (NLP) method for representing text documents of varying lengths as fixed-length vectors of word frequencies. These vectors ignore the grammatical structure of sentences and the order of words.
Machine learning models are mathematical models and therefore operate on numerical data, so we need a way to represent textual data as numbers. Bag of Words is one such representation, in which the length of each generated vector equals the vocabulary size of the corpus.
The process of converting text into a bag of words involves:
- Tokenization: Divide the text into smaller units called tokens, usually words or phrases.
- Counting word frequencies: Create a vocabulary of all the unique words in the text corpus, and count the number of times each word appears in each document.
- Encoding the data: Encode the text data as numerical values by creating a vector for each document, with each element of the vector representing the frequency count of a particular word in the document.
The resulting numerical representation of the text data, encoded as a vector of word frequencies, is known as a “bag of words” model. This compact representation is useful because it allows text data to be easily compared and processed using mathematical and statistical methods, making it a popular technique for text classification, clustering, information retrieval, and other NLP tasks.
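The three steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, with a deliberately naive tokenizer (lowercase plus whitespace splitting); real tokenizers also handle punctuation and other edge cases. The function name `bag_of_words` and the sample documents are illustrative choices, not part of any particular library.

```python
from collections import Counter

def bag_of_words(documents):
    """Convert a list of text documents into fixed-length frequency vectors."""
    # 1. Tokenization: lowercase and split on whitespace (a simplification).
    tokenized = [doc.lower().split() for doc in documents]

    # 2. Vocabulary: all unique words in the corpus, sorted so that every
    #    document vector shares the same word-to-index mapping.
    vocabulary = sorted({word for doc in tokenized for word in doc})

    # 3. Encoding: one vector per document, each element holding the
    #    frequency count of the corresponding vocabulary word.
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

docs = ["the cat sat", "the cat ate the fish"]
vocab, vecs = bag_of_words(docs)
print(vocab)  # ['ate', 'cat', 'fish', 'sat', 'the']
print(vecs)   # [[0, 1, 0, 1, 1], [1, 1, 1, 0, 2]]
```

Note how every vector has length 5 (the vocabulary size) regardless of document length, and how "the" appearing twice in the second document yields a count of 2. In practice you would typically use a library implementation such as scikit-learn's `CountVectorizer`, which follows the same steps.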
In this video, Ritvik Kharkar does a great job explaining the Bag of Words (BoW) model using examples. Some notes:
- Only the first 4 minutes of the video cover the BoW model
- [Minor Correction in the video]: IDF stands for ‘Inverse Document Frequency’ and not ‘Inter Document Frequency’