AIML.com

What is meant by Corpus and Vocabulary in Natural Language Processing?

A corpus is the entire set of documents under consideration. What counts as a document in Natural Language Processing depends on the task: the text being analyzed could be entire journal articles or short movie reviews. Even a single sentence stored in a DataFrame can be treated as a document. The vocabulary is the set of all distinct words that appear anywhere in the corpus. For example, consider the following corpus:

  1. It is cold outside today.
  2. I love the beach.
  3. Pizza is for lunch today.

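The vocabulary would be {It, is, cold, outside, today, I, love, the, beach, Pizza, for, lunch}. The words "is" and "today" each occur in two documents but are counted only once, so the vocabulary contains 12 words.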
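
The same vocabulary can be built in a few lines of Python. The following is a minimal sketch, not a standard implementation: the helper names `tokenize` and `build_vocabulary` are hypothetical, and the tokenizer assumes that a word is any run of letters or apostrophes, so punctuation is dropped. Case is preserved so the result matches the set above.

```python
import re

# The three example documents that make up the corpus.
corpus = [
    "It is cold outside today.",
    "I love the beach.",
    "Pizza is for lunch today.",
]


def tokenize(document):
    # Assumption: a word is a run of letters or apostrophes; punctuation is dropped.
    return re.findall(r"[A-Za-z']+", document)


def build_vocabulary(documents):
    # Collect each distinct word once, in order of first appearance.
    vocabulary = []
    seen = set()
    for document in documents:
        for token in tokenize(document):
            if token not in seen:
                seen.add(token)
                vocabulary.append(token)
    return vocabulary


vocabulary = build_vocabulary(corpus)
print(vocabulary)
print(len(vocabulary))  # 12
```

In practice, words are often lowercased before the vocabulary is built, so that "Pizza" and "pizza" count as the same word. That step is omitted here to stay consistent with the example.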
