A corpus is the entire set of documents under consideration. What counts as a document in Natural Language Processing depends on the context: the text being analyzed could be entire journal articles or short movie reviews. Even a single sentence stored in a DataFrame can be considered a document. The vocabulary is the set of all distinct words that appear anywhere in the corpus. For example, consider the following corpus of three documents:
- It is cold outside today.
- I love the beach.
- Pizza is for lunch today.
The vocabulary would be {It, is, cold, outside, today, I, love, the, beach, Pizza, for, lunch}. Each word is listed only once, even though "is" and "today" each appear in two documents.
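As a rough illustration, the vocabulary above could be built with a few lines of Python. This is a minimal sketch, not a standard tokenizer: the helper name `build_vocabulary` is hypothetical, and it assumes that words are separated by whitespace, that surrounding punctuation is stripped, and that case is preserved (so "It" and "it" would count as different words).

```python
import string

# The three example documents from the text.
corpus = [
    "It is cold outside today.",
    "I love the beach.",
    "Pizza is for lunch today.",
]


def build_vocabulary(documents):
    """Return the distinct words across all documents, in order of first appearance.

    Assumptions (illustrative only): whitespace tokenization, leading and
    trailing punctuation stripped, case preserved.
    """
    vocabulary = []
    seen = set()
    for document in documents:
        for token in document.split():
            word = token.strip(string.punctuation)
            if word and word not in seen:
                seen.add(word)
                vocabulary.append(word)
    return vocabulary


vocabulary = build_vocabulary(corpus)
print(vocabulary)
print(len(vocabulary))  # 12 distinct words
```

In practice the text is often lowercased and tokenized more carefully before the vocabulary is built, which would change the result; the choices here were made only to reproduce the set listed above.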