What is Lemmatization?

Lemmatization is the process of reducing tokens in a document to their root form based on the context in which they appear. For example, “run”, “ran”, and “running” might appear separately within the document, but for all practical purposes, they should be treated as the same token “run”. A simpler form of Lemmatization is Stemming, which does something similar in terms of reducing words to their base form, but it just chops off the end of words using a set of heuristics rather than actually using language-based rules. Thus, Lemmatization is considered a more sophisticated approach than Stemming. This process of reducing words to their base forms is beneficial for reducing the dimensionality of the vocabulary, as for all practical purposes, there is no reason for two tokens like “run” and “running” to be considered separately. For Lemmatization to work effectively, part of speech (POS) tagging should be performed prior to lemmatizing the text, as in some cases, a word could have a completely different base form depending on if it is appearing in a sentence as a noun versus a verb, for instance.