Machine Learning Resources

What happens to new words that appear in the test dataset but are not present in the training data?

As with other preprocessing steps, best practice is to fit the vectorizer on the training set only and then transform the test set using the vocabulary learned from the training data. A word that appears in the test set but was not seen when the vectorizer was fit is simply ignored, because it has no entry in the learned vocabulary. One workaround is to map the rarest training tokens to a single umbrella token and to map any unseen test word to that same token, much like creating an “Other” category when binning or discretizing. Ideally, the train/test split is made so that this situation does not arise in the first place.
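The behavior described above can be illustrated with a minimal pure-Python sketch of a bag-of-words vectorizer. This is not any particular library's implementation: the function names, the `<unk>` token name, the `min_count` threshold, and the toy documents are all assumptions made for illustration. It shows that an unseen test word contributes nothing to the feature vector by default, and that it lands in an umbrella column once rare tokens are folded into one.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical umbrella token; the name is an assumption


def tokenize(text):
    # Naive whitespace tokenizer, enough for this illustration.
    return text.lower().split()


def fit_vocabulary(train_docs, min_count=1, use_unk=False):
    """Learn a token -> column index mapping from the training documents only."""
    counts = Counter(tok for doc in train_docs for tok in tokenize(doc))
    # Tokens seen fewer than min_count times are left out of the vocabulary.
    kept = sorted(tok for tok, c in counts.items() if c >= min_count)
    if use_unk:
        kept.append(UNK)  # one extra column shared by all rare/unseen tokens
    return {tok: i for i, tok in enumerate(kept)}


def transform(docs, vocab):
    """Count tokens per document using a vocabulary that is already fitted."""
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for tok in tokenize(doc):
            if tok in vocab:
                row[vocab[tok]] += 1
            elif UNK in vocab:
                row[vocab[UNK]] += 1  # rare or unseen token goes to the umbrella column
            # Otherwise the token is silently dropped.
        rows.append(row)
    return rows


train_docs = ["the cat sat", "the dog sat", "the cat ran"]
test_docs = ["the zebra sat"]  # "zebra" never appears in the training data

# Default behavior: the unseen word is ignored.
plain_vocab = fit_vocabulary(train_docs)
plain_test = transform(test_docs, plain_vocab)

# Workaround: tokens seen fewer than twice in training share the umbrella column,
# and unseen test words are counted there too.
unk_vocab = fit_vocabulary(train_docs, min_count=2, use_unk=True)
unk_train = transform(train_docs, unk_vocab)
unk_test = transform(test_docs, unk_vocab)
```

Folding rare training tokens into the umbrella column matters because it gives that column non-zero counts during training, so a model has something to learn from before it meets unseen words at test time.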
