AIML.com

Machine Learning Resources

What is tokenization?

Tokenization is the process of splitting the text of a document into smaller units called tokens. The most common starting point is to split on white space, which isolates individual words, and then to separate punctuation marks from the words they are attached to. Some applications instead use character-based tokenization, which splits each word into its individual characters. Words can be split further when they contain contractions (for example, "doesn't" can become "does" and "n't"), and numbers can be treated as separate tokens. Proper nouns that refer to specific entities can be kept as distinct tokens, since "apple" the fruit has a different meaning than "Apple" the company. Tokenization is usually considered the first preprocessing step in Natural Language Processing.
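
The variants described above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a production tokenizer: the function names (`whitespace_tokenize`, `word_tokenize`, `split_contractions`, `char_tokenize`) and the regular expressions are assumptions made for this example, and established libraries such as NLTK or spaCy handle many more edge cases.

```python
import re

def whitespace_tokenize(text):
    """Split on runs of white space; punctuation stays attached to words."""
    return text.split()

def word_tokenize(text):
    """Split into numbers, words (keeping internal apostrophes), and
    single punctuation marks."""
    pattern = r"\d+(?:\.\d+)?|\w+(?:'\w+)*|[^\w\s]"
    return re.findall(pattern, text)

def split_contractions(tokens):
    """Split tokens such as "doesn't" into "does" + "n't" and
    "it's" into "it" + "'s". Other tokens pass through unchanged."""
    result = []
    for token in tokens:
        match = re.fullmatch(r"(\w+?)(n't|'\w+)", token)
        if match:
            result.extend(match.groups())
        else:
            result.append(token)
    return result

def char_tokenize(word):
    """Character-based tokenization: one token per character."""
    return list(word)

text = "Apple doesn't sell apples."
print(whitespace_tokenize(text))
print(word_tokenize(text))
print(split_contractions(word_tokenize(text)))
print(char_tokenize("apple"))
```

Note that none of these functions lowercases the input, so "Apple" and "apples" remain distinguishable; deciding whether "Apple" refers to the company is left to a later step.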
