NLP - (Text|Token) Normalization - Case Insensitivity - lemmas


About

A normalization process removes insignificant differences between otherwise identical words to make for better searching.

Would you search for déjà vu, or just for deja vu?

Lemmas are normalized word forms.
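
For illustration, a lemmatizer such as NLTK's WordNetLemmatizer (one possible implementation among many, not the one this page prescribes) maps inflected forms back to their lemma:

  # Minimal sketch using NLTK's WordNet lemmatizer.
  # Requires: pip install nltk, then a one-time nltk.download("wordnet")
  from nltk.stem import WordNetLemmatizer

  lemmatizer = WordNetLemmatizer()
  print(lemmatizer.lemmatize("mice"))          # mouse (default pos is noun)
  print(lemmatizer.lemmatize("running", "v"))  # run   (pos "v" = verb)
  print(lemmatizer.lemmatize("better", "a"))   # good  (pos "a" = adjective)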

Operation List

  • Uppercase to lowercase: converting all characters to lower case so that a query matches regardless of capitalization.
  • Stripping accents, diacritics, and other character markings such as ´, ^, and ¨, so that a search for rôle also matches role, and vice versa. It makes esta, ésta, and está all searchable as the same word. A minimal sketch of both operations follows this list.
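
As a rough sketch of these two operations using only Python's standard unicodedata module (the normalize function name is illustrative, not a specific search engine's API):

  import unicodedata

  def normalize(text: str) -> str:
      """Lowercase the text and strip accents/diacritics."""
      # Uppercase to lowercase (casefold also handles cases such as German ß)
      text = text.casefold()
      # Decompose each character into its base character plus combining
      # marks (NFD), then drop the combining marks (Unicode category "Mn")
      decomposed = unicodedata.normalize("NFD", text)
      stripped = "".join(ch for ch in decomposed
                         if unicodedata.category(ch) != "Mn")
      # Recompose the remaining characters
      return unicodedata.normalize("NFC", stripped)

  print(normalize("Déjà Vu"))  # deja vu
  print(normalize("rôle"))     # role
  print(normalize("está"))     # esta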

Documentation / Reference




