Natural Language Processing - (Tokenization|Parser|Text Segmentation|Word Break rules|Text Analysis)


1 - About

Tokenization is the process of breaking input text into small indexing elements – tokens.

Parsing and Tokenization are often called Text Analysis, or simply Analysis, in NLP.

The tokens (or terms) are used either:

  • to build the index of those terms when a new document is added,
  • or, at query time, to identify which documents contain the terms being searched for.
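
As a concrete sketch, the Java snippet below (assuming Lucene's analysis API, the library this page uses as its reference) prints the tokens an analyzer produces for a piece of text; the same token stream feeds either the indexer or the query side.

  import java.io.IOException;

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

  public class TokenizeDemo {
      public static void main(String[] args) throws IOException {
          // StandardAnalyzer = UAX #29 word breaking + lowercasing (Lucene's default chain)
          try (Analyzer analyzer = new StandardAnalyzer();
               TokenStream stream = analyzer.tokenStream("body", "The Quick Brown Fox jumped over 2 lazy dogs.")) {
              CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
              stream.reset();                      // mandatory before the first incrementToken()
              while (stream.incrementToken()) {
                  System.out.println(term.toString());   // the, quick, brown, fox, ...
              }
              stream.end();                        // signal end-of-stream
          }
      }
  }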

2 - Tokenization

2.1 - Pre

Pre-tokenization analysis can include but is not limited to:

  • stripping HTML markup,
  • transforming or removing text matching arbitrary patterns or sets of fixed strings.
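
For example, Lucene's analysis module ships an HTML-stripping char filter that rewrites the character stream before any tokenizer sees it; a minimal sketch, assuming the HTMLStripCharFilter class from the analysis-common module:

  import java.io.Reader;
  import java.io.StringReader;

  import org.apache.lucene.analysis.charfilter.HTMLStripCharFilter;

  public class PreTokenizationDemo {
      public static void main(String[] args) throws Exception {
          // The char filter runs before the tokenizer, removing markup and decoding entities.
          Reader html = new StringReader("<p>Bikes &amp; <b>bike</b> parts</p>");
          try (Reader stripped = new HTMLStripCharFilter(html)) {
              StringBuilder out = new StringBuilder();
              char[] buf = new char[1024];
              for (int n; (n = stripped.read(buf)) != -1; ) {
                  out.append(buf, 0, n);
              }
              System.out.println(out.toString().trim());   // markup removed, entities decoded
          }
      }
  }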

2.2 - During

Sentence beginnings and endings can be identified during tokenization, allowing more accurate phrase and proximity searches.
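
One library-independent way to find those boundaries is the JDK's BreakIterator, which applies locale-sensitive sentence-break rules; a small sketch:

  import java.text.BreakIterator;
  import java.util.Locale;

  public class SentenceDemo {
      public static void main(String[] args) {
          String text = "Tokenization breaks text into terms. Sentence boundaries help phrase and proximity searches. Really!";
          BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.US);
          sentences.setText(text);

          int start = sentences.first();
          for (int end = sentences.next(); end != BreakIterator.DONE; start = end, end = sentences.next()) {
              // Each iteration yields one sentence span [start, end)
              System.out.println(text.substring(start, end).trim());
          }
      }
  }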

2.3 - Post (Analysis Process)

There are many post-tokenization steps that can be done, including (but not limited to):

  • Stemming – Replacing words with their stems. For instance, with English stemming, “bikes” is mapped to “bike”; a query for “bike” can then find both documents containing “bike” and those containing “bikes”.
  • Stop Words Filtering – Common words like “the”, “and” and “a” rarely add any value to a search. Removing them shrinks the index size and increases performance. It may also reduce some “noise” and actually improve search quality.
  • Text Normalization – Stripping accents and other character markings can make for better searching.
  • Synonym Expansion – Adding in synonyms at the same token position as the current word can mean better matching when users search with words in the synonym set.
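
A hypothetical Lucene analysis chain that combines several of these steps might look like the sketch below (class names come from Lucene's analysis-common module; constructors vary slightly across versions):

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.LowerCaseFilter;
  import org.apache.lucene.analysis.StopFilter;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.Tokenizer;
  import org.apache.lucene.analysis.en.EnglishAnalyzer;
  import org.apache.lucene.analysis.en.PorterStemFilter;
  import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
  import org.apache.lucene.analysis.standard.StandardTokenizer;

  public class PostTokenizationAnalyzer extends Analyzer {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
          Tokenizer source = new StandardTokenizer();                            // UAX #29 word breaking
          TokenStream result = new LowerCaseFilter(source);                      // normalization: case folding
          result = new ASCIIFoldingFilter(result);                               // normalization: strip accents
          result = new StopFilter(result, EnglishAnalyzer.getDefaultStopSet());  // stop word filtering
          result = new PorterStemFilter(result);                                 // stemming: "bikes" -> "bike"
          // a SynonymGraphFilter could be inserted here for synonym expansion
          return new TokenStreamComponents(source, result);
      }
  }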

3 - Operations

These processes are used for both indexing and querying, but the same operations are not always needed for both. For instance:

  • for indexing, a text normalization operation will increase recall because, for example, “ram”, “Ram” and “RAM” would all match a query for “ram”;
  • for querying, to increase query-time precision, an operation can narrow the matches by, for example, ignoring all-caps acronyms when the searcher is interested in male sheep but not in Random Access Memory.
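
To make the recall side concrete, the hypothetical snippet below prints the terms produced by a lowercasing analyzer and by one that leaves case alone; with the first, all three surface forms collapse to a single indexed term:

  import java.io.IOException;

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

  public class RecallDemo {
      static void printTerms(Analyzer analyzer, String text) throws IOException {
          try (TokenStream stream = analyzer.tokenStream("f", text)) {
              CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
              stream.reset();
              while (stream.incrementToken()) {
                  System.out.print(term + " ");
              }
              stream.end();
              System.out.println();
          }
      }

      public static void main(String[] args) throws IOException {
          String text = "ram Ram RAM";
          printTerms(new StandardAnalyzer(), text);    // ram ram ram  -> one term, higher recall
          printTerms(new WhitespaceAnalyzer(), text);  // ram Ram RAM  -> three terms, higher precision
      }
  }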

3.1 - Query

  • Case insensitivity, so “Analyzer” and “analyzer” match.
  • Stemming, so words like “Run” and “Running” are considered equivalent terms.
  • Stop Word Pruning, so small words like “an” and “my” don't affect the query.
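
These steps are typically applied by running the user's query string through an analyzer before matching; one possible sketch, assuming Lucene's classic QueryParser together with the EnglishAnalyzer:

  import org.apache.lucene.analysis.en.EnglishAnalyzer;
  import org.apache.lucene.queryparser.classic.QueryParser;
  import org.apache.lucene.search.Query;

  public class QueryAnalysisDemo {
      public static void main(String[] args) throws Exception {
          // EnglishAnalyzer lowercases, removes English stop words, and stems.
          QueryParser parser = new QueryParser("body", new EnglishAnalyzer());
          Query query = parser.parse("An Analyzer for Running");
          // "An" and "for" are pruned; the remaining terms are lowercased and
          // stemmed, so the parsed query looks like: body:analyz body:run
          System.out.println(query);
      }
  }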

4 - Algorithm

4.1 - Unicode

The Unicode Text Segmentation algorithm (Unicode Standard Annex #29, UAX #29) defines where word boundaries fall; it does not itself lowercase tokens or filter out stop words. Lucene's default analysis builds on it: the StandardTokenizer implements the UAX #29 word-break rules, and the StandardAnalyzer then applies lowercasing (and, if configured, stop word filtering) as separate token filters.
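
To see that split of responsibilities, the sketch below runs the UAX #29 tokenizer on its own (no lowercasing, no stop word removal); compare its output with the StandardAnalyzer example earlier on this page:

  import java.io.StringReader;

  import org.apache.lucene.analysis.standard.StandardTokenizer;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

  public class WordBreakDemo {
      public static void main(String[] args) throws Exception {
          try (StandardTokenizer tokenizer = new StandardTokenizer()) {
              tokenizer.setReader(new StringReader("The Quick Brown Fox can't jump 32.3 feet."));
              CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
              tokenizer.reset();
              while (tokenizer.incrementToken()) {
                  // Word-break rules only: case is preserved and nothing is dropped
                  System.out.println(term.toString());   // The, Quick, Brown, Fox, can't, jump, 32.3, feet
              }
              tokenizer.end();
          }
      }
  }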

5 - Documentation / Reference