Natural Language - Text Modeling

1 - About

Data modeling (relational database, code, graph, text) applied to text.

How do you store and represent text?


3 - Type

For full-text search, modeling free text in a database (a text engine) is, in principle, a simple matter of:

  • building an inverted file relation with tuples of the form (word, documentID, position),
  • building a B+-tree index over the word column,
  • adding metadata to aid in rank-ordering search results,
  • and applying some linguistic canonicalization of words (for example case folding or stemming).
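
The steps above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration under stated assumptions, not a real text engine: the table name inverted_file, the helpers canonicalize, build_index and search, and the sample documents are all hypothetical. Canonicalization is reduced to lowercasing, ranking uses a plain occurrence count instead of stored metadata, and SQLite builds a B-tree index, which stands in for the B+-tree mentioned above.

```python
import re
import sqlite3


def canonicalize(token):
    # Stand-in for linguistic canonicalization: case folding only (assumption).
    return token.lower()


def build_index(docs):
    """Load documents into an inverted file relation (word, document_id, position)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE inverted_file (word TEXT, document_id INTEGER, position INTEGER)"
    )
    for doc_id, text in docs.items():
        for position, token in enumerate(re.findall(r"\w+", text)):
            conn.execute(
                "INSERT INTO inverted_file VALUES (?, ?, ?)",
                (canonicalize(token), doc_id, position),
            )
    # Index over the word column so lookups do not scan the whole relation.
    conn.execute("CREATE INDEX idx_inverted_file_word ON inverted_file (word)")
    conn.commit()
    return conn


def search(conn, word):
    """Return (document_id, hits) pairs, most occurrences first."""
    rows = conn.execute(
        "SELECT document_id, COUNT(*) AS hits FROM inverted_file "
        "WHERE word = ? GROUP BY document_id ORDER BY hits DESC, document_id",
        (canonicalize(word),),
    )
    return rows.fetchall()


docs = {
    1: "The quick brown fox",
    2: "The lazy dog and the quick cat",
    3: "Quick quick QUICK",
}
conn = build_index(docs)
print(search(conn, "Quick"))  # [(3, 3), (1, 1), (2, 1)]
```

Counting occurrences at query time is the simplest possible ranking signal; a real engine would typically precompute and store ranking metadata alongside the relation.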

Performance optimization:

  • denormalizing the schema so that each word appears only once, with a list of occurrences per word, i.e. (word, list<documentID, position>). This allows aggressive delta-compression of the list (typically called a postings list), which is critical given the characteristically skewed (Zipfian) distribution of words in documents.
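
As a rough illustration of the denormalized layout and of why delta-compression helps, the sketch below groups tuples into postings lists, then stores sorted document IDs as gaps encoded with variable-length bytes. The function names and the sample numbers are hypothetical, and the variable-byte encoding is one common choice assumed here, not something the text above prescribes.

```python
from collections import defaultdict


def build_postings(rows):
    """Group (word, doc_id, position) tuples into word -> [(doc_id, position), ...]."""
    postings = defaultdict(list)
    for word, doc_id, position in sorted(rows):
        postings[word].append((doc_id, position))
    return dict(postings)


def delta_encode(doc_ids):
    """Replace sorted document IDs by the gap to the previous ID."""
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def delta_decode(gaps):
    """Inverse of delta_encode: running sum of the gaps."""
    doc_ids, total = [], 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids


def varint_encode(numbers):
    """Variable-byte encoding: 7 payload bits per byte, high bit marks continuation."""
    out = bytearray()
    for n in numbers:
        while n >= 0x80:
            out.append((n & 0x7F) | 0x80)
            n >>= 7
        out.append(n)
    return bytes(out)


rows = [("fox", 2, 5), ("dog", 1, 0), ("fox", 1, 3)]
print(build_postings(rows))  # {'dog': [(1, 0)], 'fox': [(1, 3), (2, 5)]}

doc_ids = [1000, 1003, 1010, 1500]
gaps = delta_encode(doc_ids)
print(gaps)                          # [1000, 3, 7, 490]
print(len(varint_encode(doc_ids)))   # 8 bytes without delta encoding
print(len(varint_encode(gaps)))      # 6 bytes with delta encoding
```

Frequent words have long postings lists with small gaps between consecutive document IDs, so most gaps fit in a single byte. That is where the skewed word distribution makes the compression pay off.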

3.2 - Representation

Language models can be used to:

  • detect and correct spelling errors.

The N-gram language model, which predicts a word from the N-1 words that precede it, is one of the most widely used language modeling approaches.
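
To make the idea concrete, here is a minimal sketch of a bigram (N = 2) model used to choose between candidate spellings from the preceding word. The tiny corpus, the function names and the use of add-one (Laplace) smoothing are assumptions made for illustration; practical spelling correctors combine such a model with an error model and far more data.

```python
from collections import Counter


def train_bigram(sentences):
    """Count contexts and bigrams; '<s>' marks the start of a sentence."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split()
        contexts.update(tokens[:-1])              # words that have a successor
        bigrams.update(zip(tokens, tokens[1:]))   # (previous word, word) pairs
        vocab.update(tokens[1:])
    return contexts, bigrams, vocab


def bigram_prob(prev, word, contexts, bigrams, vocab):
    """P(word | prev) with add-one smoothing so unseen pairs keep a small probability."""
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))


def best_candidate(prev, candidates, contexts, bigrams, vocab):
    """Pick the candidate spelling that is most probable after `prev`."""
    return max(candidates, key=lambda w: bigram_prob(prev, w, contexts, bigrams, vocab))


corpus = ["I want to eat", "I want to sleep", "they went to school"]
contexts, bigrams, vocab = train_bigram(corpus)

print(bigram_prob("want", "to", contexts, bigrams, vocab))    # 0.3
print(best_candidate("want", ["too", "to", "two"], contexts, bigrams, vocab))  # to
```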