What is Text? String or Character?

About

A character is an atomic unit of text, as specified by ISO/IEC 10646:2000 [ISO/IEC 10646].

Every unit of text (character) is assigned a unique integer known as a code point.
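As a small illustration of the character-to-code-point mapping, the sketch below uses Python's built-in `ord` and `chr`. The helper name `code_points` is only an illustrative choice, not part of any standard API.

```python
def code_points(text: str) -> list[int]:
    """Return the Unicode code point of every character in `text`."""
    return [ord(ch) for ch in text]

# "A", "e with acute accent", and the euro sign, written with escapes
# so the source file encoding does not matter.
sample = "A\u00e9\u20ac"

points = code_points(sample)
print(points)                           # [65, 233, 8364]
print([f"U+{p:04X}" for p in points])   # ['U+0041', 'U+00E9', 'U+20AC']

# chr() is the inverse mapping: code point -> character.
print(chr(0x20AC) == "\u20ac")          # True
```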

All the characters within a string share a common coding representation (i.e. a character set or encoding) that maps each code point to bytes. A font then maps the code point to a glyph (the visual representation of the character).
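To make the idea of a shared coding representation more concrete, here is a minimal sketch that stores the same code point under three encodings and then reads the bytes back with the wrong one. The helper `encode_all` and the choice of encodings are assumptions made for the example.

```python
def encode_all(text: str, encodings=("utf-8", "latin-1", "utf-16-be")) -> dict:
    """Encode `text` under several character encodings."""
    return {name: text.encode(name) for name in encodings}

e_acute = "\u00e9"  # one character, code point U+00E9

encoded = encode_all(e_acute)
print(encoded)
# {'utf-8': b'\xc3\xa9', 'latin-1': b'\xe9', 'utf-16-be': b'\x00\xe9'}

# Reading UTF-8 bytes as Latin-1 yields two unrelated characters
# (U+00C3 and U+00A9) instead of the original one.
garbled = encoded["utf-8"].decode("latin-1")
print(len(garbled), [f"U+{ord(c):04X}" for c in garbled])
# 2 ['U+00C3', 'U+00A9']
```

The code point never changes; only its byte representation does, which is why the writer and the reader of a string have to agree on the encoding.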

In programming languages, text is represented as a character or as a string.

Without an associated data schema (such as JavaScript, XML, …), text is generally considered unstructured.

Text is the basis of any language.

Text editors also often use a text tree (a rope, see wiki/Rope_(data_structure)) to speed up text transformations.
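The following is a minimal, unbalanced rope sketch meant only to show why a tree helps: concatenation creates a node instead of copying characters, and an insertion is a split followed by two concatenations. All names (`Rope`, `leaf`, `concat`, `split`, `insert`, `to_str`) are illustrative, and a real editor would also rebalance the tree and cap leaf sizes.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Rope:
    """A leaf holds text; an inner node holds two children."""
    text: str = ""
    left: Optional["Rope"] = None
    right: Optional["Rope"] = None
    length: int = 0

def leaf(text: str) -> Rope:
    return Rope(text=text, length=len(text))

def concat(a: Rope, b: Rope) -> Rope:
    # No characters are copied; only one new node is allocated.
    return Rope(left=a, right=b, length=a.length + b.length)

def split(rope: Rope, i: int) -> Tuple[Rope, Rope]:
    """Split into (first i characters, the rest)."""
    if rope.left is None:  # leaf: slice the small string it holds
        return leaf(rope.text[:i]), leaf(rope.text[i:])
    if i <= rope.left.length:
        head, tail = split(rope.left, i)
        return head, concat(tail, rope.right)
    head, tail = split(rope.right, i - rope.left.length)
    return concat(rope.left, head), tail

def insert(rope: Rope, i: int, text: str) -> Rope:
    head, tail = split(rope, i)
    return concat(concat(head, leaf(text)), tail)

def to_str(rope: Rope) -> str:
    if rope.left is None:
        return rope.text
    return to_str(rope.left) + to_str(rope.right)

doc = concat(leaf("Hello, "), leaf("world"))
doc = insert(doc, 7, "big ")
print(to_str(doc))  # Hello, big world
```

The rope is immutable here, so earlier versions of the document stay valid after an edit, which is one reason the structure suits undo histories.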

Structure

Regular expressions describe the structure of text.
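As a sketch of how a regular expression imposes structure on otherwise unstructured text, the example below extracts date-like fields with Python's `re` module. The pattern, the `find_dates` helper, and the sample sentence are all hypothetical.

```python
import re

# Hypothetical structure: an ISO-like date written as YYYY-MM-DD.
DATE = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

def find_dates(text: str) -> list:
    """Return one dict of named fields per date-like match in `text`."""
    return [m.groupdict() for m in DATE.finditer(text)]

print(find_dates("Released 2024-03-15, patched 2024-04-02."))
# [{'year': '2024', 'month': '03', 'day': '15'},
#  {'year': '2024', 'month': '04', 'day': '02'}]
```

The pattern only checks the shape of the text; it would also accept an impossible date such as 2024-13-45, so semantic validation remains a separate step.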

Attack

Many distinct characters look alike, and this resemblance can be exploited in an attack. See Characters - Homograph.
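A minimal sketch of the problem: two strings that may render identically but contain different code points. The `scripts` helper is a rough heuristic invented for this example (it takes the first word of each Unicode character name); Python's standard library has no script property, so a production check would rely on a dedicated confusables or script dataset instead.

```python
import unicodedata

latin = "paypal"
# Same look, but both "a" letters are CYRILLIC SMALL LETTER A (U+0430).
spoof = "p\u0430yp\u0430l"

print(latin == spoof)  # False

def scripts(text: str) -> set:
    """Heuristic script guess: first word of each character's Unicode name."""
    return {unicodedata.name(ch, "UNKNOWN").split()[0] for ch in text}

def is_mixed_script(text: str) -> bool:
    return len(scripts(text)) > 1

print(sorted(scripts(latin)))   # ['LATIN']
print(sorted(scripts(spoof)))   # ['CYRILLIC', 'LATIN']
print(is_mixed_script(spoof))   # True
```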

Operation

Text seems easy at first glance, but it is not.

Below are some common text operations:

  • Code Page/Character Set Conversion: Convert text data to or from a code page.
  • Collation: Compare strings according to the conventions and standards of a particular language, region, or country.
  • Formatting: Format numbers, dates, times, and currency amounts according to the conventions of a chosen locale. This includes translating month and day names into the selected language, choosing appropriate abbreviations, ordering fields correctly, etc.
  • Bidi (Bidirectionality): Support for handling text that mixes left-to-right (e.g. English) and right-to-left (e.g. Arabic or Hebrew) data.
  • Text Boundaries: Locate the positions of words, sentences, and paragraphs within a range of text, or identify locations that would be suitable for line wrapping when displaying the text.
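Two of the operations above can be approximated with Python's standard library. This is a simplified sketch, not an implementation of the Unicode bidirectional or segmentation algorithms: `bidi_classes` only reports each character's bidi category, `word_spans` treats any run of word characters as a word, and `textwrap` breaks lines on whitespace. The helper names and sample strings are assumptions.

```python
import re
import textwrap
import unicodedata

def bidi_classes(text: str) -> list:
    """Bidirectional category of each character (L, R, AL, WS, ...)."""
    return [unicodedata.bidirectional(ch) for ch in text]

# Latin letters followed by Hebrew alef and bet.
mixed = "abc \u05d0\u05d1"
print(bidi_classes(mixed))  # ['L', 'L', 'L', 'WS', 'R', 'R']

def word_spans(text: str) -> list:
    """(start, end) offsets of each run of word characters."""
    return [(m.start(), m.end()) for m in re.finditer(r"\w+", text)]

print(word_spans("Text is hard."))  # [(0, 4), (5, 7), (8, 12)]

# Candidate line breaks: no line exceeds the requested width.
lines = textwrap.wrap("Text seems easy at first glance", width=12)
for line in lines:
    print(line)
```

Languages written without spaces between words, or text containing combining marks, need locale-aware boundary rules that these approximations do not cover.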