About

Unicode is a global character set that allows multilingual text to be displayed in a single application.

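Unicode is closely tied to the Universal Coded Character Set (UCS), the ISO/IEC 10646 standard with which it shares its character repertoire. The name Unicode itself is not an acronym.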

Unicode makes it possible to develop a single multilingual application and deploy it worldwide with only one character set.

It is a multi-octet character set, meaning that a character can be stored in more than one octet.

Therefore, it can represent a much larger variety of characters beyond the Roman alphabet (Greek, Russian, mathematical symbols, logograms from non-phonetic writing systems such as kanji, etc.).
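As a small illustration of the multi-octet idea, the Python sketch below shows how many octets a few sample characters occupy when stored as UTF-8. The helper name `utf8_octets` and the sample characters are assumptions chosen for this example only.

```python
def utf8_octets(ch: str) -> bytes:
    """Return the octets that store one character in UTF-8."""
    return ch.encode("utf-8")

# Sample characters (illustrative): Latin A, e with acute, a kanji, an emoji.
samples = ["A", "\u00e9", "\u6f22", "\U0001F600"]

for ch in samples:
    octets = utf8_octets(ch)
    # Print the code point, the octet count, and the octets in hexadecimal.
    print(f"U+{ord(ch):04X}: {len(octets)} octet(s) -> {octets.hex(' ')}")
```

Characters outside the ASCII range need two, three, or four octets in this encoding.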

Unicode supports:

  • most of the currently spoken languages of the world,
  • and many historical scripts (alphabets).

Range

Historically, the designers of Unicode underestimated the number of code points required and originally thought that Unicode would need no more than <math>2^{16}</math> of them. This is how the original 16-bit UCS-2 encoding was born.

The standard later expanded to its current range of just over <math>2^{20}</math> code points (U+0000 to U+10FFFF). The increased range is organized into 17 subranges, called planes, of <math>2^{16}</math> code points each.

  • The first of these, known as the Basic Multilingual Plane (or BMP), consists of the original <math>2^{16}</math> code points.
  • The additional 16 ranges are known as the supplementary planes.

<math>2^{20} = 1,048,576</math>, and the full range holds <math>17 \times 2^{16} = 1,114,112</math> code points.
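The plane arithmetic can be sketched in a few lines of Python. The names `PLANE_SIZE`, `PLANE_COUNT`, `MAX_CODE_POINT`, and `plane_of` are hypothetical and introduced only for this illustration.

```python
PLANE_SIZE = 2 ** 16                            # code points per plane
PLANE_COUNT = 17                                # BMP + 16 supplementary planes
MAX_CODE_POINT = PLANE_SIZE * PLANE_COUNT - 1   # last code point, 0x10FFFF

def plane_of(code_point: int) -> int:
    """Return the plane number (0 = BMP) that contains a code point."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    return code_point // PLANE_SIZE

print(hex(MAX_CODE_POINT))   # upper bound of the Unicode range
print(plane_of(0x0041))      # a BMP character
print(plane_of(0x1F600))     # a supplementary-plane character
```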

Encoding | Code unit size (bits) | Code unit values | Bytes per code point | First code point | Last code point
UTF-8 | 8 | <math>2^{8}</math> | 1 to 4 | 0x0 | 0x10FFFF (1,114,111 decimal)
UCS-2 | 16 | <math>2^{16}</math> | 2 | 0x0 | 0xFFFF (65,535 decimal)
UTF-16 | 16 | <math>2^{16}</math> | 2 or 4 | 0x0 | 0x10FFFF (1,114,111 decimal)
UTF-32 | 32 | <math>2^{32}</math> | 4 | 0x0 | 0x10FFFF (1,114,111 decimal)
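To make the storage sizes concrete, the following Python sketch measures how many bytes one character takes in each encoding. The function name `encoded_sizes` is an assumption for this example, and the big-endian codec variants are used only so that no byte order mark is added to the count.

```python
def encoded_sizes(ch: str) -> dict:
    """Bytes needed to store one character in each encoding (no BOM)."""
    return {
        "UTF-8": len(ch.encode("utf-8")),
        "UTF-16": len(ch.encode("utf-16-be")),
        "UTF-32": len(ch.encode("utf-32-be")),
    }

# A BMP character (euro sign) and a supplementary-plane character (emoji).
for ch in ["A", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

A code point above 0xFFFF takes four bytes in UTF-16 because it is stored as a pair of 16-bit code units (a surrogate pair).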

Structure

It specifies:

Form

Scheme

Unicode allows several different binary encoding schemes for code points.

The most popular standard encodings of Unicode are:

  • UTF-8,
  • UTF-16,
  • and UTF-32.

The rest being:

  • UTF-16BE,
  • UTF-16LE,
  • UTF-32BE,
  • and UTF-32LE.
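The BE and LE variants differ only in byte order. As a minimal sketch, assuming Python's built-in codecs, the same character can be encoded both ways and compared:

```python
import codecs

text = "A"  # U+0041

big_endian = text.encode("utf-16-be")      # most significant byte first
little_endian = text.encode("utf-16-le")   # least significant byte first
with_bom = text.encode("utf-16")           # BOM, then the platform's byte order

print(big_endian.hex(" "))
print(little_endian.hex(" "))
# The generic "utf-16" codec prefixes a byte order mark so a reader can tell
# which of the two orders follows.
print(with_bom[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE))
```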

Notation/Encoding

Hexadecimal

Unicode code points are denoted as U+hhhh, where “hhhh” is a sequence of at least four, and at most six hexadecimal digits.

Characters are denoted using the notation used in the Unicode Standard, that is, an optional U+ followed by their hexadecimal number, using at least 4 digits, such as “U+1234” or “U+10FFFD”.

In XML or HTML this could be expressed as “&#x1234;” or “&#x10FFFD;”.
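Both notations can be produced from the numeric code point. The helpers below are a sketch; the names `u_notation` and `xml_char_ref` are hypothetical.

```python
def u_notation(code_point: int) -> str:
    """Format a code point as U+hhhh, padded to at least four hex digits."""
    return f"U+{code_point:04X}"

def xml_char_ref(code_point: int) -> str:
    """Format a code point as an XML/HTML hexadecimal character reference."""
    return f"&#x{code_point:X};"

for cp in (0x41, 0x1234, 0x10FFFD):
    print(u_notation(cp), xml_char_ref(cp))
```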

Binary

From the hexadecimal form, you can always go to the binary form, since each hexadecimal digit maps to exactly four bits.
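A short sketch of that conversion, assuming a hypothetical helper `to_binary` that pads the result to a fixed width:

```python
def to_binary(code_point: int, width: int = 16) -> str:
    """Return the binary digits of a code point, left-padded with zeros."""
    return format(code_point, f"0{width}b")

bits = to_binary(0x1234)
print(bits)
# Converting back from binary yields the original code point.
print(hex(int(bits, 2)))
```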

Data

The UnicodeData file is part of the Unicode Character Database maintained by the Unicode Consortium. This file specifies various properties including name and general category for every defined Unicode code point or character range.

The file and its description are available from the Unicode Consortium at: http://www.unicode.org
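Many languages ship a copy of this data. As an illustration, Python's standard `unicodedata` module exposes the name and general category of a character. The helper name `describe` is an assumption, and the database version available depends on the Python release in use.

```python
import unicodedata

def describe(ch: str) -> tuple:
    """Return (code point, name, general category) for one character."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch, "<unnamed>"),  # fallback for unnamed code points
        unicodedata.category(ch),
    )

for ch in ["A", "\u20ac", "5"]:
    print(describe(ch))
```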

Specifically:

Example: scripts

Detection

See BOM (byte order mark)
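A BOM check can be sketched as a prefix comparison against the known signatures. The function name `sniff_bom` and the label strings are assumptions for this example. The check is only a heuristic: data without a BOM yields no answer, and UTF-16LE text that begins with U+0000 right after its BOM would be mistaken for UTF-32LE.

```python
import codecs

# Longer signatures come first: the UTF-32LE BOM starts with the UTF-16LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
]

def sniff_bom(data: bytes):
    """Return the encoding suggested by a leading BOM, or None if absent."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None

print(sniff_bom(b"\xef\xbb\xbfhello"))
print(sniff_bom(b"hello"))
```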

Unicode and Computer Language

Unicode is the native encoding of many technologies, including:

  • Java,
  • XML,
  • XHTML,
  • ECMAScript (JavaScript),
  • and LDAP.

To ASCII Punycode

Unicode characters may be translated into the ASCII character set via the Punycode encoding.
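As an illustration, Python ships a `punycode` codec that performs this translation. The wrapper names `to_punycode` and `from_punycode` are hypothetical.

```python
def to_punycode(label: str) -> str:
    """Encode a Unicode string as ASCII using Punycode."""
    return label.encode("punycode").decode("ascii")

def from_punycode(encoded: str) -> str:
    """Decode a Punycode ASCII string back to Unicode."""
    return encoded.encode("ascii").decode("punycode")

ascii_form = to_punycode("b\u00fccher")   # "bücher"
print(ascii_form)
print(from_punycode(ascii_form))
```

In internationalized domain names, the encoded label is additionally prefixed with `xn--`.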

Invalid Sequence to Replacement Character

When printing UTF-8 data bytes, an invalid sequence is generally replaced with the � character (i.e. U+FFFD REPLACEMENT CHARACTER).
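This behavior can be reproduced with Python's `errors="replace"` decoding mode. The sketch below uses a hypothetical helper `decode_lenient` and a sample byte string containing one invalid byte.

```python
def decode_lenient(data: bytes) -> str:
    """Decode UTF-8, substituting U+FFFD for each invalid sequence."""
    return data.decode("utf-8", errors="replace")

raw = b"caf\xc3\xa9 \xff ok"   # valid UTF-8 except for the lone 0xFF byte

print(decode_lenient(raw))

# The default strict mode refuses the same input instead of replacing it.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print("strict decoding failed:", exc.reason)
```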

Documentation / Reference