Character Set - UTF8

About

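UTF-8 (the 8-bit Unicode Transformation Format) is a variable-length character encoding: each Unicode code point is represented as a sequence of one or more bytes.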

Category

UTF-8 bytes fall into mutually exclusive (“watertight”) categories, as follows:

Single Byte

Bytes 0x00 to 0x7F are single bytes. Each one represents a single code point, in exactly the same format as in 7-bit US-ASCII (and in the lower half of Latin-1).

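Header

Bytes 0xC0 to 0xFD are header bytes in a multibyte sequence under the original UTF-8 design. The current standard (RFC 3629) only allows 0xC2 to 0xF4, because 0xC0 and 0xC1 could only start overlong encodings and anything above 0xF4 would encode a code point beyond U+10FFFF. A header byte MUST be the first byte of its sequence, and the number of “one” bits above the topmost “zero” bit gives the number of bytes (including the header itself) in the whole sequence.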
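The bit-counting rule can be sketched in a few lines of Python. This is only an illustration: `sequence_length` is a hypothetical helper name, and it accepts the widest historical range (0xC0 to 0xFD) rather than applying any stricter modern limit.

```python
def sequence_length(header: int) -> int:
    """Return the number of bytes announced by a UTF-8 header byte.

    Hypothetical helper: counts the "one" bits above the topmost "zero" bit.
    Assumes the widest historical header range, 0xC0 to 0xFD.
    """
    if not 0xC0 <= header <= 0xFD:
        raise ValueError(f"0x{header:02X} is not a header byte")
    ones = 0
    mask = 0x80
    while header & mask:  # walk down from the top bit until a zero bit is found
        ones += 1
        mask >>= 1
    return ones


print(sequence_length(0xE9))  # 3
print(sequence_length(0xF0))  # 4
```

A 0xE9 header (binary 1110.1001) therefore announces a three-byte sequence, and 0xF0 (binary 1111.0000) a four-byte one.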

Trailer

Bytes 0x80 to 0xBF are trailer bytes in a multibyte sequence. They can occupy any position in the sequence except the first.

Payload

In the bytes of a multi-byte sequence, all bits after the topmost “zero” bit in each byte constitute the payload: they are data bits, and in UTF-8 the most significant bits always come first.
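As a rough sketch of this rule, the Python below strips everything down to and including the topmost zero bit of each byte and joins what is left, most significant bits first. The names `payload_bits` and `join_payload` are hypothetical, and the code does no validation beyond rejecting 0xFE and 0xFF, so it should not be mistaken for a conforming decoder.

```python
def payload_bits(byte: int) -> tuple[int, int]:
    """Return (value, width) of the bits below the topmost zero bit."""
    if byte in (0xFE, 0xFF):
        raise ValueError("0xFE and 0xFF carry no payload in UTF-8")
    mask = 0x80
    width = 7
    while byte & mask:  # skip the leading "one" bits
        mask >>= 1
        width -= 1
    # mask now sits on the topmost zero bit; the bits below it are payload
    return byte & (mask - 1), width


def join_payload(sequence: bytes) -> int:
    """Concatenate the payload bits of one sequence, most significant first."""
    code_point = 0
    for byte in sequence:
        value, width = payload_bits(byte)
        code_point = (code_point << width) | value
    return code_point


print(f"U+{join_payload(bytes([0xC3, 0xA9])):04X}")  # U+00E9
```

Here 0xC3 contributes the five bits 00011 and 0xA9 the six bits 101001, which together give U+00E9.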

Invalid

Bytes 0xFE and 0xFF are always invalid anywhere in UTF-8 text.
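The categories above translate directly into a range check. The following is a minimal sketch with a hypothetical `classify_byte` helper; it uses the byte ranges exactly as listed on this page (headers up to 0xFD) and does not apply any stricter rule.

```python
def classify_byte(byte: int) -> str:
    """Name the category of one byte, using the ranges listed on this page."""
    if not 0x00 <= byte <= 0xFF:
        raise ValueError("not a byte value")
    if byte <= 0x7F:
        return "single"
    if byte <= 0xBF:
        return "trailer"
    if byte <= 0xFD:
        return "header"
    return "invalid"  # 0xFE and 0xFF


print([classify_byte(b) for b in bytes.fromhex("e9a6ac")])
# ['header', 'trailer', 'trailer']
```

Because the ranges do not overlap, every byte belongs to exactly one category, which is what lets a reader resynchronise in the middle of a stream.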

The Unicode code space was originally foreseen as ranging from U+0000 to U+7FFFFFFF, but the current standards end it at U+10FFFF. In addition, code points whose hex representation is xxFFFE or xxFFFF (where xx is any plane number from 00 to 10) have been expressly designated as noncharacters, never to be assigned, so the highest usable code point is U+10FFFD.
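A small Python sketch of these two limits follows. The function names and the `MAX_CODE_POINT` constant are hypothetical, and the check deliberately covers only the rules stated in this paragraph; other reserved ranges, such as the surrogates, are ignored.

```python
MAX_CODE_POINT = 0x10FFFF  # end of the current Unicode code space


def is_plane_end(code_point: int) -> bool:
    """True for code points of the form xxFFFE or xxFFFF."""
    return (code_point & 0xFFFF) in (0xFFFE, 0xFFFF)


def is_usable(code_point: int) -> bool:
    """Apply only the two limits described in the paragraph above."""
    return 0 <= code_point <= MAX_CODE_POINT and not is_plane_end(code_point)


print(is_usable(0x10FFFD))  # True
print(is_usable(0x10FFFE))  # False
print(is_usable(0x1FFFF))   # False
```

Since U+10FFFE and U+10FFFF are themselves of the form xxFFFE and xxFFFF, the highest value that passes the check is U+10FFFD.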

Note

Some hanzi lie above U+20000; their UTF-8 encoding consists of four bytes, not three. For example, 𠄣 = U+20123 = UTF-8 0xF0 0xA0 0x84 0xA3, or %F0%A0%84%A3 in percent-encoded (URL-escaped) form.
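This example can be checked with the Python standard library. The sketch below relies only on `str.encode` and `urllib.parse.quote`, which percent-encodes the UTF-8 bytes of a string by default; the variable names are illustrative.

```python
from urllib.parse import quote

hanzi = chr(0x20123)            # the code point from the note above
encoded = hanzi.encode("utf-8")

print(len(encoded))             # 4
print(encoded.hex(" "))         # f0 a0 84 a3
print(quote(hanzi))             # %F0%A0%84%A3
```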

Translation of a UTF-8 Multibyte sequence to Unicode

Example 1

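Problem: translate the 3-byte sequence e9 a6 ac into Unicode.

Steps:

e9 = 1110.1001 binary: header of a 3-byte sequence, payload 1001
a6 = 1010.0110 binary: trailer, payload 10.0110
ac = 1010.1100 binary: trailer, payload 10.1100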

Result: Concatenated payload bits 1001.1001.1010.1100 binary, or U+99AC
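For the three-byte case, the same translation can be written with fixed masks and shifts. This is a sketch rather than a full decoder: `decode_three_bytes` is a hypothetical name, and it checks only the bit patterns of the header and trailers, not overlong forms or reserved code points.

```python
def decode_three_bytes(b0: int, b1: int, b2: int) -> int:
    """Decode one 3-byte sequence: header 1110xxxx, trailers 10xxxxxx."""
    if (b0 & 0xF0) != 0xE0 or (b1 & 0xC0) != 0x80 or (b2 & 0xC0) != 0x80:
        raise ValueError("not a 3-byte UTF-8 sequence")
    # 4 payload bits from the header, then 6 from each trailer
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


code_point = decode_three_bytes(0xE9, 0xA6, 0xAC)
print(f"{code_point:016b}")   # 1001100110101100
print(f"U+{code_point:04X}")  # U+99AC
```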

Example 2

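Problem: translate the 3-byte sequence e2 80 93 into Unicode.

Steps:

e2 = 1110.0010 binary: header of a 3-byte sequence, payload 0010
80 = 1000.0000 binary: trailer, payload 00.0000
93 = 1001.0011 binary: trailer, payload 01.0011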

Result: Concatenated payload bits 0010.0000.0001.0011 binary, or U+2013
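Both results can be cross-checked against Python's built-in UTF-8 codec. The helper name `decode_hex` is hypothetical; the calls it uses (`bytes.fromhex`, `bytes.decode`, `ord`) are standard.

```python
def decode_hex(hex_bytes: str) -> int:
    """Decode a hex-written UTF-8 sequence holding exactly one character."""
    text = bytes.fromhex(hex_bytes).decode("utf-8")
    if len(text) != 1:
        raise ValueError("expected exactly one character")
    return ord(text)


for sample in ("e9 a6 ac", "e2 80 93"):
    print(f"{sample} -> U+{decode_hex(sample):04X}")
# e9 a6 ac -> U+99AC
# e2 80 93 -> U+2013
```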
