Text - Encoding (Character Set|charset|code page)

1 - About

A character set is a repertoire of characters in which each character is (assigned to|encoded as) a numeric code point.

In other words, it is a mapping between Strings and bytes, used to convert Strings to bytes (and vice versa) as defined in the Encoding standard. Software uses it when reading character information from a file or writing it to one.
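The mapping can be sketched in a few lines of Python. This is only an illustration: the sample string and the choice of UTF-8 are assumptions, not something prescribed by the text above.

```python
# Round trip between a string and its byte representation.
text = "h\u00e9llo"                 # "hello" with an e-acute, written as an escape
data = text.encode("utf-8")         # characters -> bytes
restored = data.decode("utf-8")     # bytes -> characters

print(data)                         # b'h\xc3\xa9llo'
print(restored == text)             # True
```

The string has five characters but six bytes, because the accented letter takes two bytes in UTF-8.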

Different encodings make trade-offs between the amount of storage required for a string and the speed of operations such as indexing into a string.
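As a rough illustration of that trade-off, the sketch below encodes one sample string (an assumption chosen for the example) with three Unicode encodings and compares the sizes. It also shows why a fixed-width encoding makes indexing cheap.

```python
text = "na\u00efve \u65e5\u672c"    # 8 characters: "naive" with a diaeresis, a space, two kanji

sizes = {name: len(text.encode(name))
         for name in ("utf-8", "utf-16-le", "utf-32-le")}
print(sizes)    # {'utf-8': 13, 'utf-16-le': 16, 'utf-32-le': 32}

# With a fixed-width encoding, character i starts at byte 4 * i,
# so the seventh character can be sliced out without scanning.
utf32 = text.encode("utf-32-le")
seventh = utf32[4 * 6:4 * 7].decode("utf-32-le")
print(seventh == text[6])           # True
```

UTF-8 is the most compact here but has variable width, so finding the n-th character requires walking the bytes from the start.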

In computer science, the terms:

  • charset,
  • code set,
  • coded character set (CCS),
  • character encoding,
  • character map,
  • character set,
  • coding representation

are historically synonymous, as the same standard would specify a repertoire of characters and how they were to be encoded into a stream of code units — usually with a single character per code unit.

RFC 2278 defines a charset as a method of converting a sequence of octets into a sequence of characters. In Java, which borrows the term from that RFC, a charset is a named mapping between sequences of sixteen-bit Unicode code units and sequences of bytes.

A (codepage|character encoding) is a:

  • set of rules,
  • coding representation,
  • table of values,
  • encoding scheme,
  • or list of selected characters

used to map a set of characters to their on-disk representation through code points.

Charsets are usually defined to support specific languages or groups of languages that share common writing systems. For example, codepage 1253 provides the character codes required in the Greek writing system, and codepage 1252 provides the characters for Western European Latin writing systems including English, German and French.

As a result of having many character encoding methods in use (and the need for backward compatibility with archived data), many computer programs must translate data between encoding schemes.
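A minimal sketch of such a translation, assuming Python's built-in codecs: the bytes are decoded with the source charset and re-encoded with the target one. The helper name `transcode` is hypothetical.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Translate bytes from one charset to another via an intermediate string."""
    return data.decode(source).encode(target)

# Greek letters alpha, beta, gamma stored in Windows code page 1253 ...
greek_cp1253 = "\u03b1\u03b2\u03b3".encode("cp1253")
# ... translated to UTF-8.
greek_utf8 = transcode(greek_cp1253, "cp1253", "utf-8")

print(greek_cp1253)   # b'\xe1\xe2\xe3'
print(greek_utf8)     # b'\xce\xb1\xce\xb2\xce\xb3'
```

The translation only works when the source charset is known; decoding with the wrong one silently produces different characters.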

3 - Decoder/Encoder

A decoder is an engine which transforms bytes in a specific charset into characters, and an encoder is an engine which transforms characters into bytes. Encoders and decoders operate on byte and character buffers. They are collectively referred to as coders.
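The buffer-oriented behaviour can be illustrated with Python's incremental coders from the standard `codecs` module (a sketch in Python rather than the API the paragraph describes). The chunk boundaries are an assumption chosen so that one character is split across two buffers.

```python
import codecs

# A decoder keeps state between calls, so a multi-byte character
# may arrive split across several byte buffers.
decoder = codecs.getincrementaldecoder("utf-8")()
chunks = [b"caf", b"\xc3", b"\xa9"]          # e-acute is split over two chunks
parts = [decoder.decode(chunk) for chunk in chunks]
parts.append(decoder.decode(b"", final=True))  # flush; raises if bytes are left over
text = "".join(parts)

# The matching encoder goes the other way.
encoder = codecs.getincrementalencoder("utf-8")()
data = encoder.encode(text, final=True)

print(parts)   # ['caf', '', 'é', '']
print(data)    # b'caf\xc3\xa9'
```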

4 - Scheme

A character encoding scheme is a character encoding form plus byte serialization. There are seven character encoding schemes in Unicode: UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE.
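The difference between the schemes is the byte order and the optional byte order mark (BOM). A small sketch, using Python's codec names for the schemes listed above:

```python
import codecs

big = "A".encode("utf-16-be")       # most significant byte first, no BOM
little = "A".encode("utf-16-le")    # least significant byte first, no BOM
with_bom = "A".encode("utf-16")     # BOM followed by the platform's native byte order

print(big)       # b'\x00A'
print(little)    # b'A\x00'
print(with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))  # True

# The BOM lets a decoder pick the right byte order by itself.
print(with_bom.decode("utf-16"))    # A
```

The same pattern holds for UTF-32 with four-byte code units; UTF-8 has a single byte order and therefore a single scheme.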

5 - Default

On Windows the default character encoding for Western European locales is typically Cp1252; on Unix it is usually UTF-8. Both encodings have illegal byte sequences (more in UTF-8 than in Cp1252).
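The following sketch reports the default encoding of the current machine and counts, for each of the two encodings, the single bytes that cannot be decoded on their own. The helper `undecodable_single_bytes` is hypothetical, and the counts reflect Python's codec tables.

```python
import locale

# Default text encoding of this machine (result depends on the platform and locale).
print(locale.getpreferredencoding(False))

def undecodable_single_bytes(encoding):
    """Return the byte values that are illegal as a one-byte input."""
    bad = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            bad.append(value)
    return bad

bad_cp1252 = undecodable_single_bytes("cp1252")
bad_utf8 = undecodable_single_bytes("utf-8")

print([hex(v) for v in bad_cp1252])   # ['0x81', '0x8d', '0x8f', '0x90', '0x9d']
print(len(bad_utf8))                  # 128
```

In UTF-8 every byte from 0x80 to 0xFF is only valid as part of a well-formed multi-byte sequence, which is why so many more inputs are rejected.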

6 - Order

7 - Common Charset

7.1 - Unicode

7.2 - ASCII

7.3 - UCS

8 - Example

In the ASCII encoding scheme for code page 850, for example:

  • “A” is assigned code point X'41',
  • and “B” is assigned code point X'42'.

Within a code page, each code point has only one specific meaning.
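The assignments above can be checked with Python's `cp850` codec. The second half of the sketch is an added illustration: the same byte value has one meaning within a code page, but may mean something else in another code page.

```python
# "A" and "B" in code page 850.
a = "A".encode("cp850")
b = "B".encode("cp850")
print(a.hex(), b.hex())     # 41 42

# One byte value, interpreted through two different code pages.
byte = b"\x82"
in_cp850 = byte.decode("cp850")     # e with acute accent (U+00E9)
in_cp1252 = byte.decode("cp1252")   # single low-9 quotation mark (U+201A)
print(in_cp850 != in_cp1252)        # True
```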

9 - Single-byte, double-byte character sets and Unicode

9.1 - SBCS

single-byte character set (SBCS)

Languages written in any of the Western or Middle Eastern alphabets have relatively few characters.

English, for example, has:

  • twenty-six uppercase letters,
  • twenty-six lowercase letters,
  • ten numerals,
  • a blank space,
  • some punctuation marks,
  • and a few special symbols like the ampersand and the plus sign.

A single-byte Windows code page can contain at most 256 code points. A single byte is large enough to assign a different code point to each character, so a set of code points used for English is called a single-byte code page. A font character set intended for use with a single-byte code page is called a single-byte character set (SBCS).

9.2 - DBCS

Windows codepage 1253 provides the character codes required in the Greek writing system, and codepage 1252 provides the characters for Western European Latin writing systems including English, German and French.

The lower 128 code points are shared; it is the upper 128 code points that contain either the accented characters or the Greek characters. Thus you cannot store Greek and German in the same code stream unless you add some type of identifier to indicate which codepage you are referencing.
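A short sketch of this limitation, assuming Windows code pages 1252 (Western European) and 1253 (Greek) as exposed by Python's codecs. The helper `can_encode` and the sample text are hypothetical.

```python
# The same upper-half byte means different things in the two code pages.
byte = b"\xe1"
western = byte.decode("cp1252")   # a with acute accent (U+00E1)
greek = byte.decode("cp1253")     # Greek small alpha (U+03B1)

def can_encode(text, encoding):
    """Return True if every character of text exists in the given charset."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

mixed = "Stra\u00dfe \u03b1\u03b2\u03b3"   # a German word followed by Greek letters
print(can_encode(mixed, "cp1252"))   # False: no Greek letters
print(can_encode(mixed, "cp1253"))   # False: no sharp s
print(can_encode(mixed, "utf-8"))    # True
```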

Asian languages far exceed the 256-character limit imposed by a single byte. Japanese, for example, uses about 2000 kanji for everyday purposes, more kanji for special vocabularies, two phonetic syllabaries, Latin alphabetic characters, Arabic numerals, and both Japanese and Western punctuation marks.

A different scheme needed to be developed but it had to be based on the concept of 256 character codepages. Thus DBCS (Double Byte Character Sets) were born.

It takes two bytes to assign a code point to each character, so a set of code points used for Japanese is called a double-byte code page; and a Japanese font character set is called a double-byte character set (DBCS). Programs written for single-byte code pages won't work for Asian languages.

DBCS meant that you had to write code that would treat each pair of code points as one character.

Each Asian character is represented by a pair of code points (thus double-byte).

For programming awareness, a set of code points is set aside as lead bytes: they represent the first byte of a pair and are not valid unless they are immediately followed by a defined second byte.
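The lead-byte rule can be observed with Windows code page 932 (Shift-JIS), used here as an assumed example of a double-byte code page. The helper `decodes` is hypothetical.

```python
encoded = "\u65e5".encode("cp932")   # one kanji (the "sun/day" character)
lead = encoded[:1]                   # its first byte taken alone

def decodes(data, encoding="cp932"):
    """Return True if data is a valid byte sequence in the given charset."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

print(len(encoded))        # 2
print(decodes(encoded))    # True: lead byte followed by a defined second byte
print(decodes(lead))       # False: a lead byte alone is not a character

# Code that slices such a byte string at an arbitrary offset can cut a pair in half.
print(len("A".encode("cp932")))   # 1
```

As the last line shows, this particular code page still stores ASCII characters in one byte, so software has to inspect each byte to know whether it starts a pair.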

A double-byte character set (DBCS) is a character set that represents each character with 2 bytes.

The DBCS supports national languages that contain a large number of unique characters or symbols (the maximum number of characters that can be represented with 1 byte is 256 characters).

Examples of such languages include:

  • Japanese,
  • Korean,
  • and Chinese.

9.3 - Unicode

DBCS did not allow combining, say, Japanese and Chinese in the same data stream, because depending on the code page the same double-byte code points represent different characters in the different languages.

In order to allow for the storage of different languages in the same data stream, Unicode was created. This one “codepage” could originally represent about 65,000 characters; with the introduction of surrogates it can now address more than 1,000,000 code points (1,114,112 in total).
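The surrogate mechanism can be sketched as follows. The arithmetic is the standard UTF-16 rule; the function name `to_surrogates` and the sample code point are assumptions made for the example.

```python
MAX_CODE_POINT = 0x10FFFF
print(MAX_CODE_POINT + 1)      # 1114112 code points in total

def to_surrogates(code_point):
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low

high, low = to_surrogates(0x1F600)      # an emoji outside the first 65,536 code points
print(hex(high), hex(low))              # 0xd83d 0xde00

# Python's UTF-16 encoder produces the same two 16-bit units.
print(chr(0x1F600).encode("utf-16-be").hex())   # d83dde00
```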

10 - Problems of code pages

Unicode is strongly recommended in modern applications, but many applications or data files still depend on the legacy code pages. This can cause many problems:

  • Programs need to know what code page to use in order to display the contents of files correctly. If a program uses the wrong code page it may show text as mojibake.
  • The code page in use may differ between machines, so files created on one machine may be unreadable on another.
  • Data is often improperly tagged with the code page, or not tagged at all, making determination of the correct code page to read the data difficult.
  • Microsoft code pages differ to various degrees from some of the standards and from other vendors' implementations. This isn't a Microsoft issue per se, as it happens to all vendors, but the lack of consistency makes interoperability with other systems unreliable in some cases.
  • The use of code pages limits the set of characters that may be used.
  • Characters expressed in an unsupported code page may be converted to question marks (?) or other replacement characters, or to a simpler version (such as removing accents from a letter). In either case, the original character may be lost.
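Three of the problems listed above can be reproduced in a few lines. This is a sketch with an assumed sample string; the accent-stripping step uses Unicode normalization as one possible way to obtain a "simpler version".

```python
import unicodedata

text = "caf\u00e9"   # "cafe" with an acute accent on the e

# Wrong code page: UTF-8 bytes read as Cp1252 turn into mojibake.
mojibake = text.encode("utf-8").decode("cp1252")
print(len(text), len(mojibake))      # 4 5

# Unsupported characters: Greek letters become question marks in Cp1252.
lossy = "\u03b1\u03b2".encode("cp1252", errors="replace")
print(lossy)                         # b'??'

# Simplification: decompose the accented letter, then drop the accent.
simplified = unicodedata.normalize("NFKD", text).encode("ascii", errors="ignore")
print(simplified)                    # b'cafe'
```

In the last two cases the original characters cannot be recovered from the output.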

11 - Application Attribute

As a charset is mandatory when reading or writing text as binary data, the (character set|code page) is always an application attribute (on the client side but also on the server side).
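In application code this usually means passing the charset explicitly instead of relying on the machine default. A minimal sketch with Python's `open`, using a temporary file and an assumed Greek sample:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "greek.txt")   # hypothetical file name

    # The application states the charset when writing ...
    with open(path, "w", encoding="cp1253") as handle:
        handle.write("\u03b1\u03b2\u03b3")

    # ... the file itself only contains bytes ...
    with open(path, "rb") as handle:
        raw = handle.read()

    # ... and the reader must state the same charset to get the text back.
    with open(path, "r", encoding="cp1253") as handle:
        text = handle.read()

print(raw)                            # b'\xe1\xe2\xe3'
print(text == "\u03b1\u03b2\u03b3")   # True
```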

11.1 - Database

For instance when an application program connects to the database, the database manager determines the code page of the application.

11.2 - SAP

(national language support = NLS)

11.3 - Windows

11.3.1 - Application

Start > Control Panel > Regional and Language Options > Languages > Text service and Input Languages

11.3.2 - DOS

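To find the current console code page, issue the CHCP command in the Command Prompt window. The output below comes from a French-language Windows installation and reads "Active code page: 850".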

C:\Documents and Settings\Rixni>chcp
Page de codes active : 850

12 - File

Some editors have implemented a character set scanner that may tell you which character set a file was saved with.

For instance, in jEdit, see Utilities > Buffer Options.
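A very simple scanner can be sketched by trying candidate charsets in order and keeping the first one that decodes without error. This is only a heuristic and not how any particular editor works; the function name `guess_encoding` and the candidate list are assumptions.

```python
def guess_encoding(data, candidates=("ascii", "utf-8", "cp1252")):
    """Return the first candidate charset that decodes data, or None."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None

print(guess_encoding(b"abc"))                      # ascii
print(guess_encoding("\u00e9".encode("utf-8")))    # utf-8
print(guess_encoding(b"\xe9"))                     # cp1252
print(guess_encoding(b"\x81"))                     # None
```

A successful decode does not prove the guess is right: most byte sequences are valid in several single-byte code pages, so the order of the candidates matters.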

13 - Documentation / Reference

data/type/text/character_set.txt · Last modified: 2018/12/07 13:04 by gerardnico