Text - Character

1 - About

A character is an atomic unit of text as specified by ISO/IEC 10646:2000 [ISO/IEC 10646]. In programming languages it is categorized as a primitive data type.

The smallest component of written language that has semantic value; refers to the abstract meaning and/or shape …

A character is usually represented as a Unicode code point, an integer from 0 to 0x10FFFF (1,114,111). The values from 0 to 65535 (0xFFFF) cover only the Basic Multilingual Plane; supplementary code points lie above that range.
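The code point range can be checked from Java, which stores code points as int values. The following is a minimal sketch; the class name CodePointRange and its helper methods are illustrative names, not part of any library, while the Character calls are standard JDK methods.

```java
public class CodePointRange {
    /** True when the code point lies above the Basic Multilingual Plane (> 0xFFFF). */
    static boolean isSupplementary(int codePoint) {
        return Character.isSupplementaryCodePoint(codePoint);
    }

    /** Number of 16-bit char values Java needs to store the code point (1 or 2). */
    static int charsNeeded(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        // Highest valid Unicode code point
        System.out.printf("Max code point: U+%X%n", Character.MAX_CODE_POINT); // U+10FFFF
        // EN DASH sits in the Basic Multilingual Plane: one char is enough
        System.out.println(charsNeeded(0x2013));  // 1
        // A supplementary CJK ideograph needs a surrogate pair: two chars
        System.out.println(charsNeeded(0x2F81A)); // 2
    }
}
```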

Characters will not appear as intended unless you have an appropriate font (one that contains the corresponding glyph).

Characters are the basic unit of organization of encoded text.

3 - Example

A character can be any member of a set of characters, such as:

  • letters,
  • numbers,
  • symbols (mathematical),
  • ideograms,
  • logograms (from non-phonetic writing systems such as kanji),
  • etc…

For example, the following character set appears in several code pages:

  • 26 non-accented letters A through Z (A, B, C, … X, Y, Z)
  • 26 non-accented letters a through z (a, b, c, … x, y, z)
  • digits 0 through 9
  • special characters: . , : ; ? ( ) ' “ / - _ & + % * = < >

4 - Type/Category

5 - Management

5.1 - Show

Problem: Which character is – ?

Steps:

  • The character set is UTF-8, so the hexadecimal output is the UTF-8 encoding of the character.
echo $LANG
  • Dump the bytes of the character with hexdump.
echo – | hexdump -C
00000000  e2 80 93 0a                                       |....|
00000004
  • The UTF-8 hexadecimal of this character is e2 80 93. It corresponds to the Unicode character U+2013 - EN DASH. See Translation of a UTF-8 Multibyte sequence to Unicode - Example 2. The trailing 0a is the line feed that echo appends; it is not part of the character.
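The same lookup can be sketched in Java instead of the shell. This is only an illustration: the class ShowCharacter and its methods utf8Hex and describe are hypothetical names introduced here, while String.getBytes, String.codePointAt and Character.getName are standard JDK calls.

```java
import java.nio.charset.StandardCharsets;

public class ShowCharacter {
    /** Space-separated hex of the UTF-8 bytes of the text, like the hexdump column. */
    static String utf8Hex(String text) {
        StringBuilder hex = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            // Mask to 0..255 so negative bytes print as two hex digits
            hex.append(String.format("%02x", b & 0xFF));
        }
        return hex.toString();
    }

    /** "U+XXXX NAME" for the first code point of the text. */
    static String describe(String text) {
        int codePoint = text.codePointAt(0);
        return String.format("U+%04X %s", codePoint, Character.getName(codePoint));
    }

    public static void main(String[] args) {
        String unknown = "\u2013"; // the character under investigation
        System.out.println(utf8Hex(unknown));  // e2 80 93
        System.out.println(describe(unknown)); // U+2013 EN DASH
    }
}
```

Unlike the echo pipeline, no trailing 0a appears because no newline is added to the string before encoding.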

5.2 - Diff

To compare two characters, dump them with a hex tool such as `hexdump` on Unix, which outputs their bytes as hexadecimal digits.

Problem:

  • Are these two characters the same?
–
-

Steps:

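  • The character set is UTF-8, so the hexadecimal output is the UTF-8 encoding of each character.
echo $LANG
en_US.UTF-8
  • The UTF-8 hexadecimal of the first character is e2 80 93. This is the Unicode character U+2013 - EN DASH.
echo – | hexdump -C
00000000  e2 80 93 0a                                       |....|
00000004
  • The UTF-8 hexadecimal of the second character is 2d. This is the Unicode character U+002D - HYPHEN-MINUS. The two characters are therefore different, even though they look alike.
echo - | hexdump -C
00000000  2d 0a                                             |-.|
00000002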
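The comparison can also be made on code points rather than on encoded bytes. A minimal sketch, assuming both strings hold a single character; DiffCharacters, sameCharacter and codePointHex are illustrative names, not an existing API.

```java
public class DiffCharacters {
    /** True when both strings start with the same Unicode code point. */
    static boolean sameCharacter(String a, String b) {
        return a.codePointAt(0) == b.codePointAt(0);
    }

    /** Code point of the first character in U+XXXX notation. */
    static String codePointHex(String s) {
        return String.format("U+%04X", s.codePointAt(0));
    }

    public static void main(String[] args) {
        String first = "\u2013"; // looks like a dash
        String second = "-";     // looks like a dash too
        System.out.println(codePointHex(first));          // U+2013
        System.out.println(codePointHex(second));         // U+002D
        System.out.println(sameCharacter(first, second)); // false
    }
}
```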

6 - Java

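In Java, a code point is held in an int. Character.toChars(int) converts a code point to its UTF-16 char array, so Character.toChars(int)[0] yields the whole character as a single char only when the code point is at most 0xFFFF.

The Character methods that take an int accept every code point. For example, Character.isLetter(0x2F81A) returns true because the code point value represents a letter (a CJK ideograph).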
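The difference between a code point and a Java char shows up with a supplementary code point such as 0x2F81A. The sketch below uses only standard Character and String methods; the class name SupplementaryChars and the helper fromCodePoint are illustrative.

```java
public class SupplementaryChars {
    /** Builds a String from one code point, using a surrogate pair when needed. */
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        int cjk = 0x2F81A; // supplementary CJK ideograph
        char[] units = Character.toChars(cjk);

        System.out.println(units.length);                        // 2 (surrogate pair)
        System.out.println(Character.isHighSurrogate(units[0])); // true
        System.out.println(Character.isLetter(cjk));             // true
        // The first char alone is a surrogate, not a letter
        System.out.println(Character.isLetter(units[0]));        // false

        String s = fromCodePoint(cjk);
        System.out.println(s.length());                      // 2 char values
        System.out.println(s.codePointCount(0, s.length())); // 1 code point
    }
}
```

Taking element [0] of the char array is therefore safe only for code points in the Basic Multilingual Plane.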

7 - Documentation / Reference

text/character.txt · Last modified: 2017/01/20 21:38 by gerardnico