Regexp - Character Class

1 - About

A character class defines a set of permitted characters. It matches exactly one character from the input, provided that character belongs to the set.

3 - Syntax

[]

where:

  • [ starts the character class definition
  • ] ends the character class definition

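Inside the brackets, any character can be used, combined with the meta-characters described in the Meta section. The nested union and intersection (&&) forms in the table are supported only by some engines, such as Java.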

Construct        Matches                                       Operation
[abc]            a, b, or c                                    simple class
[^abc]           Any character except a, b, or c               negation
[a-zA-Z]         a through z or A through Z, inclusive         range
[a-d[m-p]]       a through d, or m through p: [a-dm-p]         union
[a-z&&[def]]     d, e, or f                                    intersection
[a-z&&[^bc]]     a through z, except for b and c: [ad-z]       subtraction
[a-z&&[^m-p]]    a through z, and not m through p: [a-lq-z]    subtraction
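
The constructs in the table can be tried out with a minimal sketch like the one below. It assumes Java's java.util.regex engine, which accepts the nested union and && intersection forms; the class name CharClassConstructs and the helper accepts are illustrative names, not part of any library.

```java
import java.util.regex.Pattern;

public class CharClassConstructs {
    // True when the whole input is exactly one character accepted by charClass.
    // (Hypothetical helper for this example only.)
    public static boolean accepts(String charClass, String input) {
        return Pattern.matches(charClass, input);
    }

    public static void main(String[] args) {
        System.out.println(accepts("[abc]", "b"));          // simple class: true
        System.out.println(accepts("[^abc]", "a"));         // negation: false
        System.out.println(accepts("[a-zA-Z]", "Q"));       // range: true
        System.out.println(accepts("[a-d[m-p]]", "n"));     // union: true
        System.out.println(accepts("[a-z&&[def]]", "a"));   // intersection: false
        System.out.println(accepts("[a-z&&[^bc]]", "b"));   // subtraction: false
        System.out.println(accepts("[a-z&&[^m-p]]", "q"));  // subtraction: true
    }
}
```

Engines without nested classes can usually express the same sets with the flattened forms given in the Matches column, for example [a-dm-p] instead of [a-d[m-p]].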

Other (POSIX constructs, used inside the brackets):

Construct             Description                                                                 Example
[..]                  Specifies one collation element, which can be a multicharacter element      [.ch.] in Spanish
[:characterClass:]    Specifies a character class; matches any character in that class (see POSIX) [:alpha:]
[==]                  Specifies an equivalence class                                              [=a=] matches all characters having base letter 'a'

4 - Meta

Meta-character   Description
\                General escape character.
^                Negates the class, but only if it is the first character.
                 Example: [^abc] matches any character other than a, b, or c.
                 Example: [^a-z] matches any single character that is not a lowercase letter from a to z.
-                Indicates a character range.
                 Example: [abc] matches a, b, or c. [a-z] specifies a range which matches any lowercase letter from a to z.
                 These forms can be mixed: [abcx-z] matches a, b, c, x, y, and z, as does [a-cx-z].
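
As a rough illustration of how the three meta-characters behave, the sketch below again assumes Java's java.util.regex; CharClassMeta and accepts are hypothetical names chosen for this example. Note that Java string literals need a doubled backslash to pass a single backslash to the regex engine.

```java
import java.util.regex.Pattern;

public class CharClassMeta {
    // True when the single character c is accepted by charClass.
    // (Hypothetical helper for this example only.)
    public static boolean accepts(String charClass, char c) {
        return Pattern.matches(charClass, String.valueOf(c));
    }

    public static void main(String[] args) {
        // ^ negates only when it is the first character of the class.
        System.out.println(accepts("[^abc]", 'd'));    // true
        System.out.println(accepts("[a^bc]", '^'));    // true: here ^ is a literal
        // \ escapes a meta-character so that it is taken literally.
        System.out.println(accepts("[\\^abc]", '^'));  // true
        System.out.println(accepts("[a\\-z]", '-'));   // true
        System.out.println(accepts("[a\\-z]", 'm'));   // false: no range any more
        // Single characters and ranges can be mixed.
        System.out.println(accepts("[abcx-z]", 'y'));  // true
        System.out.println(accepts("[a-cx-z]", 'y'));  // true
    }
}
```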

5 - Type

5.1 - POSIX

Because many character ranges depend on the chosen locale setting (for example, some locales order letters as abc…zABC…Z, while others order them as aAbBcC…zZ), the POSIX standard defines named classes of characters, shown in the following table:

POSIX class   ASCII equivalent                            Description
[:alnum:]     [A-Za-z0-9]                                 Alphanumeric characters
[:alpha:]     [A-Za-z]                                    Alphabetic characters
[:lower:]     [a-z]                                       Lowercase letters
[:upper:]     [A-Z]                                       Uppercase letters
[:blank:]     [ \t]                                       Space and tab
[:cntrl:]     [\x00-\x1F\x7F]                             Control characters
[:digit:]     [0-9]                                       Digits
[:graph:]     [\x21-\x7E]                                 Visible characters
[:print:]     [\x20-\x7E]                                 Visible characters and spaces
[:punct:]     [-!"#$%&'()*+,./:;<=>?@[\\\]^_`{|}~]        Punctuation characters
[:space:]     [ \t\r\n\v\f]                               Whitespace characters
[:xdigit:]    [A-Fa-f0-9]                                 Hexadecimal digits
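
The ASCII column can be checked empirically. The sketch below assumes Java, which does not accept the [:name:] spelling but exposes the same POSIX classes (US-ASCII only, by default) as \p{Alpha}, \p{Punct}, \p{XDigit}, and so on. PosixClasses and asciiMembers are hypothetical names used only here.

```java
import java.util.regex.Pattern;

public class PosixClasses {
    // Collects, in code-point order, every ASCII character matched by the class.
    // (Hypothetical helper for this example only.)
    public static String asciiMembers(String posixClass) {
        Pattern p = Pattern.compile(posixClass);
        StringBuilder members = new StringBuilder();
        for (char c = 0; c < 128; c++) {
            if (p.matcher(String.valueOf(c)).matches()) {
                members.append(c);
            }
        }
        return members.toString();
    }

    public static void main(String[] args) {
        System.out.println(asciiMembers("\\p{Digit}"));          // 0123456789
        System.out.println(asciiMembers("\\p{XDigit}"));         // 0123456789ABCDEFabcdef
        System.out.println(asciiMembers("\\p{Punct}").length()); // 32
        // POSIX classes can be combined with other items inside brackets.
        System.out.println(asciiMembers("[\\p{Digit}_]"));       // 0123456789_
    }
}
```

In engines that do support the POSIX spelling, the class name goes inside an enclosing bracket expression, for example [[:digit:]_].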

5.2 - Shorthand

Generic character type   Description                                                      Equivalent to (engine-dependent)
\d                       any decimal digit                                                [0-9]
\D                       any character that is not a decimal digit                        [^0-9]
\h                       any horizontal whitespace character
\H                       any character that is not a horizontal whitespace character
\s                       any whitespace character                                         [ \f\n\r\t\v\u00a0\u1680\u180e\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff]
\S                       any character that is not a whitespace character                 [^ \f\n\r\t\v\u00a0\u1680\u180e\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff]
\v                       any vertical whitespace character
\V                       any character that is not a vertical whitespace character
\w                       any "word" character                                             [A-Za-z0-9_]
\W                       any "non-word" character                                         [^A-Za-z0-9_]

These character type sequences can appear both inside and outside character classes. They each match one character of the appropriate type. If the current matching point is at the end of the subject string, all of them fail, since there is no character to match.
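
The behaviour described above can be sketched as follows, assuming Java's java.util.regex (version 8 or later, for \h and \v). ShorthandClasses and found are hypothetical names. The exact set matched by each shorthand varies between engines: in Java's default mode, for instance, \s covers only the ASCII whitespace characters, so it does not match a no-break space, whereas \h does.

```java
import java.util.regex.Pattern;

public class ShorthandClasses {
    // True when regex matches somewhere in input.
    // (Hypothetical helper for this example only.)
    public static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Outside and inside a character class.
        System.out.println(found("\\d", "x7"));       // true
        System.out.println(found("[\\d_]", "_"));     // true
        // At the end of the subject there is no character left to match,
        // so even a negated type such as \D fails.
        System.out.println(found("a\\D", "a"));       // false
        System.out.println(found("a\\D", "ab"));      // true
        // Engine-specific detail (Java default mode).
        System.out.println(found("\\h", "\u00a0"));   // true
        System.out.println(found("\\s", "\u00a0"));   // false
        System.out.println(found("\\v", "\n"));       // true
    }
}
```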

lang/regexp/class.txt · Last modified: 2017/10/23 16:25 by gerardnico