Regexp - Character Class (Character Set)

About

A character class defines the set of characters permitted at a single position of a match.

In a regular expression, it is also known as a character set.

It should not be confused with the character set (encoding) used to encode text into bits, although a character class can represent a whole character set.

For instance, if your regular expression engine supports it, you could represent all ASCII characters with:

[:ASCII:]

Syntax

A character class is written between square brackets:

[]

where:

[ starts the character class definition
] ends the character class definition

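Inside the brackets, literal characters can be mixed with metacharacters to build the following constructs. The union, intersection, and subtraction forms are engine-specific (Java supports them, for example), so check your engine before relying on them.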

Construct	Matches	Operation
[abc]	a, b, or c	simple class
[^abc]	Any character except a, b, or c	negation
[a-zA-Z]	a through z or A through Z, inclusive	range
[a-d[m-p]]	a through d, or m through p: [a-dm-p]	union
[a-z&&[def]]	d, e, or f	intersection
[a-z&&[^bc]]	a through z, except for b and c: [ad-z]	subtraction
[a-z&&[^m-p]]	a through z, and not m through p: [a-lq-z]	subtraction
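The constructs above can be tried out with a short sketch. It assumes Python's standard re module, which supports simple classes, negation, and ranges but not the nested union or && forms, so those rows are written with the flattened equivalents given in the table. The sample strings are made up for illustration.

```python
import re

# Simple class: any one of a, b, or c
simple = re.findall(r"[abc]", "abcdxyz")        # ['a', 'b', 'c']

# Negation: any character except a, b, or c
negated = re.findall(r"[^abc]", "abcd1")        # ['d', '1']

# Range: a through z or A through Z, inclusive
letters = re.findall(r"[a-zA-Z]", "a1B2")       # ['a', 'B']

# Union [a-d[m-p]], written in its flattened form [a-dm-p]
union = re.findall(r"[a-dm-p]", "aemqz")        # ['a', 'm']

# Intersection [a-z&&[def]], which reduces to [def]
intersection = re.findall(r"[def]", "abcdefg")  # ['d', 'e', 'f']

# Subtraction [a-z&&[^bc]], written in its flattened form [ad-z]
subtraction = re.findall(r"[ad-z]", "abcdz")    # ['a', 'd', 'z']
```

In an engine that does support the nested syntax, the original constructs should select the same characters as these flattened forms.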

Other constructs, used inside a POSIX bracket expression:

Construct	Description	Example
[..]	Specifies one collating element, which can be a multi-character element	[.ch.] in Spanish
[:characterClass:]	Specifies a named character class. It matches any character within that class.	[:alpha:] See posix
[==]	Specifies an equivalence class.	[=a=] matches all characters having the base letter 'a'.

Meta-character	Description	Example
\	general escape character
^	negates the class, but only when it is the first character	[^abc] matches any character other than a, b, or c. [^a-z] matches any single character that is not a lowercase letter from a to z.
-	indicates a character range	[a-z] specifies a range that matches any lowercase letter from a to z. Ranges and single characters can be mixed: [abcx-z] matches a, b, c, x, y, and z, as does [a-cx-z].
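The behavior of the three metacharacters can be checked with a small sketch, again assuming Python's re module. Other engines may treat an unescaped ] or a leading or trailing hyphen differently. The sample strings are illustrative only.

```python
import re

# Backslash escapes ] and \ so they are matched literally;
# ^ is literal because it is not first, - is literal because it is last
escaped = re.findall(r"[\]\\^-]", r"a]b\c^d-e")  # [']', '\\', '^', '-']

# ^ as the first character negates the class
negated = re.findall(r"[^a-z]", "ab1C")          # ['1', 'C']

# ^ anywhere else is an ordinary character
literal_caret = re.findall(r"[a^z]", "a^z1")     # ['a', '^', 'z']

# Mixed single characters and ranges select the same characters
mixed = re.findall(r"[abcx-z]", "abcdwxyz")
ranges = re.findall(r"[a-cx-z]", "abcdwxyz")
```

Both mixed and ranges contain a, b, c, x, y, and z, which matches the equivalence stated in the table.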

Type

POSIX

Because the characters covered by a range depend on the chosen locale (for example, some locales order letters as abc…zABC…Z, while others order them as aAbBcC…zZ), the POSIX standard defines named classes of characters, shown in the following table with their ASCII equivalents:

POSIX	ASCII	Description
[:alnum:]	[A-Za-z0-9]	Alphanumeric characters
[:alpha:]	[A-Za-z]	Alphabetic characters
[:lower:]	[a-z]	Lowercase letters
[:upper:]	[A-Z]	Uppercase letters
[:blank:]	[ \t]	Space and tab
[:cntrl:]	[\x00-\x1F\x7F]	Control characters
[:digit:]	[0-9]	Digits
[:graph:]	[\x21-\x7E]	Visible characters
[:print:]	[\x20-\x7E]	Visible characters and spaces
[:punct:]	[-!"#$%&'()*+,./:;<=>?@[\\\]^_`{|}~]	Punctuation characters
[:space:]	[ \t\r\n\v\f]	Whitespace characters
[:xdigit:]	[A-Fa-f0-9]	Hexadecimal digits
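Python's standard re module does not recognize the [:name:] syntax, so the following sketch emulates a few POSIX classes with ASCII-only bracket expressions. The POSIX_ASCII dictionary and the posix_findall helper are hypothetical names introduced for this illustration, and the emulation ignores locale.

```python
import re

# ASCII-only equivalents of a few POSIX classes (locale is ignored)
POSIX_ASCII = {
    "alnum": r"[A-Za-z0-9]",
    "alpha": r"[A-Za-z]",
    "blank": r"[ \t]",
    "digit": r"[0-9]",
    "space": r"[ \t\r\n\v\f]",
    "xdigit": r"[A-Fa-f0-9]",
}

def posix_findall(name, text):
    """Return every character of text that belongs to the named class."""
    return re.findall(POSIX_ASCII[name], text)

hex_digits = posix_findall("xdigit", "0x1F g")  # ['0', '1', 'F']
blanks = posix_findall("blank", "a b\tc\n")     # [' ', '\t']
spaces = posix_findall("space", "a b\tc\n")     # [' ', '\t', '\n']
```

The last two calls show the difference between [:blank:] (space and tab only) and [:space:] (all whitespace, including the newline).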

Unicode Set

Engines that support Unicode sets let you select characters by Unicode property, for example:

Unicode	Description
[:ASCII:]	the set of ASCII characters
[:Lowercase:]	the set of lowercase characters
[:Lowercase_Letter:]	the set of lowercase letters
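Python's standard re module has no Unicode property syntax, so the sketch below approximates two of these sets by other means. It assumes that [:Lowercase_Letter:] corresponds to the Unicode general category Ll and that [:ASCII:] covers code points 0 to 127. The function names are hypothetical.

```python
import re
import unicodedata

def lowercase_letters(text):
    """Characters whose Unicode general category is Ll (Lowercase_Letter)."""
    return [ch for ch in text if unicodedata.category(ch) == "Ll"]

def ascii_chars(text):
    """Characters in the ASCII range, expressed as a character class."""
    return re.findall(r"[\x00-\x7F]", text)

lower = lowercase_letters("aBé1ß")  # ['a', 'é', 'ß']
ascii_only = ascii_chars("aé1")     # ['a', '1']
```

An engine with native Unicode set support would express the same selections directly in the pattern.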

Shorthand

The shorthand syntax begins with a backslash followed by a letter. See Regular Expression - Backslash Generic Character Class (shorthand)
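As a brief illustration, the sketch below uses three shorthands that Python's re module provides: \d, \w, and \s. The set of shorthands and their exact meaning vary between engines. The re.ASCII flag restricts \w to ASCII word characters here.

```python
import re

# \d matches a digit
digits = re.findall(r"\d", "a1 b2")                # ['1', '2']

# Shorthands can also be used inside a bracketed class
digit_or_space = re.findall(r"[\d\s]", "a1 b")     # ['1', ' ']

# \w matches a word character; with re.ASCII this is [a-zA-Z0-9_]
word_chars = re.findall(r"\w", "a_1-é", re.ASCII)  # ['a', '_', '1']
```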