Regexp - Word Characters

1 - About

A “word” character (\w) is:

  • any letter
  • or any digit
  • or the underscore character

that is, any character which can be part of a Perl "word".

For ASCII input, the equivalent character class is: [0-9A-Za-z_]
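As a minimal sketch of this equivalence, the snippet below compares \w with the explicit class using Python's re module. The re.ASCII flag is an assumption made for this illustration: without it, Python 3 also treats non-ASCII letters and digits as word characters. The names WORD, EXPLICIT and sample are chosen here and are not part of the original text.

```python
import re

# Assumption: re.ASCII limits \w to ASCII letters, digits and underscore.
WORD = re.compile(r"\w", re.ASCII)
EXPLICIT = re.compile(r"[0-9A-Za-z_]")

sample = "a Z 5 _ - ! é"

word_hits = WORD.findall(sample)          # characters matched by \w
explicit_hits = EXPLICIT.findall(sample)  # characters matched by the class

print(word_hits)      # ['a', 'Z', '5', '_']
print(explicit_hits)  # ['a', 'Z', '5', '_']
```

Both patterns skip the space, the hyphen, the exclamation mark and, in ASCII mode, the accented letter.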

3 - Definition of letters and digits versus Character Set

The definition of letters and digits is controlled by character tables, and may vary if locale-specific matching is taking place.

For example, in the “fr” (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w.
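The effect of the character tables can be sketched in Python, with one caveat: this is an analogous illustration, not the "fr" locale mechanism itself. For str patterns, Python 3 uses Unicode tables by default, and the re.ASCII flag switches to ASCII-only tables. The variable names are hypothetical.

```python
import re

accented = "é"  # U+00E9, code point 233, above 128

# Default for str patterns: Unicode tables, so accented letters are word characters.
unicode_match = re.fullmatch(r"\w", accented)

# ASCII-only tables: the same character is no longer a word character.
ascii_match = re.fullmatch(r"\w", accented, re.ASCII)

print(unicode_match is not None)  # True
print(ascii_match is not None)    # False
```

The same pattern gives a different answer depending on which definition of "letter" the engine is configured to use.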

4 - Boundary

4.1 - Word boundary

A word boundary \b is a zero-width assertion that matches at a position if:

  • there is a \w character on one side and a \W (non-word) character on the other
  • or there is a \w character on one side and the other side is the beginning or end of the string.

Example 1:

  • The regex: \bdog\b
  • Input string to search: The dog plays in the yard
  • Result: Found the text dog starting at index 4 and ending at index 7.

Example 2:

  • The regex: \bdog\b
  • Input string to search: The doggie plays in the yard.
  • Result: No match found.
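
The word-boundary behaviour described above can be reproduced with a short Python sketch. The helper find_span is a hypothetical name introduced for this illustration. It returns the start index and the exclusive end index of the first match, which corresponds to the "starting at index 4 and ending at index 7" wording.

```python
import re

def find_span(pattern, text):
    """Return (start, end) of the first match, or None (hypothetical helper)."""
    match = re.search(pattern, text)
    return match.span() if match else None

# "dog" is surrounded by spaces, so \b matches on both sides.
whole_word = find_span(r"\bdog\b", "The dog plays in the yard")

# In "doggie", "dog" is followed by "g" (a word character), so the trailing \b fails.
inside_word = find_span(r"\bdog\b", "The doggie plays in the yard.")

print(whole_word)   # (4, 7)
print(inside_word)  # None
```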

4.2 - Non-word boundary

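A non-word boundary \B is the negation of \b: a zero-width assertion that matches at every position where \b does not, for example between two \w characters or between two \W characters.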

Example 1:

  • The regex: \bdog\B
  • Input string to search: The dog plays in the yard.
  • No match found.

Example 2:

  • The regex: \bdog\B
  • Input string to search: The doggie plays in the yard.
  • Result: Found the text dog starting at index 4 and ending at index 7.
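
The two non-word-boundary examples can be checked with a similar Python sketch. The helper first_span is again a hypothetical name. The pattern requires a word boundary before "dog" and no word boundary after it.

```python
import re

def first_span(pattern, text):
    """Return (start, end) of the first match, or None (hypothetical helper)."""
    match = re.search(pattern, text)
    return match.span() if match else None

# After "dog" comes a space, which is a word boundary, so \B fails.
standalone = first_span(r"\bdog\B", "The dog plays in the yard.")

# After "dog" comes "g", so there is no boundary and \B succeeds.
prefix = first_span(r"\bdog\B", "The doggie plays in the yard.")

print(standalone)  # None
print(prefix)      # (4, 7)
```

The results are the mirror image of \bdog\b: the pattern now matches "dog" only when it is the start of a longer word.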

5 - Documentation / Reference