Antlr - Lexer Rule (Token names|Lexical Rule)

About

Lexer rules (also called token rules or lexical rules) in ANTLR.

They are the rules that define tokens.

They are generally written in the combined grammar file but may also be written in a separate lexer grammar file.

Each lexer rule either matches or does not, so every lexer rule expression is a boolean expression.
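As a rough sketch of the second option, lexer rules can live in their own lexer grammar file and be pulled into a parser grammar through the tokenVocab option. The file and rule names below (ExprLexer, ExprParser, sum) are hypothetical and only serve the illustration.

```antlr
// File ExprLexer.g4 (hypothetical name): contains lexer rules only
lexer grammar ExprLexer;

INT  : [0-9]+ ;
PLUS : '+' ;
WS   : [ \t\r\n]+ -> skip ;

// File ExprParser.g4 (hypothetical name): reuses the tokens defined above
parser grammar ExprParser;
options { tokenVocab = ExprLexer; }

sum : INT (PLUS INT)* ;
```

In a combined grammar (a file starting with `grammar Name;`), the same lexer rules would simply sit next to the parser rules.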

Syntax

Token

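A token is a terminal symbol (a leaf of the parse tree).

TokenName : pattern -> lexerCommand ;

where:

TokenName is the name of the token (it must begin with an uppercase letter)
pattern is the character pattern to match
-> lexerCommand is optional and names a lexer command such as skip, more, type(x), channel(x), mode(x), pushMode(x) or popMode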
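A minimal sketch of lexer commands attached to token rules, assuming a hypothetical grammar named CommandSketch. It uses the skip command and the channel command with ANTLR's predefined HIDDEN channel.

```antlr
lexer grammar CommandSketch;   // hypothetical grammar name

// skip: the matched text is thrown away and no token is produced
WS           : [ \t\r\n]+ -> skip ;

// channel(HIDDEN): a token is produced, but on a channel the parser does not read
LINE_COMMENT : '//' ~[\r\n]* -> channel(HIDDEN) ;

// no command: the token is passed to the parser on the default channel
ID           : [a-zA-Z]+ ;
```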

Fragment

A fragment is just a named pattern. It does not produce a token, but it can be used in token definitions to improve readability.

fragment Name : pattern ;

A fragment is a special type of lexer rule that does not result in the creation of tokens. Fragments exist only to introduce named sub-expressions that simplify the grammar.
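The following sketch shows fragments used as building blocks of a token rule. The grammar and rule names (FragmentSketch, FLOAT, DIGIT, EXPONENT) are hypothetical.

```antlr
lexer grammar FragmentSketch;   // hypothetical grammar name

// Only FLOAT is emitted as a token; DIGIT and EXPONENT are helper patterns.
FLOAT : DIGIT+ '.' DIGIT+ EXPONENT? ;

fragment DIGIT    : [0-9] ;
fragment EXPONENT : ('e' | 'E') ('+' | '-')? DIGIT+ ;
```

With this grammar, the input 1.5e-3 yields a single FLOAT token; the lexer never returns a DIGIT or EXPONENT token on its own.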

Catch All

A catch-all rule is a lexical rule:

placed at the end of the lexical grammar
that catches every character that did not match any other rule.

The name is often ANY.

Example:

ANY : . ;
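To illustrate why the position matters, here is a small sketch with a hypothetical grammar named CatchAllSketch. Because ANY matches exactly one character, it ties with any other rule that also matches a single character, and a tie goes to the rule defined first.

```antlr
lexer grammar CatchAllSketch;   // hypothetical grammar name

ID  : [a-zA-Z]+ ;
WS  : [ \t\r\n]+ -> skip ;

// Must stay last: for the input "a", ID and ANY both match one character,
// and ID wins only because it is defined before ANY.
ANY : . ;

// Input "ab#" produces the tokens ID("ab") and ANY("#").
```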

Example

ID  :   [a-zA-Z]+ ;      // match one or more lowercase or uppercase letters from A to Z
INT :   [0-9]+ ;         // match a series of digits from 0 to 9
DIGITS : [0-9]+ ;        // same pattern as INT; INT wins the tie because it is defined first
NEWLINE : '\r'? '\n' ;   // match/return newlines to parser (end-statement signal)
WS  :   [ \t]+ -> skip ; // toss out spaces and tabs
HEX : ('%' [a-fA-F0-9] [a-fA-F0-9])+ ; // one or more '%' followed by two hexadecimal digits
STRING : ([a-zA-Z~] | HEX) ([a-zA-Z0-9.-] | HEX)* ; // a lexer rule can use other lexer rules
TEXT : ~[\])]+ ; // match everything except ] and ) (~ negates the set; \] is an escaped ])

Syntax

Basically the same syntax as parser rules, except that lexer rules:

cannot have arguments,
cannot have return values or local variables.

Lexer rule names (also known as token names) must begin with an uppercase letter, whereas parser rule names begin with a lowercase letter.

A lexer rule can be associated with:

a single literal string expected in the input
a selection of literal strings that may be found
a sequence of specific characters and ranges of characters, using quantifiers that are either greedy (?, * and +) or lazy (??, *? and +?)
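The three cases above could look like the following sketch. The grammar and token names are hypothetical and chosen only for illustration.

```antlr
lexer grammar AssociationSketch;   // hypothetical grammar name

IF      : 'if' ;                    // a single literal string
BOOL    : 'true' | 'false' ;        // a selection of literal strings
NUMBER  : [0-9]+ ('.' [0-9]+)? ;    // characters and ranges with greedy quantifiers (+ and ?)
COMMENT : '/*' .*? '*/' -> skip ;   // lazy quantifier: stops at the first closing */
```

The lazy `.*?` in COMMENT matters: a greedy `.*` would run on to the last `*/` in the input and swallow everything between two separate comments.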

A lexer rule:

cannot be written as a regular expression (patterns use ANTLR's own notation, which only resembles regular expression syntax).
can refer to other lexer rules.

Order of Precedence

The lexer chooses the rule that matches the most characters. If there is a tie, the rule defined first in the grammar is used.
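A common case where both parts of this rule show up is a keyword next to an identifier rule. The sketch below assumes a hypothetical grammar named PrecedenceSketch.

```antlr
lexer grammar PrecedenceSketch;   // hypothetical grammar name

IF : 'if' ;                  // defined before ID on purpose
ID : [a-z]+ ;
WS : [ \t\r\n]+ -> skip ;

// Input "if"  : IF and ID both match 2 characters, so the tie goes to IF (defined first).
// Input "ifx" : ID matches 3 characters and IF only 2, so the longer match ID wins.
```

If ID were declared before IF, the input "if" would always come out as ID and the IF token could never be produced.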

Documentation / Reference

https://github.com/antlr/antlr4/blob/master/doc/lexer-rules.md