Lexical Analysis - Lexer (Lexical Analyzer, Tokenizer, Scanner)
1 - About
A lexer is also known as a:
- Lexical analyzer
- Lexical tokenizer
- Lexical scanner
A lexer defines how the contents of a file are broken into tokens.
A lexer reads an input character or byte stream (i.e. characters, binary data, etc.), divides it into tokens, and generates a token stream as output. This process is known as:
- lexical analysis,
- tokenization,
- or simply scanning.
A lexer is a stateful stream generator (i.e. its position in the source file is saved). Every time it is advanced, it returns the next token in the source. Normally, the final token emitted by the lexer is an EOF token, and it will repeatedly return that same EOF token whenever it is called again.
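This stateful behaviour can be sketched as follows. This is a minimal, hypothetical lexer (not tied to any particular library) that emits one-character tokens, just to show the saved position and the repeated EOF:

```python
class Lexer:
    """A minimal stateful lexer sketch (illustrative, not a real library)."""

    def __init__(self, source):
        self.source = source
        self.pos = 0  # current position in the source: the lexer's state

    def next_token(self):
        # Skip whitespace between tokens.
        while self.pos < len(self.source) and self.source[self.pos].isspace():
            self.pos += 1
        if self.pos >= len(self.source):
            # Once the input is exhausted, EOF is returned on every call.
            return ("EOF", "")
        ch = self.source[self.pos]
        self.pos += 1
        return ("CHAR", ch)  # one-character tokens keep the sketch short

lexer = Lexer("a b")
print(lexer.next_token())  # ('CHAR', 'a')
print(lexer.next_token())  # ('CHAR', 'b')
print(lexer.next_token())  # ('EOF', '')
print(lexer.next_token())  # ('EOF', '') again: the lexer stays at EOF
```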
Lexers are generally quite simple and do nothing with the tokens themselves. Most of the complexity is deferred to later stages such as parsing.
For example, a typical lexical analyzer recognizes parentheses as tokens, but does nothing to ensure that each “(” is matched with a “)”. This syntax analysis is left to the parser.
Lexers can be generated by automated tools called compiler-compilers.
3 - Example
Consider the following expression:
sum = 3 + 2;
Tokenized into the following symbol table:

| Lexeme | Token type          |
|--------|---------------------|
| sum    | Identifier          |
| =      | Assignment operator |
| 3      | Integer literal     |
| +      | Addition operator   |
| 2      | Integer literal     |
| ;      | End of statement    |
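A tokenizer for this expression can be sketched with regular expressions. The token names below are illustrative (they are not from any standard), and one named group per token type lets the lexer categorize each match:

```python
import re

# One (name, pattern) pair per token type; SKIP swallows whitespace.
TOKEN_SPEC = [
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("INTEGER",    r"\d+"),
    ("ASSIGN",     r"="),
    ("PLUS",       r"\+"),
    ("END",        r";"),
    ("SKIP",       r"\s+"),
]
PATTERN = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Yield (token type, lexeme) pairs for the input text."""
    for m in PATTERN.finditer(text):
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())

print(list(tokenize("sum = 3 + 2;")))
# [('IDENTIFIER', 'sum'), ('ASSIGN', '='), ('INTEGER', '3'),
#  ('PLUS', '+'), ('INTEGER', '2'), ('END', ';')]
```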
4 - Steps
Tokenizing is generally done in a single pass.
- Scan the input and split it into lexemes (sequences of characters).
- Categorize each lexeme into a token (a symbol of the vocabulary of the language). If the lexer finds an invalid lexeme, it reports an error.