Language - Compiler compilers or (lexer|parser) generators

1 - About

You define a grammar for your language in a grammar file and run the compiler-compiler to generate a parser for this language.

A compiler-compiler reads a language description file called a grammar and generates at least one of the following (and quite often both):

  • A Lexer: This reads an input character or byte stream (e.g. text or binary data), divides it into tokens using the patterns you specify, and generates a token stream as output.
  • A Parser: This reads a token stream (normally generated by a lexer), matches phrases in your language via the rules (patterns) you specify, and typically performs some semantic action for each phrase (or sub-phrase) matched. Each match could invoke a custom action, write some text via StringTemplate, or generate an Abstract Syntax Tree for additional processing. (See the sketch just after this list.)
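
As a rough illustration of this division of work, the sketch below models, in Java, the token stream that flows from the lexer to the parser. All type names (TokenType, Token, Lexer) are hypothetical stand-ins, not the types produced by any particular generator.

  // A minimal model of the pipeline: the lexer turns characters into tokens,
  // and the parser pulls those tokens one by one while matching grammar rules.
  enum TokenType { NUMBER, PLUS, STAR, EOF }

  record Token(TokenType type, String text) { }

  interface Lexer {
      // The parser calls this repeatedly; the EOF token marks the end of the input.
      Token nextToken();
  }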

Using a compiler-compiler makes the build process a bit more involved. You first run the compiler-compiler on the grammar file to generate the parser, then compile both your custom code and the generated parser into the overall program, and finally run (and test) it.

A compiler-compiler splits the work into a lexer and a parser. The lexer reads characters and chunks them into tokens as defined by the Tokens section of the grammar file.
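
To make the chunking concrete, here is a minimal hand-written tokenizer that plays the role of a generated lexer, reusing the Token, TokenType and Lexer types sketched above. Each regex alternative is the analogue of one token definition in the Tokens section; the patterns chosen here (numbers, + and *) are only an assumption for the example.

  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  // Hand-written stand-in for a generated lexer: the input is chunked from left
  // to right, and each regex group corresponds to one token definition.
  class TinyLexer implements Lexer {
      private static final Pattern TOKEN = Pattern.compile("\\s*(?:(\\d+)|(\\+)|(\\*))");
      private final String input;
      private final Matcher matcher;
      private int pos = 0;

      TinyLexer(String input) {
          this.input = input.strip();
          this.matcher = TOKEN.matcher(this.input);
      }

      @Override
      public Token nextToken() {
          if (pos >= input.length()) return new Token(TokenType.EOF, "");
          if (!matcher.find(pos) || matcher.start() != pos) {
              throw new IllegalStateException("unexpected character at position " + pos);
          }
          pos = matcher.end();
          if (matcher.group(1) != null) return new Token(TokenType.NUMBER, matcher.group(1));
          if (matcher.group(2) != null) return new Token(TokenType.PLUS, "+");
          return new Token(TokenType.STAR, "*");
      }
  }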

You need a compiler-compiler whenever you need to translate something written in one language into another language.

A grammar (a context-free grammar, technically speaking) says what type of children a node can have.

A typical use case is implementing a DSL (Domain Specific Language).

Often in compilers, the parser outputs a tree representing the structure of the program. This tree then serves as an input to components of the compiler responsible for analysis and code generation.
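
For instance, the tree for a tiny expression language could be modelled with node classes like the ones below (the names are illustrative, not taken from any real compiler); later compiler phases then walk these nodes to do their work.

  // Illustrative AST node types: the grammar dictates which children each kind
  // of node may have (e.g. a binary operation always has two sub-expressions).
  sealed interface Expr permits NumberLiteral, BinaryOp { }

  record NumberLiteral(int value) implements Expr { }

  record BinaryOp(String operator, Expr left, Expr right) implements Expr { }

  final class Evaluator {
      // A later phase (analysis or code generation) typically walks the tree like this.
      static int evaluate(Expr e) {
          if (e instanceof NumberLiteral n) return n.value();
          BinaryOp b = (BinaryOp) e;
          int left = evaluate(b.left());
          int right = evaluate(b.right());
          return b.operator().equals("+") ? left + right : left * right;
      }
  }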

The lexical analyser and the parser are also responsible for generating error messages if the input does not conform to the lexical or syntactic rules of the language.

Parsers and lexical analysers tend to be long and complex components. A software engineer writing an efficient lexical analyser or parser directly in Java has to carefully consider the interactions between rules.

It is not up to the lexical analyser to determine whether its sequence of tokens is sensible or not; that is usually left to the parser. The parser will detect the error and will not request any more tokens from the lexical analyser after that.

3 - Method

To construct the tree, we have many choices for selecting nodes so that the leaves correspond to the input sequence seen so far. If we construct the tree from the leaves upwards, it is called a bottom-up approach; this approach usually uses a method of construction called LR parsing. If we build the tree top-down, the method usually used is called LL parsing.
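
A hand-written recursive descent parser is the classic illustration of the top-down (LL) approach: each grammar rule becomes a method, and the next token decides which rule to expand. The sketch below reuses the Token, TokenType, Lexer and AST types from the earlier sketches and handles only a tiny expression grammar with + and *; it is an assumed example, not the output of any particular generator.

  // Top-down (LL) parsing by recursive descent for the grammar:
  //   expr : term ( '+' term )* ;
  //   term : NUMBER ( '*' NUMBER )* ;
  class TinyParser {
      private final Lexer lexer;
      private Token lookahead;

      TinyParser(Lexer lexer) {
          this.lexer = lexer;
          this.lookahead = lexer.nextToken();
      }

      Expr parseExpr() {                          // expr : term ('+' term)*
          Expr left = parseTerm();
          while (lookahead.type() == TokenType.PLUS) {
              consume(TokenType.PLUS);
              left = new BinaryOp("+", left, parseTerm());
          }
          return left;
      }

      private Expr parseTerm() {                  // term : NUMBER ('*' NUMBER)*
          Expr left = parseNumber();
          while (lookahead.type() == TokenType.STAR) {
              consume(TokenType.STAR);
              left = new BinaryOp("*", left, parseNumber());
          }
          return left;
      }

      private Expr parseNumber() {
          Token t = consume(TokenType.NUMBER);
          return new NumberLiteral(Integer.parseInt(t.text()));
      }

      // This is also where syntax errors surface: an unexpected token stops the parse.
      private Token consume(TokenType expected) {
          if (lookahead.type() != expected) {
              throw new IllegalStateException("expected " + expected + " but found " + lookahead.type());
          }
          Token current = lookahead;
          lookahead = lexer.nextToken();
          return current;
      }
  }

For example, new TinyParser(new TinyLexer("1 + 2 * 3")).parseExpr() builds BinaryOp("+", NumberLiteral(1), BinaryOp("*", NumberLiteral(2), NumberLiteral(3))), giving * its usual higher precedence. A bottom-up (LR) parser would accept the same grammar but build the tree from the leaves upwards.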

4 - Tools

ANTLR, JavaCC, SableCC and Lex are not themselves parsers or lexical analyzers but generators. This means that they output lexical analyzers and parsers according to a specification that they read in from a file (the grammar).

4.1 - REx Parser Generator

4.2 - Lex / Yacc

Lex_(software)

Installation - Cygwin includes lex and yacc

4.3 - ANTLR

See ANTLR

4.4 - JavaCC

JavaCC is a tool used in many applications. It is much like ANTLR, with a few features differing here and there; however, it only generates Java code.

Used by: BeanShell, Calcite

4.5 - SableCC

SableCC is a compiler-compiler tool for the Java environment. It handles LALR(1) grammars (for those who remember their grammar categories). In other words, it generates bottom-up parsers (unlike JavaCC and ANTLR, which are top-down).

SableCC takes an unconventional and interesting approach, using an object-oriented methodology for constructing parsers. This results in easy-to-maintain code for the generated parser. However, there are some performance issues at this point in time. It generates output in both C++ and Java.

5 - Documentation / Reference
