Language - Formal (Grammar|Syntax|Structure|Language)

1 - About

Grammar in the context of a compiler. Ie how to define it.

See the grammar section if you are interessed on the structure.

In processing computer languages, semantic processing generally comes after syntactic processing, but in some cases semantic processing is necessary for complete syntactic analysis, and these are done together or concurrently.

3 - Different Grammar Type

4 - Level

Computer language syntax is generally distinguished into three levels:

  • Words – the lexical level, determining how characters form tokens. In a compiler, this is the lexer works to discover Words.
  • Phrases – the grammar level, narrowly speaking, determining how tokens form phrases. In a compiler, this is the parser works to discover Words.
  • Context – determining what objects or variables names refer to, if types are valid, etc. In a compiler, this is the work of the semantic analysis to add context.

5 - Hierarchy

6 - Example

6.1 - Notation

For example, a correct sentence always consists of a subject followed by a predicate.

This rule can be expressed with formula's that are called:

  • syntax rules,
  • syntax productions,
  • or syntactic equations.

This above sentence can be described by the following formula:

sentence = subject predicate.

where:

  • Subject and predicate are syntactic classes

If we add to this formula the two further formulas

subject   = "John" | "Mary".
predicate = "eats" | "talks".

where:

  • the symbol | is “or”

We define herewith exactly four possible sentences, namely

John eats         Mary eats
John talks        Mary talks

6.2 - Shorthand notation

S = AB.
A = "a" | "b".
B = "c" | "d".
L = {ac, ad, bc, bd}

where:

  • The set L of sentences which can be generated is called the language. It consists of only four sentences.

Nesting is a property particularly important in the definition of structured languages.

6.3 - Graphical representation

The graphical representations are:

6.3.1 - Syntax Diagram / Railroad

A Railroad is made of:

  • a main diagram
  • a set of syntax diagrams.

Each diagram has an entry point and an end point.

  • Terminals are represented by round boxes
  • Non-terminals are represented by square boxes.

It's describes possible paths going through other non-terminals and terminals. To belong to the language, an expression must describe a path starting form the main diagram.

Library:

  1. Rail Road Generator: Javascript + svg (not open source)
  2. Railroad-diagrams Javascript + svg
  3. Weltraumschaf/ebnf Php + Image, plugin dokuwiki

Example:

6.4 - Language

A language is defined by the following symbol type:

Symbol Type Also Called Cardinality Definition Example
terminal
term set
(vocabulary)
These symbols are said to be terminal,
because they cannot be substituted by any other symbols.
John, Mary, eats, talks.
non-terminal identifier set They denote syntactic (classes|category) and can be substituted. Sentence, Subject and predicate
syntactic equations productions set These define the possible substitutions of non-terminal symbols.
An equation is specified for each non-terminal symbol.
start 1 This symbol is a non-terminal symbol. Sentence

Syntactic equations - Backus Naur Form (BNF), Extended BNF (EBNF) and ABNF

A language is, therefore, the set of sequences of terminal symbols which, starting with the start symbol, can be generated by repeated application of syntactic equations, that is, substitutions.

The idea of defining languages and their grammar with mathematical precision goes back to N. Chomsky. It became clear, however, that the presented, simple scheme of substitution rules was insufficient to represent the complexity of spoken languages. This remained true even after the formalisms were considerably expanded. In contrast, this work proved extremely fruitful for the theory of programming languages and mathematical formalisms. In passing, we emphasize that this rigour applied to the syntax only, not to the semantics.

The use of the Chomsky formalism is also responsible for the term programming language, because programming languages seemed to exhibit a structure similar to spoken languages. This term is rather unfortunate, because a programming language is not spoken, and therefore is not a language in the true sense of the word. Formalism or formal notation would have been more appropriate terms.

6.5 - ambiguous language

The structure of a (well formed) sentence is relevant, because it is instrumental in establishing the sentence's meaning. Owing to the syntactic structure, the individual parts of the sentence and their meaning can be recognized independently, and together they yield the meaning of the whole.

expression = number | expression "+" expression.
number = "1" | "2" | "3" | "4" .

“4 + 2 + 1” is a well-formed expression. This expression may even be derived in several ways, each corresponding to a different structural trees, as shown below:

The two differing structures may also be expressed with appropriate parentheses, respectively :

  • (4 + 2) + 1 = 7
  • 4 + (2 + 1) = 7

Fortunately, thanks to the associativity of addition both yield the same value 7.

But this need not always be the case. The mere use of subtraction in place of addition yields a counter example which shows that the two differing structures also yield a different interpretation and result:

  • (4 - 2) - 1 = 1
  • 4 - (2 - 1) = 3.

The example illustrates two facts:

  • Interpretation of sentences always rests on the recognition of their syntactic structure.
  • Every sentence must have a single structure in order to be unambiguous.

We call a syntactic class ambiguous if it can be attributed several structures. A language is ambiguous if it contains at least one ambiguous syntactic class (construct).

6.6 - context-free languages

7 - Documentation

code/compiler/grammar.txt · Last modified: 2018/10/25 21:40 by gerardnico