LEP-002: Parser and Grammar


Created:Saturday, August 20, 2022
Author:David Delassus
Category:Compiler Architecture
Status:FINAL

Abstract

This LEP specifies the Letlang parser design and the language used to define the grammar.

Rationale

There are many options available to write a parser:

  • using a parser generator and a grammar
  • using parser combinators
  • writing a parser by hand

When using parser generators, there are also many options available:

  • LR(k) parser generators
  • LL(k) parser generators
  • LALR(k) parser generators
  • PEG parser generators

Specification

Parsing Algorithm

The Letlang compiler first transforms a character stream (source code) into a token stream, using the logos [1] Rust crate.

Any failure to match a sequence of characters to a token results in a ParseError, with information about the location of the culprit in the character stream (filename, line and column).

If successful, the token also contains information about its corresponding location in the character stream.

The compiler then transforms the token stream into an Abstract Syntax Tree, using a PEG parser generator, with the rust-peg [2] Rust crate.

Using this crate, the compiler defines a formal context-free grammar to match a sequence of tokens (a grammar rule), and build a node in the AST.

Failure to match a grammar rule results in a ParseError with information about the location of the culprit in the character stream.

If successful, the AST node contains information about its corresponding location in the character stream.

Lexer

The lexer defines, using regular expressions, what are the expected tokens:

  • keywords
  • symbols (brackets, separators, …)
  • operators
  • literal values (booleans, numbers, strings, atoms)

The lexer skips non-significant whitespaces (space character, tabulation, newlines) and ignores comments (starting with the # symbol).

NB: Keywords token have a higher priority than identifiers. Therefore, they cannot be used as a Letlang symbol name (modules, effects, classes, functions and variables).

Grammar

The grammar associates a rule to sequences of tokens, producing an AST node.

Precedence climbing of expressions and type expressions are built into the grammar.

The grammar defines 4 kinds of high-level rules:

  • statements: located at the root of a module, they are either:
    • module definition
    • imports
    • effect/class/function declaration
  • propositions: located in a code-block, they are either:
    • inline typing constraints (let expressions)
    • expressions
  • type expressions: located in type annotations, and generic specializations
  • expression: everything else

Rejected Ideas

LR(1) parser generator

Initially, the parser was implemented with the lalrpop [3] Rust crate.

LR(1) grammars requires that every rule can be determined by looking only 1 token ahead.

This means any ambiguous rule is forbidden, as the parser generator cannot determine what rule to match.

For example:

<ambiguity> :=
    (<rule-1> | <rule-2>)+
    ;

<rule-1> :=
    "a"
    ;

<rule-2> :=
    "a" "a"
    ;

This grammar could be parsed as either:

  • <rule-1> <rule-1> <rule-1>
  • <rule-1> <rule-2>
  • <rule-2> <rule-1>

A LR(1) grammar would fail to dismiss the ambiguity, while PEG parsers will use the first rule that successfuly match, or fail if no rule matched.

References

ReferenceTitleLink
1logos Rust cratehttps://docs.rs/logos/latest/logos/
2rust-peg Rust cratehttps://docs.rs/peg/latest/peg/
3lalrpop Rust cratehttp://lalrpop.github.io/lalrpop/