Parser and Grammar
Parser and Grammar
Parser and Grammar
Created: | Saturday, August 20, 2022 |
---|---|
Author: | David Delassus |
Category: | Compiler Architecture |
Status: | FINAL |
This LEP specifies the Letlang parser design and the language used to define the grammar.
There are many options available to write a parser:
When using parser generators, there are also many options available:
The Letlang compiler first transforms a character stream (source code) into
a token stream, using the logos
[1] Rust crate.
Any failure to match a sequence of characters to a token results in a
ParseError
, with information about the location of the culprit in the
character stream (filename, line and column).
If successful, the token also contains information about its corresponding location in the character stream.
The compiler then transforms the token stream into an Abstract Syntax Tree,
using a PEG parser generator, with the rust-peg
[2]
Rust crate.
Using this crate, the compiler defines a formal context-free grammar to match a sequence of tokens (a grammar rule), and build a node in the AST.
Failure to match a grammar rule results in a ParseError
with information
about the location of the culprit in the character stream.
If successful, the AST node contains information about its corresponding location in the character stream.
The lexer defines, using regular expressions, what are the expected tokens:
The lexer skips non-significant whitespaces (space character, tabulation,
newlines) and ignores comments (starting with the #
symbol).
NB: Keywords token have a higher priority than identifiers. Therefore, they cannot be used as a Letlang symbol name (modules, effects, classes, functions and variables).
The grammar associates a rule to sequences of tokens, producing an AST node.
Precedence climbing of expressions and type expressions are built into the grammar.
The grammar defines 4 kinds of high-level rules:
let
expressions)Initially, the parser was implemented with the lalrpop
[3]
Rust crate.
LR(1) grammars requires that every rule can be determined by looking only 1 token ahead.
This means any ambiguous rule is forbidden, as the parser generator cannot determine what rule to match.
For example:
<ambiguity> :=
(<rule-1> | <rule-2>)+
;
<rule-1> :=
"a"
;
<rule-2> :=
"a" "a"
;
This grammar could be parsed as either:
<rule-1> <rule-1> <rule-1>
<rule-1> <rule-2>
<rule-2> <rule-1>
A LR(1) grammar would fail to dismiss the ambiguity, while PEG parsers will use the first rule that successfuly match, or fail if no rule matched.
Reference | Title | Link |
---|---|---|
1 | logos Rust crate | https://docs.rs/logos/latest/logos/ |
2 | rust-peg Rust crate | https://docs.rs/peg/latest/peg/ |
3 | lalrpop Rust crate | http://lalrpop.github.io/lalrpop/ |