Choose BNF syntax for describing the grammar #342

Closed
@stasm

Currently, spec/message.ebnf is written using the so-called W3C EBNF. It's one of a few BNF variations, commonly used in W3C and Unicode (I think?).

One of the nice-to-have reasons for picking it was that it's also supported by REx, an online parser generator, as well as the Railroad Diagram Generator, both by Gunther Rademacher. Having good tool support and being able to immediately test grammar ideas was beneficial to the rapid-development style of the syntax design process that we went through last year.

That said, I haven't been able to find other tools that support this variant of BNF out of the box. This makes me a bit uncomfortable. Ideally, we would be able to define the grammar in a way that allows us to validate an arbitrary text snippet as MF2 syntax, parse it into a concrete syntax tree (CST), visualize both the rules and the resulting tree, and even generate random strings that match the grammar for the purpose of fuzz testing. I'm a bit disappointed by the state of the tooling in this regard.
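To make the validation and fuzzing wishes concrete, here is a minimal sketch in Python. It assumes a hypothetical toy grammar stored as a plain dictionary (it is not the MF2 grammar, and the names `GRAMMAR`, `generate`, `match` and `is_valid` are made up for illustration). It shows how a single machine-readable grammar definition could drive both a random string generator and a validator.

```python
import random

# Hypothetical toy grammar, NOT the MF2 grammar. Each rule maps to a list of
# alternatives; each alternative is a sequence of symbols. Symbols that are
# keys of GRAMMAR are nonterminals, anything else is a literal terminal.
GRAMMAR = {
    "message": [["{", "text", "}"]],
    "text": [["word"], ["word", " ", "text"]],
    "word": [["hello"], ["world"]],
}

def generate(symbol, rng, depth=0, max_depth=8):
    """Return a random string derived from `symbol` (useful as fuzz input)."""
    if symbol not in GRAMMAR:
        return symbol  # terminal: emit it verbatim
    alternatives = GRAMMAR[symbol]
    if depth >= max_depth:
        # Past the depth limit, take the shortest alternative to force termination.
        alternatives = [min(alternatives, key=len)]
    chosen = rng.choice(alternatives)
    return "".join(generate(s, rng, depth + 1, max_depth) for s in chosen)

def match(symbol, text, pos=0):
    """Return every position where a derivation of `symbol` starting at `pos` can end."""
    if symbol not in GRAMMAR:
        return {pos + len(symbol)} if text.startswith(symbol, pos) else set()
    ends = set()
    for alternative in GRAMMAR[symbol]:
        positions = {pos}
        for part in alternative:
            positions = {e for p in positions for e in match(part, text, p)}
        ends |= positions
    return ends

def is_valid(text):
    """True if the whole of `text` is derivable from the `message` rule."""
    return len(text) in match("message", text)

rng = random.Random(0)
sample = generate("message", rng)
print(sample, is_valid(sample))
```

The recognizer above is a naive backtracking one and would loop forever on left-recursive rules; a real tool would need a proper parsing algorithm and would also build the CST rather than only answering yes or no.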

  • The original BNF, which lacks some quality-of-life improvements of its successors, like syntax for optional symbols or repetitions.
  • ABNF (Augmented BNF), which writes alternatives as foo / bar and adds optional symbols ([foo]) and repetitions (n*m(foo)).
  • Despite the above, some W3C standards (notably XML) use a different notation, which is commonly referred to as the W3C EBNF.
    • It uses | for alternatives, and Kleene operators for optionals and repetitions (?, *, +).
    • This variant is also used in some of Unicode's documents.
  • There's also another thing called EBNF defined by ISO/IEC 14977, sometimes also called the Wirth syntax notation. It uses | for alternatives, brackets for optional symbols ([foo]), and curly braces for repetitions ({foo}).
  • There exist other variations, too, each with a different set of "minor modifications". E.g. a legacy search server in Windows used to have its own EBNF. Interestingly, it defined a special syntax for denoting a required space, which is something I wish were easily possible in other EBNFs (see whitespace in the EBNF #340).
  • Lastly, there are BNF variants oriented towards parser generation, like the ones used by Yacc and Bison. They include snippets of C code ("actions") which are used to control the shape of the parse tree.
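To compare the notations side by side, the sketch below renders the same two rules in ABNF, the W3C EBNF and the ISO EBNF. The mini-AST, the rule names (`list`, `item`, `sign`) and the helper functions are hypothetical and exist only for this illustration. It also ignores finer points, such as ABNF treating double-quoted literals as case-insensitive.

```python
# Hypothetical mini-AST for grammar expressions, for illustration only:
# ("lit", str), ("ref", str), ("seq", [nodes]), ("alt", [nodes]),
# ("opt", node), ("star", node)

def render(node, style, nested=False):
    """Render an expression in "abnf", "w3c" or "iso" notation."""
    kind, value = node
    if kind == "lit":
        return '"' + value + '"'
    if kind == "ref":
        return value
    if kind == "seq":
        sep = " , " if style == "iso" else " "  # ISO EBNF concatenates with commas
        return sep.join(render(n, style, nested=True) for n in value)
    if kind == "alt":
        sep = " / " if style == "abnf" else " | "
        body = sep.join(render(n, style, nested=True) for n in value)
        return "(" + body + ")" if nested else body
    if kind == "opt":
        inner = render(value, style)
        return "(" + inner + ")?" if style == "w3c" else "[" + inner + "]"
    if kind == "star":
        inner = render(value, style)
        return {"abnf": "*(" + inner + ")",
                "w3c": "(" + inner + ")*",
                "iso": "{" + inner + "}"}[style]
    raise ValueError("unknown node kind: " + kind)

def rule(name, node, style):
    """Render a full rule, including each notation's definition symbol."""
    body = render(node, style)
    if style == "w3c":
        return name + " ::= " + body
    if style == "iso":
        return name + " = " + body + " ;"
    return name + " = " + body

LIST = ("seq", [("ref", "item"),
                ("star", ("seq", [("lit", ","), ("ref", "item")]))])
SIGN = ("opt", ("alt", [("lit", "+"), ("lit", "-")]))

for style in ("abnf", "w3c", "iso"):
    print(rule("list", LIST, style))
    print(rule("sign", SIGN, style))
# abnf: list = item *("," item)        sign = ["+" / "-"]
# w3c:  list ::= item ("," item)*      sign ::= ("+" | "-")?
# iso:  list = item , {"," , item} ;   sign = ["+" | "-"] ;
```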

Outside the realm of context-free grammars (CFGs), it looks like parsing expression grammars (PEGs) are also a popular choice for defining grammars of programming languages. E.g. Python switched to a PEG in PEP 617.
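One practical difference worth keeping in mind when weighing PEGs against CFG-based notations is that PEG choice is ordered: the first alternative that matches wins, and later ones are not tried at that position. The sketch below illustrates this with literal-only alternatives. The function names are hypothetical, and this is a simplification, not how any particular parser generator is implemented.

```python
def peg_choice(alternatives, text, pos):
    """PEG ordered choice: commit to the first alternative that matches."""
    for literal in alternatives:
        if text.startswith(literal, pos):
            return pos + len(literal)
    return None  # no alternative matched

def cfg_choice(alternatives, text, pos):
    """CFG alternation: every matching alternative is a candidate end position."""
    return {pos + len(lit) for lit in alternatives if text.startswith(lit, pos)}

def peg_matches_all(alternatives, text):
    return peg_choice(alternatives, text, 0) == len(text)

def cfg_matches_all(alternatives, text):
    return len(text) in cfg_choice(alternatives, text, 0)

# With the alternatives "a" and "ab" in that order, the PEG reading commits
# to "a" and never consumes the whole input "ab"; the CFG reading accepts it.
print(peg_matches_all(["a", "ab"], "ab"))  # False
print(cfg_matches_all(["a", "ab"], "ab"))  # True
print(peg_matches_all(["ab", "a"], "ab"))  # True: order matters in a PEG
```

This means a grammar cannot always be moved between a CFG notation and a PEG by only swapping the alternation symbol; the order of alternatives may need to be reviewed as well.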

There's also Tree-sitter, developed by GitHub, which uses parser combinators written in JavaScript to define grammars, from which it can then generate parsers.

I'm opening this issue to discuss what requirements we have for the formal grammar of MessageFormat and to choose one of the available formats.
