Skip to content

Choose BNF syntax for describing the grammar #342

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
stasm opened this issue Feb 13, 2023 · 6 comments · Fixed by #347
Closed

Choose BNF syntax for describing the grammar #342

stasm opened this issue Feb 13, 2023 · 6 comments · Fixed by #347

Comments

@stasm
Copy link
Collaborator

stasm commented Feb 13, 2023

Currently, spec/message.ebnf is written using the so-called W3C EBNF. It's one of a few BNF variations, commonly used in W3C and Unicode (I think?).

One of the nice-to-have reasons for picking it was that it's also supported by REx, an online parser generator, as well as the Railroad Diagram Generator, both by Gunther Rademacher. Having good tool support and being able to immediately test grammar ideas was beneficial to the rapid development-style of the syntax design process that we went through last year.

That said, I haven't been able to find other tools that supports this variant of BNF out of the box. This makes me a bit uncomfortable. Ideally, we would be able to define the grammar in a way that allows an arbitrary text snippet to be validated as MF2 syntax, parse it into a concrete syntax tree (CST), visualize both the rules and such tree, and even generate random strings that match the grammar, for the purpose of fuzz testing. I'm a bit disappointed by the state of the tooling in this regard.

  • The original BNF which doesn't have some quality-of-life improvements of its successors, like syntax for alternatives or repetitions.
  • ABNF (Augmented BNF) which adds alternatives (foo / bar), optional symbols ([foo]), and repetitions (n*m(foo)).
  • Despite the above, some W3C standards (notably XML) use a different notation, which is commonly referred to as the W3C EBNF.
    • It uses | for alternatives, and Kleene operators for optionals and repetitions (?, *, +).
    • This variant is also used in some of Unicode's documents.
  • There's also another thing called EBNF defined by ISO/IEC 14997, sometimes also called the Wirth syntax notation. It uses | for alternatives, brackets for optional symbols ([foo]), and curly braces for repetitions ({foo}).
  • There exist other variations, too, each with a different set of "minor modifications". E.g. a legacy search server in Windows used to have its own EBNF. Interestingly, it defined a special syntax for denoting a requiring space, which is something I wish was easily possible in other EBNFs (see whitespace in the EBNF #340).
  • Lastly, there are BNF variants oriented towards parser generation, like the ones used by Yacc and Bison. They include snippets of C code ("actions") which are used to control the shape of the parse tree.

Outside the realm of context-free grammars (CFGs), it looks like parser expression grammars (PEGs) are also a popular choice for defining grammars of programming languages. E.g. Python switched to a PEG in PEP 617.

There's also Tree-sitter developed by GitHub, which uses parser combinators written in JavaScript to define grammars, from which it can then generate parsers.

I'm opening this issue to discuss what requirements we have for the formal grammar of MessageFormat and to choose one of the available formats.

@gibson042
Copy link
Collaborator

gibson042 commented Feb 13, 2023

Possibly related: UTS #35 is not consistent about "…BNF"

RFC 5234 + RFC 7405 ABNF or W3C XML Notation are both reasonable choices, although it's worth noting that the latter has no mechanism for bounded repetition such as the current alphanum{3,8} or equivalent RFC ABNF 3*8 alphanum.

@mihnita
Copy link
Collaborator

mihnita commented Feb 13, 2023

supported by REx, an online parser generator

Note that it is also available as (Java) source, to run offline: https://www.bottlecaps.de/rex/REx.java

The main page says

Command line client
Use REx.java instead of this form for invoking REx from a command shell.

@stasm
Copy link
Collaborator Author

stasm commented Feb 13, 2023

Unfortunately, that file is just a CLI tool which makes an HTTP request to the remote server. There's no logic in the file itself.

@stasm
Copy link
Collaborator Author

stasm commented Feb 13, 2023

I'm currently leaning towards choosing ABNF. It's very well defined thanks to RFC 5234 and RFC 7405, and it looks like there are multiple tools in Java, Python, C, and JavaScript which claim to follow the RFC, which is promising.

I also found a fuzzing tool for it: https://www.quut.com/abnfgen/.

@alerque
Copy link
Contributor

alerque commented Feb 14, 2023

Just a couple links to throw in because this issue has a good summary of {,A,B}BNF variants and their usages.

  • You linked the Railroad Diagram generator, but that same domain has a useful grammar conversion tool: https://www.bottlecaps.de/convert/

  • There are even more EBNF derivatives out there, for example Lark. The motivation for Lark and many others seems to be downstream tooling trying to automatically do things with grammars often find the {,A,B}BNF deficient for what they want to accomplish and so they clean them up or extend them a little bit to work for their particular target use case.

@stasm
Copy link
Collaborator Author

stasm commented Feb 14, 2023

This is great, thanks for the pointer. The converter is one-way, to W3C EBNF, and one of the supported input grammars is ABNF. I was just able to test it with my WIP of the ABNF rewrite and it worked well. This means we can use ABNF and, when neeeded, convert to W3C EBNF and still use REx and the Railroad Diagram generator.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants