Add a new grammar renderer #1787

Merged
merged 38 commits into from
Apr 15, 2025

ehuss
Contributor

@ehuss ehuss commented Apr 10, 2025

This introduces a new grammar renderer. Instead of trying to write the grammar in markdown/html hybrid, this introduces a new syntax that is parsed by the mdbook-spec plugin. This grammar is then converted into markdown/html hybrid, and also to railroad diagrams.

There are a lot of changes here (and some can be split into separate PRs if desired). A general overview of what to see here:

  • Grammar rules are now written inside a code block (instead of a blockquote). The syntax is pretty similar to the old syntax with various small changes. See the docs/grammar.md file for a complete description.
  • There is now a summary chapter which shows the entire grammar all on one page.
  • The grammar is parsed by mdbook-spec/src/grammar/parser.rs into an internal representation.
  • The internal representation is converted to markdown in mdbook-spec/src/grammar/render_markdown.rs, and railroad diagrams in mdbook-spec/src/grammar/render_railroad.rs.
    • The railroad diagrams are generated using the railroad crate.
    • There is a toggle button to show/hide the railroad diagrams. It uses localStorage to keep that state sticky.
  • The basic definitions and driver in the mdbook plugin are in mdbook-spec/src/grammar.rs. There are several pieces here:
    • The internal representation.
    • Code to load the grammar from the code blocks inside the chapters.
    • Some validation.
    • Code that will replace the code block with the rendered output.
    • Code for handling the summary chapter.
  • All nonterminals are now linked to the rule definition.
  • The text may now link to grammar rules by just putting them in brackets like [FunctionParameters]. Link definitions are automatically added to every page.
  • Some rules were added or changed to accommodate the new renderer. I think all changes are put into separate commits to help with reviewing.
  • Various misc fixes, see the individual commits.
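
The parse-then-render pipeline described in the list above can be pictured with a small sketch. The `Expr` enum and the `render` function here are hypothetical stand-ins, not the actual types in `mdbook-spec/src/grammar.rs`; they only illustrate how an internal representation might be turned into text where every nonterminal is written in brackets so that it can link to its rule.

```rust
/// Hypothetical, simplified grammar expression. The real mdbook-spec
/// representation is richer and is not reproduced here.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Nonterminal(String),
    Terminal(String),
    Optional(Box<Expr>),
    Repeat(Box<Expr>),     // e*
    RepeatPlus(Box<Expr>), // e+
    Sequence(Vec<Expr>),
    Alt(Vec<Expr>),
}

/// Render an expression as text. Nonterminals become `[Name]`, which is
/// the bracket form the PR description says is auto-linked.
fn render(e: &Expr) -> String {
    match e {
        Expr::Nonterminal(n) => format!("[{n}]"),
        Expr::Terminal(t) => format!("`{t}`"),
        Expr::Optional(inner) => format!("{}?", render_atom(inner)),
        Expr::Repeat(inner) => format!("{}*", render_atom(inner)),
        Expr::RepeatPlus(inner) => format!("{}+", render_atom(inner)),
        Expr::Sequence(items) => items
            .iter()
            // Alternations inside a sequence need parentheses.
            .map(|i| if matches!(i, Expr::Alt(_)) { render_atom(i) } else { render(i) })
            .collect::<Vec<_>>()
            .join(" "),
        Expr::Alt(items) => items.iter().map(render).collect::<Vec<_>>().join(" | "),
    }
}

/// Render an operand of a suffix operator, parenthesizing compound forms.
fn render_atom(e: &Expr) -> String {
    match e {
        Expr::Sequence(_) | Expr::Alt(_) => format!("( {} )", render(e)),
        _ => render(e),
    }
}

fn main() {
    let rule = Expr::Sequence(vec![
        Expr::Terminal("fn".into()),
        Expr::Nonterminal("IDENTIFIER".into()),
        Expr::Terminal("(".into()),
        Expr::Optional(Box::new(Expr::Nonterminal("FunctionParameters".into()))),
        Expr::Terminal(")".into()),
    ]);
    println!("{}", render(&rule));
}
```

A second renderer over the same tree (for example one that builds railroad nodes) would follow the same `match` shape, which is presumably why the PR keeps parsing and rendering in separate modules.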

I'd like to thank @lukaslueg for creating the railroad library which made this possible.

Closes #221
Closes #398
Closes #596
Closes #1513
Closes #1677

ehuss added 18 commits April 10, 2025 14:31

  • Just fixing some small consistency and spacing mistakes.
  • This rule was misnamed, colliding with the existing CfgAttrAttribute.
  • This renames IsolatedCR to CR. I felt like it wasn't exactly necessary
    since we have rewritten things so that it is clear that there is an
    input transformation which resolves this (`input.crlf`). We also never
    really defined what it meant.

    I also felt like there was room for confusion. For example, an input
    containing `CR CR LF LF` would get normalized to `CR LF`. The `CR` there
    is not isolated.
  • This removes all backslash escaped characters. This helps to avoid
    confusing similarities with a literal backslash followed by a character
    versus the interpreted escaped character.
  • I don't exactly know why this was placed there, but we operate under the
    assumption that all lexical characters immediately follow one another.
  • This introduces a new terminal kind that I'm calling a "prose" which
    describes what the terminal is. This is inspired by the IETF format
    which uses angle brackets to describe terminals in English.
  • The grammar almost always uses lowercase, so let's standardize on that.
  • This helps to standardize how suffixes are written. Normally they do not
    use parentheses, and visually I don't think they are entirely necessary.
  • These two nonterminals were using the wrong name for the productions for
    BlockExpression and LiteralExpression.
  • This changes the keyword listings so that they are just lists instead of
    lexer rules. We never used the named rules, and I don't foresee us ever
    doing that. Also, the `IDENTIFIER_OR_KEYWORD` rule meant that we never
    needed to explicitly identify these keywords as lexer tokens.

    This helps avoid problems when building the grammar graph for missing
    connections.
  • Per our style, edition differences are supposed to be separated out into
    an edition block.
  • These were defined in prose below, but defining them here allows us to
    easily refer and link to them.
  • This is intended to help define what a "token" is via the grammar (and
    to fill a missing hole in our token definition).

    I waffled on how to define delimiters, whether they should be separate
    somehow. In practice I think it should be fine to clump them all
    together. This mainly only matters for TokenTree which already excludes
    the delimiters.
  • This adds a grammar rule that collects all the reserved token forms into
    a single production rule so that we can define what a "token" is by
    referring to this.
  • This defines a Token in the grammar so that we can easily refer to it
    (and to make it easier to see what all the tokens are).
  • We no longer represent characters via escape sequences. These can be
    confused with the literal two bytes of backslash followed by a
    character. See the "common productions" list for how these are now
    referred to.
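
The input transformation that the IsolatedCR commit message refers to (`input.crlf`) can be illustrated with a minimal sketch. It assumes the normalization is a single left-to-right pass that replaces each CR LF pair with one LF; `normalize_crlf` is a hypothetical name, not a function from the repository.

```rust
/// Replace every CR LF pair with a single LF in one left-to-right pass.
/// Assumption: the pass is not repeated on its own output.
fn normalize_crlf(input: &str) -> String {
    input.replace("\r\n", "\n")
}

fn main() {
    // The second CR pairs with the LF and is consumed. The first CR
    // survives, and in the output it is followed by LF, so calling it
    // "isolated" would be misleading.
    assert_eq!(normalize_crlf("\r\r\n"), "\r\n");
    // Ordinary Windows line endings collapse to LF.
    assert_eq!(normalize_crlf("a\r\nb"), "a\nb");
}
```
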
@rustbot rustbot added the S-waiting-on-review Status: The marked PR is awaiting review from a maintainer label Apr 10, 2025
@ehuss ehuss force-pushed the railroad-grammar branch from a8be867 to 4a8e44f Compare April 11, 2025 01:23
@lukaslueg
Contributor

railroad upstream here.

The railroad codebase hasn't seen a lot of love with respect to graphical layout. Suggestions are welcome.

AFAICS there are cases where the grammar diverges from its graphical representation with respect to repeated elements. In the two examples below, the diagram only allows for at least two consecutive Statement (... two consecutive Expression), while the grammar requires "at least one".

[Screenshot 2025-04-11 at 14:15:55]
[Screenshot 2025-04-11 at 14:16:43]

@ia0

ia0 commented Apr 11, 2025

The railroad codebase hasn't seen a lot of love with respect to graphical layout. Suggestions are welcome.

From the live demo I can see that *-repeated elements are in theory printed like this:

   .------->-------.
   |               |
->-+-+--[ foo ]--+-+--->-
     |           |
     '-----<-----'

What about doing it like this?

->-+------>------+->-
   |             |
   '-<-[ foo ]-<-'

This uses one less path and can be concatenated with a previous foo for +-repeat:

->-[ foo ]->-+------>------+->-
             |             |
             '-<-[ foo ]-<-'

The main problem is that foo is somehow to be read backwards, which may confuse people at first.

@lukaslueg
Contributor

With respect to *-Elements, both examples (1 and 2) are technically valid. See the Zero or more table-constraints block in the create-table-stmt example, which demonstrates the second case.
Also see the One or more column-definitions-example, which should cover the +-Element case in example 3.

It's possible to implement dyn Node downstream to build more specialized primitives for certain situations. For instance, one might want to cook up a graphical representation for the "any character except ..."-case. Upstream might also provide them, if the need arises.

@ia0

ia0 commented Apr 11, 2025

I see, so that's already supported and just a matter of generating the proper diagram downstream.

ehuss added 6 commits April 11, 2025 08:53

  • This adds an extension to mdbook-spec that will parse code blocks in a
    BNF-style grammar into a rendered format, both as markdown and as
    railroad diagrams.
  • This adds the hooks to toggle the visibility of the railroad grammar.
    The status is stored in localStorage to keep it sticky.
  • This fixes it so that rule links work correctly if there is more than
    one space in a reference definition.
@lukaslueg
Contributor

lukaslueg commented Apr 14, 2025

AFAICS there are some problems left in 1214d68. I'm not sure on how to proceed here, since this PR is already quite large, and not just about eyecandy-diagrams. Ping me if you need my two cents.

[Screenshot 2025-04-14 at 12:54:20]

@traviscross
Contributor

traviscross commented Apr 14, 2025

The conflicting directions one would be resolved by #1787 (comment). The UNICODE_ESCAPE one is actually the correct grammar. E.g.:

fn main() {
    let x: &str = "\u{a____________________________________}";
    println!("_{x}_");
}

Playground link

(On INNER_LINE_DOC, were you meaning to point out some problem by highlighting it?)

@lukaslueg
Contributor

The UNICODE_ESCAPE one is actually the correct grammar. E.g.:

👀 I was actually too lazy to check, sorry for the confusion. [The syntax is somewhat hilarious?!]

(On INNER_LINE_DOC, were you meaning to point out some problem by highlighting it?)

On INNER_LINE_DOC, I was highlighting the fact that reading direction - graphically indicated by the arrows - is correct (green), while in INNER_BLOCK_DOC reading direction gets inverted on char [CHAR]-branch (red); as a mental image: "two trains would collide head-on in the red sections".

We track the "roots" in our grammar -- those productions that aren't
used in any other production.  We want to report when a new root
appears or when something that's expected to be a root no longer is
one.  However, we were reporting the latter case as the former instead
of reporting it separately as intended.  Let's fix that.
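
The root check described in this commit message can be sketched as two set differences that are reported separately. Everything below is a hypothetical model (a grammar as a map from production name to the names it references); `find_roots`, `check_roots`, and `RootReport` are illustrative names, not the mdbook-spec code.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical grammar model: production name -> names it references.
type Grammar = BTreeMap<String, Vec<String>>;

/// Convenience constructor for small examples.
fn grammar_from(pairs: Vec<(&str, Vec<&str>)>) -> Grammar {
    pairs
        .into_iter()
        .map(|(name, refs)| {
            let refs = refs.into_iter().map(String::from).collect::<Vec<String>>();
            (name.to_string(), refs)
        })
        .collect()
}

/// Roots are productions that are not used in any *other* production.
fn find_roots(grammar: &Grammar) -> BTreeSet<String> {
    let mut used = BTreeSet::new();
    for (name, refs) in grammar {
        for r in refs {
            if r != name {
                used.insert(r.clone());
            }
        }
    }
    grammar.keys().filter(|k| !used.contains(*k)).cloned().collect()
}

/// The two kinds of mismatch, kept apart so each gets its own message.
struct RootReport {
    /// Roots found in the grammar that were not expected.
    unexpected: Vec<String>,
    /// Expected roots that are no longer roots.
    missing: Vec<String>,
}

fn check_roots(actual: &BTreeSet<String>, expected: &BTreeSet<String>) -> RootReport {
    RootReport {
        unexpected: actual.difference(expected).cloned().collect(),
        missing: expected.difference(actual).cloned().collect(),
    }
}

fn main() {
    let grammar = grammar_from(vec![
        ("Crate", vec!["Item"]),
        ("Item", vec!["Function"]),
        ("Function", vec![]),
        ("Orphan", vec![]),
    ]);
    let expected: BTreeSet<String> =
        ["Crate", "Token"].iter().map(|s| s.to_string()).collect();
    let report = check_roots(&find_roots(&grammar), &expected);
    println!("new roots: {:?}", report.unexpected);
    println!("no longer roots: {:?}", report.missing);
}
```
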
traviscross added a commit to ehuss/reference that referenced this pull request Apr 14, 2025
There are two ways to render a "zero or more" (i.e. `*`) repeat.  One
is to put nothing on the main forward line and to put the pattern on
the recurrent edge, and the other is to put the pattern on the main
forward line and to have an empty recurrent edge and an empty bypass
edge.

That is, for the latter, we can think of `thing*` as `(thing+)?`.

Doing it that latter way means an additional edge, but it buys us
something big in return, which is that it keeps all the patterns going
in the forward direction.  Doing it the other way means the patterns
have to be reversed so as to put them underneath on that recurrent
edge, and it means that readers then have to read them right to left.

Reversing the elements also causes a bug in some diagrams where the
lines end up running in opposing directions and so the trains crash
into each other.  See:

- rust-lang#1787 (comment)

Keeping things in the forward direction avoids this problem.

In this commit, we'll leave in place all the infrastructure for
reversing the elements though it is no longer used.  We can of course
pull this out later.
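
The rewrite of `thing*` into `(thing+)?` described in the commit message above can be sketched as a small tree transformation. The `Expr` type and `lower_star` function are hypothetical and only illustrate the idea; they are not the code in `render_railroad.rs`, and no `railroad` crate API is used here.

```rust
/// Hypothetical expression tree with the three repeat forms.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Item(String),
    Optional(Box<Expr>),   // e?
    ZeroOrMore(Box<Expr>), // e*
    OneOrMore(Box<Expr>),  // e+
    Sequence(Vec<Expr>),
}

/// Rewrite every `e*` as `(e+)?` so that the pattern always sits on the
/// forward line and never has to be drawn reversed on the recurrent edge.
fn lower_star(e: Expr) -> Expr {
    match e {
        Expr::ZeroOrMore(inner) => {
            let plus = Expr::OneOrMore(Box::new(lower_star(*inner)));
            Expr::Optional(Box::new(plus))
        }
        Expr::Optional(inner) => Expr::Optional(Box::new(lower_star(*inner))),
        Expr::OneOrMore(inner) => Expr::OneOrMore(Box::new(lower_star(*inner))),
        Expr::Sequence(items) => Expr::Sequence(items.into_iter().map(lower_star).collect()),
        item @ Expr::Item(_) => item,
    }
}

fn main() {
    let star = Expr::ZeroOrMore(Box::new(Expr::Item("thing".into())));
    println!("{:?}", lower_star(star));
}
```

The cost is the one extra bypass edge that the commit message mentions; the benefit is that no element is ever read right to left.
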
@traviscross
Contributor

I've pushed up a set of commits. I had originally planned to merge this first and do these separately, but they're somewhat intertwined with fixing issues that we probably should fix here, so perhaps it's best to look at these now.

@ehuss ehuss force-pushed the railroad-grammar branch from a2515e4 to bb5862e Compare April 14, 2025 21:18
We check that the list of grammar "roots" -- that is, productions that
are not used in any other production -- is what we expect it to be.

We had hard coded this list of roots in `mdbook-spec`.  Let's instead
add a way to specify this in our syntax for productions by prefixing
the production with `@root`.
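
As a rough illustration of the `@root` marker, the sketch below splits a production header into a root flag and a name. It assumes a header of the form `@root Name ->` or `Name ->`; that shape and the `parse_header` function are assumptions for illustration, and the authoritative syntax is whatever docs/grammar.md and the real parser define.

```rust
/// Split a production header such as "@root Crate ->" into
/// `(is_root, name)`. Returns `None` if the line has no `->` or the
/// name is not a plain identifier. Hypothetical helper.
fn parse_header(line: &str) -> Option<(bool, &str)> {
    let line = line.trim();
    let (is_root, rest) = match line.strip_prefix("@root") {
        // Require whitespace after the marker so "@rootFoo" is rejected.
        Some(rest) if rest.starts_with(char::is_whitespace) => (true, rest.trim_start()),
        _ => (false, line),
    };
    let (name, _expression) = rest.split_once("->")?;
    let name = name.trim();
    let is_ident = !name.is_empty()
        && name.chars().all(|c| c.is_ascii_alphanumeric() || c == '_');
    if is_ident { Some((is_root, name)) } else { None }
}

fn main() {
    println!("{:?}", parse_header("@root Crate ->"));
    println!("{:?}", parse_header("Item -> OuterAttribute* VisItem"));
}
```
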
When reviewing a production in the grammar, one often wants to quickly
find the corresponding railroad diagram, and when reviewing a railroad
diagram, one often wants to quickly find the corresponding production
in the grammar.

Let's make this easy by linking each production in the grammar to the
corresponding railroad diagram, and from the name of each railroad
diagram to the corresponding production in the grammar.

When clicking on a production in the grammar, we'll automatically
display the railroad diagrams if those are not already displayed.
We can save a line by replacing this `match` with a `let-else`, so
let's do that.
There are two ways to render a "zero or more" (i.e. `*`) repeat.  One
is to put nothing on the main forward line and to put the pattern on
the recurrent edge, and the other is to put the pattern on the main
forward line and to have an empty recurrent edge and an empty bypass
edge.

That is, for the latter, we can think of `thing*` as `(thing+)?`.

Doing it that latter way means an additional edge, but it buys us
something big in return, which is that it keeps all the patterns going
in the forward direction.  Doing it the other way means the patterns
have to be reversed so as to put them underneath on that recurrent
edge, and it means that readers then have to read them right to left.

Reversing the elements also causes a bug in some diagrams where the
lines end up running in opposing directions and so the trains crash
into each other.  See:

- rust-lang#1787 (comment)

Keeping things in the forward direction avoids this problem.

In this commit, we'll leave in place all the infrastructure for
reversing the elements though it is no longer used.  We can of course
pull this out later.
We no longer need to reverse the elements anywhere in our railroad
diagrams, so let's remove the supporting infrastructure for doing
this.
For `RepeatRange(e, a, b)`, we were rendering `e` on the main line
then rendering under it a message about how many times it may or must
repeat based on `a` and `b`.

The trouble is that if we say that something "repeats once" on the
recurrent edge -- after we've already consumed a thing -- that reads
reasonably as though we're saying that two things can be consumed when
that's not what we mean.

Similarly, it's a bit odd to say, on the recurrent edge, that
something must "repeat twice" when that edge (and presumably then that
rule) may not be taken at all.

Let's solve all this by doing the following:

- For `e{1..1}`, simply render the node.
- For `e{0..1}`, treat this as simply `e?`.
- For `e{0..}`, treat this as simply `e*`.
- For `e{1..}`, treat this as simply `e+`.
- For `e{a..0}`, render an empty node.
- For `e{0..b} b > 1`, treat this as `(e{1..b})?`.
- For `e{1..b} b > 1`, render the node on the main line, then on the
  recurrent line render "at most {b - 1} more times".
- For `e{a..b} a > 1`, make a sequence of length `a` where the final
  node repeats `{1..b - (a - 1)}` times (or `{1..}` times if `b` is
  unbounded).

(We'll also add a check in parsing to ensure that for the range to be
well formed `a <= b`.)

As it turns out, the most straightforward way to implement this isn't
by recursing.  Doing that means we end up needing to take special care
to handle the suffix and the footnote, we have to build up an extra
`Expression` we don't need, and we have to `unwrap` the call.
Instead, it works better to treat this lowering in the manner of a
transitioning state machine in the spirit of `loop match` as proposed
in RFC 3720.
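
The lowering rules listed in this commit message can be sketched as a loop over a `(min, max)` state, in the spirit of the state-machine approach it describes. This is an illustration under assumptions: `Lowered` and `lower_range` are hypothetical names, `max = None` stands for an unbounded range, and an ordinary `loop` with `match` stands in for the proposed `loop match` syntax.

```rust
/// Hypothetical result of lowering `e{a..b}` for a fixed element `e`.
#[derive(Debug, Clone, PartialEq)]
enum Lowered {
    Empty,
    Node,                   // e
    Optional(Box<Lowered>), // x?
    ZeroOrMore,             // e*
    OneOrMore,              // e+
    AtMostMore(u32),        // e, then "at most n more times" on the recurrent edge
    Sequence(Vec<Lowered>),
}

fn lower_range(mut min: u32, mut max: Option<u32>) -> Lowered {
    // The commit adds a parse-time check that a <= b; mirror it here.
    if let Some(b) = max {
        assert!(min <= b, "range must satisfy a <= b");
    }
    let mut optional = false; // wrap the final result in `?`
    let mut prefix = 0u32;    // plain `e` nodes placed before the final node
    let core = loop {
        match (min, max) {
            (_, Some(0)) => break Lowered::Empty,
            (1, Some(1)) => break Lowered::Node,
            (0, Some(1)) => break Lowered::Optional(Box::new(Lowered::Node)),
            (0, None) => break Lowered::ZeroOrMore,
            (1, None) => break Lowered::OneOrMore,
            // e{0..b}, b > 1: treat as (e{1..b})? and go around again.
            (0, Some(_)) => {
                optional = true;
                min = 1;
            }
            // e{1..b}, b > 1: node plus "at most b - 1 more times".
            (1, Some(b)) => break Lowered::AtMostMore(b - 1),
            // e{a..b}, a > 1: a - 1 plain nodes, then e{1..b - (a - 1)}.
            (a, _) => {
                prefix = a - 1;
                max = max.map(|b| b - prefix);
                min = 1;
            }
        }
    };
    let mut result = if prefix > 0 {
        let mut items = vec![Lowered::Node; prefix as usize];
        items.push(core);
        Lowered::Sequence(items)
    } else {
        core
    };
    if optional {
        result = Lowered::Optional(Box::new(result));
    }
    result
}

fn main() {
    println!("{:?}", lower_range(0, Some(3)));
    println!("{:?}", lower_range(2, Some(4)));
    println!("{:?}", lower_range(3, None));
}
```

Each non-terminating arm only rewrites the state and loops, so no intermediate expression has to be built and unwrapped, which is the advantage over recursion that the commit message points to.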
@ehuss ehuss added this pull request to the merge queue Apr 15, 2025
Merged via the queue into rust-lang:master with commit 3340922 Apr 15, 2025
5 checks passed
Zalathar added a commit to Zalathar/rust that referenced this pull request Apr 16, 2025
Update books

## rust-lang/book

1 commits in 45f05367360f033f89235eacbbb54e8d73ce6b70..d33916341d480caede1d0ae57cbeae23aab23e88
2025-04-08 18:24:27 UTC to 2025-04-08 18:24:27 UTC

- Ch01+ch02 after tech review (rust-lang/book#4329)

## rust-lang/edition-guide

2 commits in 1e27e5e6d5133ae4612f5cc195c15fc8d51b1c9c..467f45637b73ec6aa70fb36bc3054bb50b8967ea
2025-04-15 19:49:59 UTC to 2025-04-11 15:27:31 UTC

- fix grammar errors (rust-lang/edition-guide#374)
- remove the unused and deprecated `multilingual` field from `book.toml` (rust-lang/edition-guide#375)

## rust-lang/nomicon

2 commits in b4448fa406a6dccde62d1e2f34f70fc51814cdcc..0c10c30cc54736c5c194ce98c50e2de84eeb6e79
2025-04-09 01:54:42 UTC to 2025-04-07 20:22:31 UTC

- Remove double wording in opaque type chapter (rust-lang/nomicon#487)
- remove `rust-intrinsic` ABI (rust-lang/nomicon#485)

## rust-lang/reference

6 commits in 46435cd4eba11b66acaa42c01da5c80ad88aee4b..3340922df189bddcbaad17dc3927d51a76bcd5ed
2025-04-15 19:03:24 UTC to 2025-04-10 01:56:25 UTC

- Add a new grammar renderer (rust-lang/reference#1787)
- Misc. spelling fixes (rust-lang/reference#1785)
- Fix std::ops links in range-expr (rust-lang/reference#1786)
- traits.md: remove unusual formatting (rust-lang/reference#1784)
- doc: add missing space (rust-lang/reference#1782)
- spelling fix, Discrimants -> Discriminants (rust-lang/reference#1783)
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this pull request Apr 16, 2025
rust-timer added a commit to rust-lang-ci/rust that referenced this pull request Apr 16, 2025
Rollup merge of rust-lang#139884 - rustbot:docs-update, r=ehuss
