ANTLR: Deciding on how far to go with this #37
Comments
[From Neal Gafter in private mail to Rex on 2020-11-18, after having reviewed Rex's paper.]
We should not attempt to produce a grammar that will validate using the ANTLR toolkit. There are some parts of the syntax that cannot be neatly explained in the grammar, and would result in a grammar that definitely does not validate (i.e. it is ambiguous). https://github.com/ECMA-TC49-TG2/conversion-to-markdown/blob/v6-draft/lexical-structure.md#725-grammar-ambiguities is a good example. Interpolated strings are another (separating the syntactic and lexical parts might make things less clear). See also https://github.com/ECMA-TC49-TG2/conversion-to-markdown/blob/v6-draft/expressions.md#1288-cast-expressions. I think we should use the formalism of ANTLR to document our syntax with as much rigor as we have historically used.
If we wanted to do this [validate, that is], we probably would use parser modes for async (different set of keywords inside an async method), LINQ (different set of keywords inside the query), and possibly interpolated strings.
As for unsafe, it can be written
class_modifier
: 'unsafe'
;
because (if I'm not mistaken) you are allowed to have multiple rules for the same nonterminal, and they are merged. But the surrounding text should make it clear that this is in addition to other alternatives specified elsewhere.
We specify associativity and precedence by spelling out the structure of the expression grammar explicitly. We don't need to specify precedence or associativity using ANTLR features.
[Rex drafted the following text.]
Proposal 1: We should use ANTLR notation with common conventions. However, we should not spend any effort on getting the grammar to validate with the ANTLR toolkit.
Assuming we go this route, we'll need to add some appropriate text to 7.2, Grammars. |
If we buy into Proposal 1 (#37 (comment)), we avoid the following (the numbers in parens refer to sections in Rex's paper):
Not_Slash_Or_Asterisk
: '<Any Unicode character except / or *>'
;
Letter_Character
: '<A Unicode Character of classes Lu, Ll, Lt, Lm, Lo, or Nl>'
| '<A Unicode_Escape_Sequence representing a Char of classes Lu, Ll, …>'
;
|
Proposal 2: Change the spelling of lexer rule names to use a leading uppercase letter on each word and an underscore as a word separator, as in
|
I would prefer that the ANTLR grammar be written so that it validates; I would have thought this was a key reason to switch to ANTLR notation. I haven’t looked into whether an ANTLR grammar that validates is possible, hence the “prefer” rather than “should”.
That said, if the grammar does not/cannot validate then I think the Standard must state this. Expecting an ANTLR grammar to validate is, at minimum, reasonable given the purpose of ANTLR, so the Standard should make it clear that ANTLR is used as a notation only, and then leave it at that: saying more would beg the question of why the Standard uses a notation that cannot describe the language's syntax, and we probably shouldn’t raise that in the Standard text!
Note that if the limitations of ANTLR are restricted to such things as specifying “any character in Unicode class X” then that looks like a reasonable extension to ANTLR that could be defined (think of the various extended BNFs that exist, I’ve defined one myself when the need arose – though I’ve long since forgotten what that need was just that I had to write a document on the grammar notation! ;-)), but the Standard may not want to go there.
|
After the previous comment I glanced quickly at the ANTLR docs thinking it surely must handle Unicode character classes and it does, see https://github.com/antlr/antlr4/blob/master/doc/lexer-rules.md. |
I've taken a brief look at ANTLR and I believe the rules Rex quotes above can be written:
The last rule includes a semantic predicate. The above doesn't tell us that all the context-sensitive/semantic needs of the C# grammar can be supported, but I think we should find out; I have a certain unease in publishing a language Standard with a known invalid/incomplete grammar. |
I'm definitely fine with making it clear that the grammar doesn't validate. If there's no way we can make it validate, for reasons expressed before (e.g. interpolated string literals and the grammar ambiguities in 7.2.5) I would be interested to know how feasible it would be to make it "nearly validate" - is there anything we can do to ensure that we don't create invalid rules by mistake? If ANTLR can list the ways in which it's invalid, and if that's a small-but-sensible list, we could perform validation within a GitHub action and compare the result against the expected set of failures. |
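The "compare against the expected set of failures" idea above could be sketched roughly as follows. This is only an illustration: the function names and the diagnostic strings are hypothetical placeholders, not real ANTLR output, and how the diagnostics would be collected from the toolkit inside a GitHub action is left open.

```python
def compare_diagnostics(actual, expected):
    """Compare the diagnostics a validation run produced with the known list.

    Both arguments are iterables of diagnostic strings (hypothetical format).
    Returns (unexpected, resolved):
      unexpected: diagnostics that appeared but are not on the known list
      resolved:   known diagnostics that no longer appear
    """
    actual_set = set(actual)
    expected_set = set(expected)
    unexpected = sorted(actual_set - expected_set)
    resolved = sorted(expected_set - actual_set)
    return unexpected, resolved


def validation_passes(actual, expected):
    """The check passes only when the run matches the known list exactly."""
    unexpected, resolved = compare_diagnostics(actual, expected)
    return not unexpected and not resolved


# Placeholder diagnostics, made up for this sketch.
known_failures = ["ambiguity in 7.2.5", "interpolated string literal"]
this_run = ["ambiguity in 7.2.5", "interpolated string literal", "typo in rule"]
unexpected, resolved = compare_diagnostics(this_run, known_failures)
```

Treating a "resolved" entry as a failure too is a design choice: it forces the expected list to be updated whenever the grammar improves, so the list stays small and accurate.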
Resolutions from Dec 16, 2020 meeting:
|
During the Dec 2020 teleconference, I took an action item to identify the ANTLR convention changes we could/should use. As a result, I have updated the first entry in this issue (#37 (comment)). My recommendations are shown with TBD in the table's Status column. |
@RexJaeschke: I've just removed the stale meeting labels... do we want to come back to this in the January meeting, in which case I'll reapply them? |
@jskeet: I'd very much like us to decide on Proposal 2 (#37 (comment)) on the next call. Assuming we agree to make this change, I can then get started on implementing it in the base spec as well as in the V6 PRs that are impacted. |
@RexJaeschke: Thanks - I've added the discuss label. Hopefully if we limit the scope to just that topic, it won't take much time. |
We have decided to accept proposal 2, going with the |
Getting Started with the ANTLR Toolkit
For anyone who wants to play with the ANTLR toolkit, I have created the following document: ANTLR Grammar Validation.docx. |
Proposal 4: Make limited use of token names
In 5.3 of my paper (mentioned at the very top of this issue), I discussed the possibility of using some extra lexer rules, but as synonyms for individual tokens. I now argue in favor of us going that route. Specifically, by adding the following:
DEFAULT : 'default' ;
NULL : 'null' ;
TRUE : 'true' ;
FALSE : 'false' ;
ASTERISK : '*' ;
SLASH : '/' ;
[Note the naming convention: Such lexer rules are all-caps to distinguish them from "ordinary" lexer rules, which start with a single cap on each word.]
the multiple places where the expanded literal is used are replaced with the name. For example:
Boolean_Literal
: TRUE
| FALSE
;
Keyword
...
| TRUE
...
;
Pp_Primary_Expression
: TRUE
| FALSE
...
; |
Proposal 5: Resolving Mutual Left-Recursion issue
As I described in Item 15 of my paper, this issue is a major blocking factor both for the lexer and syntactic grammar. In the next month or so, it would be good to have someone assigned to looking at this. |
Proposal 6: Handling the unsafe extensions
Currently, we define most of these extensions in unsafe-code.md using the following approach:
class_modifier
: …
| 'unsafe'
;
where that rule is defined in classes.md, as
class_modifier
: 'new'
| 'public'
| 'protected'
| 'internal'
| 'private'
| 'abstract'
| 'sealed'
| 'static'
;
As such, we cannot simply include in the grammar we input to the ANTLR toolkit the grammar from unsafe-code.md in a verbatim manner. BTW, while most extensions involve the addition of an alternative line to an existing rule, several require changes to existing lines. And the extension grammar modifies rules in a number of md files, not just classes.md.
The challenge here is how to deal with this, long term, so we can extract two separate grammars, one core, the other, extended.
I propose that we merge the unsafe grammar extensions directly into the corresponding places in the grammar blocks of the early chapters.
Some thoughts to consider:
|
Proposal 4: Probably ambivalent on this one myself, for punctuation it often makes sense, don't know how often
Proposal 5: No comment ;-) That said I'll probably look into this sometime...
Proposal 6: My first thought was "I wonder how hard hacking ANTLR is to support continuation productions..." :-)
At the moment the language is in terms of "unsafe extensions"; making the proposed change might change it more to "unsafe support is optional", and if so maybe the proposed grammar presentation suggests wording changes should be considered as part of it.
C# effectively has two grammars, one with unsafe productions and one without, and that is how it is currently presented. When checking the grammar do not both grammars need to be checked independently? At the moment we have to combine the "continuation" productions to get the grammar with optional parts, easily enough done by searching for ellipses. If we present a single grammar complete with optional parts, ripping the latter out to check the grammar without optional parts might end up being more work?
On balance at the moment I'd keep the continuation productions, but it's not an unchangeable view! |
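To make the "combine the continuation productions by searching for ellipses" step concrete, here is a minimal sketch. It assumes a grammar has already been extracted into a dictionary mapping each rule name to its list of alternatives; that representation and the name merge_extensions are hypothetical. It also assumes an extension rule without a leading ellipsis is a new or replacement rule, which is a simplification of the "changes to existing lines" case mentioned in Proposal 6.

```python
ELLIPSIS = "…"  # marker used as the first alternative of a continuation production


def merge_extensions(base, extensions):
    """Return a new grammar with the extension alternatives merged into base.

    base, extensions: dict of rule name -> list of alternative strings.
    A rule in extensions whose first alternative is the ellipsis marker is a
    continuation: its remaining alternatives are appended to the base rule.
    Any other extension rule is taken as-is (assumption made for this sketch).
    """
    merged = {name: list(alts) for name, alts in base.items()}
    for name, alts in extensions.items():
        if alts and alts[0] == ELLIPSIS:
            if name not in merged:
                raise KeyError(f"continuation of unknown rule: {name}")
            merged[name].extend(alts[1:])
        else:
            merged[name] = list(alts)
    return merged


core = {
    "class_modifier": ["'new'", "'public'", "'protected'", "'internal'",
                       "'private'", "'abstract'", "'sealed'", "'static'"],
}
unsafe = {"class_modifier": [ELLIPSIS, "'unsafe'"]}
extended = merge_extensions(core, unsafe)
```

With this shape, the core grammar is checked by using `core` directly and the extended grammar by using `extended`, so both can be validated independently from the same sources.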
Meeting 2021-02-10:
|
Regarding proposal 5: my fork of ANTLR supports indirect left recursion elimination. I need to review the other proposals above to see if I have other feedback on them. |
Proposal 3: Dealing with
|
Following comments on #225 & #230:
Recommendation 1B: skip using ANTLR escapes and go straight to Unicode
Recommendation 1D: allow sets or alternates as in Recommendation 2, the subjective choice based on readability
Recommendation 2: drop the ANTLR escapes but otherwise allow either
Recommendation 3: Yes use the short names, I think they are in more common usage
Recommendation 4A: Either both spellings or follow 4B and use semantic predicates – choose subjectively
Recommendation 4B: Semantic predicates, so we have a grammar which works. We can provide C# code for the predicates as I understand ANTLR can output C# (I haven't confirmed that and was wondering whether adding Java code for them would be a good look or not ;-))
Recommendation 5: I'd say semantic predicates again.
Wherever we use a semantic predicate we can name it and/or add a comment to indicate its purpose - |
(Note to myself as much as anyone else.) Presumably in 4A we could use Other than that note, I agree with Nigel. Presumably for recommendation 5, we really need to use something more than textual constraints if we want the grammar to validate correctly? |
It's on Maven Central here: It looks like the direct download link is: |
Items 1-3 seem fine to me.
4A:
4B: A semantic predicate is the least ugly option here.
5: At this point we're talking more about parser rules than lexer rules.
I would recommend rewriting this as:
Handling this depends on whether we treat pre-processing as part of the normal lexer pass or as a separate pass before the lexer. If it's all one pass, we could use this:
|
Re Proposal 6 (unsafe grammar), on this week's call we agreed that "unsafe type" should be "pointer" type. With that change, Rex's proposal was accepted, and he'll turn that into a PR. |
Regarding Proposal 3, "<...>" (#37 (comment)), I've taken into account input from Nigel (#37 (comment)) and Sam (#37 (comment)). The handling of the Unicode sequences in identifiers results in the following grammar, with current alternatives to be replaced commented out and followed by their replacement(s):
Letter_Character
// : '<A Unicode character of classes Lu, Ll, Lt, Lm, Lo, or Nl>'
// | '<A Unicode_Escape_Sequence representing a character of classes Lu, Ll, Lt, Lm, Lo, or Nl>'
: [\p{L}\p{Nl}] // category Letter, all subcategories; category Number, subcategory letter
| Unicode_Escape_Sequence { IsLetterCharacter() }?
;
Combining_Character
// : '<A Unicode character of classes Mn or Mc>'
// | '<A Unicode_Escape_Sequence representing a character of classes Mn or Mc>'
: [\p{Mn}\p{Mc}] // category Mark, subcategories non-spacing and spacing combining
| Unicode_Escape_Sequence { IsCombiningCharacter() }?
;
Decimal_Digit_Character
// : '<A Unicode character of the class Nd>'
// | '<A Unicode_Escape_Sequence representing a character of the class Nd>'
: [\p{Nd}] // category Number, subcategory decimal digit
| Unicode_Escape_Sequence { IsDecimalDigitCharacter() }?
;
Connecting_Character
// : '<A Unicode character of the class Pc>'
// | '<A Unicode_Escape_Sequence representing a character of the class Pc>'
: [\p{Pc}] // category Punctuation, subcategory connector
| Unicode_Escape_Sequence { IsConnectingCharacter() }?
;
Formatting_Character
// : '<A Unicode character of the class Cf>'
// | '<A Unicode_Escape_Sequence representing a character of the class Cf>'
: [\p{Cf}] // category Other, subcategory format
| Unicode_Escape_Sequence { IsFormattingCharacter() }?
;
While the definitions of the
This leaves the final two uses of the '<...>' notation:
Available_Identifier
// : '<An Identifier_Or_Keyword that is not a Keyword>'
: Identifier_Or_Keyword { IsNotAKeyword() }?
;
Conditional_Symbol
// : '<Any Identifier_Or_Keyword except true or false>'
: Identifier_Or_Keyword { IsNotTrueOrFalse() }?
;
Nigel and I support the use of semantic predicates here as well, which I have shown above. Sam proposed some rearrangement of the grammar instead. @sharwell, is the semantic predicate approach OK with you? How does this all look to you, @Nigel-Ecma? |
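As a rough illustration of what predicates such as IsLetterCharacter(), IsNotAKeyword() and IsNotTrueOrFalse() might check, here is a sketch in Python. The thread does not define their bodies, so everything below is an assumption: the predicates are given the matched text as an argument, the escape decoder handles only the \uXXXX and \UXXXXXXXX forms, and the keyword set is a small hypothetical subset.

```python
import unicodedata

# Unicode general categories named in the Letter_Character rule above.
LETTER_CLASSES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}

# Hypothetical subset; the real list is the Keyword rule of the grammar.
KEYWORDS = frozenset({"class", "default", "false", "null", "true"})


def decode_unicode_escape(text):
    """Decode a \\uXXXX or \\UXXXXXXXX escape into the character it denotes."""
    if text.startswith("\\u") and len(text) == 6:
        return chr(int(text[2:], 16))
    if text.startswith("\\U") and len(text) == 10:
        return chr(int(text[2:], 16))
    raise ValueError(f"not a Unicode escape sequence: {text!r}")


def is_letter_character(escape_text):
    """True if the escape denotes a character of class Lu, Ll, Lt, Lm, Lo, or Nl."""
    return unicodedata.category(decode_unicode_escape(escape_text)) in LETTER_CLASSES


def is_not_a_keyword(identifier_text):
    """True if the matched Identifier_Or_Keyword text is not a keyword."""
    return identifier_text not in KEYWORDS


def is_not_true_or_false(identifier_text):
    """True if the matched text is neither 'true' nor 'false'."""
    return identifier_text not in ("true", "false")
```

The other character-class predicates would follow the same pattern with a different category set, for example {"Mn", "Mc"} for a combining-character check.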
Proposal 5: Resolving Mutual Left-Recursion issue
@MadsTorgersen, @sharwell, @Nigel-Ecma
There are three lexer rules that have this problem, all in the preprocessor grammar, w.r.t Pp_Expressions involving ||, &&, ==, and !=.
On a whim, I rewrote those rules, with each commented-out line being replaced by the one following, as follows:
Pp_Or_Expression
: Pp_And_Expression
// | Pp_Or_Expression Whitespace? '||' Whitespace? Pp_And_Expression
| Whitespace? Pp_And_Expression '||' Pp_Or_Expression Whitespace?
;
Pp_And_Expression
: Pp_Equality_Expression
// | Pp_And_Expression Whitespace? '&&' Whitespace? Pp_Equality_Expression
| Whitespace? Pp_Equality_Expression '&&' Pp_And_Expression Whitespace?
;
Pp_Equality_Expression
: Pp_Unary_Expression
// | Pp_Equality_Expression Whitespace? '==' Whitespace? Pp_Unary_Expression
| Whitespace? Pp_Unary_Expression '==' Pp_Equality_Expression Whitespace?
// | Pp_Equality_Expression Whitespace? '!=' Whitespace? Pp_Unary_Expression
| Whitespace? Pp_Unary_Expression '!=' Pp_Equality_Expression Whitespace?
;
All I did was simply swap the two operands over, and it validated! But are the new forms equivalent/correct? |
All I did was simply swap the two operands over, and it validated! But are the new forms equivalent/correct?
It changes the grouping/associativity, e.g. given a || b || c one version parses it as (a || b) || c, the other as a || (b || c).
Does it matter *in this case* (obviously it can matter in some cases)?
(Sorry no answer right now, I’m distracting myself clearing email when I should be doing something else which I must get back to…)
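One way to probe the "does it matter in this case" question is a brute-force check over Boolean operands, sketched below. It assumes preprocessor operands are plain Booleans, that only operators of one precedence level are chained together (as in the rules quoted above), and that evaluation has no side effects. All names are made up for this sketch. It compares only the computed values of the two groupings; it says nothing about whether the rewritten rules accept the same input text.

```python
from itertools import product

OPS = {
    "||": lambda a, b: a or b,
    "&&": lambda a, b: a and b,
    "==": lambda a, b: a == b,
    "!=": lambda a, b: a != b,
}

# Operators that can be chained at each precedence level of Pp_Expression.
LEVELS = [("||",), ("&&",), ("==", "!=")]


def eval_left(operands, operators):
    """Evaluate with left grouping: ((a op1 b) op2 c) ..."""
    result = operands[0]
    for op, value in zip(operators, operands[1:]):
        result = OPS[op](result, value)
    return result


def eval_right(operands, operators):
    """Evaluate with right grouping: a op1 (b op2 (c ...))"""
    result = operands[-1]
    for op, value in zip(reversed(operators), reversed(operands[:-1])):
        result = OPS[op](value, result)
    return result


def groupings_agree(level_ops, length=3):
    """True if both groupings give the same value for every operator/operand mix."""
    for operators in product(level_ops, repeat=length - 1):
        for operands in product([False, True], repeat=length):
            if eval_left(operands, operators) != eval_right(operands, operators):
                return False
    return True


same_value_at_every_level = all(groupings_agree(level) for level in LEVELS)
```

Under those assumptions the two groupings compute the same value at each level, while the parse trees still differ in shape, which is the distinction raised above.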
|
This Issue was closed automatically when the unsafe grammar PR was merged, but we still have the left-recursive situation to deal with, so am re-opening. |
Proposal 7: Marking lexer helper rules with
|
Proposal 8: Updating text in lexical-structure.md to state our intent w.r.t using ANTLR notation vs. having an ANTLR-ready grammar.
Nigel is handling this as part of his grammar overhaul PRs. |
@Nigel-Ecma I'm inclined to close this Issue, unless you want to keep it open as a hook for things you are still working on. |
I think closing it is fine; you could reference the various PRs now merged (interpolated strings mentioned in #37 are still in progress but a PR will come).
|
On 2020-10-03, I circulated the following paper: ANTLR and the C# Spec.docx. In that, I discussed using the ANTLR grammar notation and the non-trivial amount of work I thought it would take to get the grammar to validate using the ANTLR toolkit.
Neal has reviewed that paper and his feedback follows in a comment below, with a proposal on how we might go forward. Once we discuss (and probably agree with) his suggestion, I'll update this comment by adding the remaining questions and keeping track of our decisions on those questions.
Here are the ANTLR-related questions and their status:
fragment
When you add a comment to this issue, please limit it to a single Proposal # and say which Proposal # that is.
Any decisions we make that require changes to the grammar notation will not only have to be made to the v5 grammar we're starting with in v6, but also possibly to the v6 (and v7) PRs in waiting.