ANTLR-isation of the lexical and pre-processor grammars #342

Nigel-Ecma · 2021-06-24T05:38:19Z

This is a follow-on from PR#339, which only ANTLR-ised the lexical grammar:

- All the non-pre-processor changes in PR#339 are included.
- The pre-processor changes in PR#339, made only so the grammar would pass ANTLR, are *replaced* by different changes in this PR.
- The grammar in PR#339, while passing ANTLR, does not work unless the pre-processor productions are omitted.
- The original grammar also passed ANTLR but did not work, for more reasons.

The lexical and pre-processor grammars in this PR both pass ANTLR and should work 99.9999% (approximately ;-)) of the time:

- To achieve this, the way the pre-processor is described, and its grammar, have been modified.
- In this grammar each pre-processor directive is lexed as a single instance of a `PP_Directive` token.
- To build a working ANTLR lexer these tokens can either (a) be `skip`'ed so the parser does not see them, or (b) be given a semantic action which implements the pre-processing process.
- Both options have been tested, and a fully working pre-processor has been implemented as a semantic action. This is (currently) written in Java and, even if a C# version is written, is not intended for anything other than internal use in testing.
- So what's that 99.9999% approximation? Well, the Standard (currently) allows input skipped over by `#if`/.../`#endif` processing to be lexically invalid. While the Standard describes this, the grammar when run through ANTLR does not. So if the feature has been taken advantage of, ANTLR will report lexical errors.
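
A minimal sketch of options (a) and (b), in Python rather than ANTLR or the Java used for testing. It treats any line whose first non-white-space character is `#` as one `PP_Directive` token. The function name `lex_lines`, the line-based scanning and the toy `#if SYMBOL` evaluation are assumptions made for illustration only; they are not the grammar or the pre-processor from this PR.

```python
import re

# One directive per line: optional white space, '#', a name, then the rest.
DIRECTIVE = re.compile(r"^[ \t]*#[ \t]*(\w+)[ \t]*(.*?)[ \t]*$")


def lex_lines(source, mode="skip", defined=()):
    """Return the source lines that would reach the parser.

    mode="skip":   every directive line is dropped, all other lines are kept.
    mode="action": a toy action tracks #if/#else/#endif against `defined`
                   (single symbols only, no expressions, no #elif) and
                   drops lines in inactive sections.
    """
    symbols = set(defined)
    stack = []  # one bool per open #if: is the current branch taken?
    out = []
    for line in source.splitlines():
        m = DIRECTIVE.match(line)
        if m:  # the whole line is a single directive token
            if mode == "action":
                name, arg = m.group(1), m.group(2)
                if name == "if":
                    stack.append(arg in symbols)
                elif name == "else" and stack:
                    stack[-1] = not stack[-1]
                elif name == "endif" and stack:
                    stack.pop()
                elif name == "define" and all(stack):
                    symbols.add(arg)
            continue  # directives never reach the parser
        if mode == "skip" or all(stack):
            out.append(line)
    return out


src = "#if DEBUG\nlog();\n#else\nrun();\n#endif\nend();"
print(lex_lines(src))                                    # option (a)
print(lex_lines(src, mode="action", defined=["DEBUG"]))  # option (b)
```

With option (a) both branches reach the parser; with option (b) only the active branch does. Unlike an ANTLR lexer, this sketch drops inactive lines without tokenising them, so it does not reproduce the lexical errors described in the last bullet.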

Note: the combined lexical, pre-processor & syntactic grammar remains unacceptable to ANTLR due to the mutual left recursion. This still needs to be removed by hand, as shown in the documents distributed on the mailing list. At least some of the changes required will be submitted in PRs.

jskeet · 2021-06-24T06:05:48Z

I'm hoping to fix the Word converter to allow fragments today.

This should enable dotnet#339 and dotnet#342 in terms of Word conversion.

Nigel-Ecma · 2021-06-24T21:16:59Z

So you can overview the grammar easily I've manually updated the grammar.md file so it should now reflect the grammar in the other sections.

jskeet · 2021-06-25T11:15:09Z

I'm going to wait until we've talked about #339 before I review this - my brain can't take any more ANTLR right now...

MadsTorgersen

This is great. I've brought up a few issues for discussion in comments, and would like to address those in the next meeting, but they should not hold up the PR.

MadsTorgersen · 2021-07-08T19:03:10Z

standard/lexical-structure.md

@@ -26,23 +26,29 @@ All terminal characters are to be understood as the appropriate Unicode characte

 ### 7.2.2 Grammar notation


Great clarifying text here; it helps a lot with setting the context for how to think of the ANTLR grammars throughout the spec.

standard/lexical-structure.md

MadsTorgersen · 2021-07-08T21:03:20Z

standard/lexical-structure.md

@@ -846,10 +934,14 @@ Pre-processing directives are not tokens and are not part of the syntactic gramm
 The conditional compilation functionality provided by the `#if`, `#elif`, `#else`, and `#endif` directives is controlled through pre-processing expressions ([§7.5.3](lexical-structure.md#753-pre-processing-expressions)) and conditional compilation symbols.

 ```ANTLR
-Conditional_Symbol
-    : Identifier_Or_Keyword { IsNotTrueOrFalse() }?
+fragment PP_Conditional_Symbol


I wonder whether the ANTLR semantic predicates are a good idea within the text of the standard. I'd almost rather have an end-of-line comment to the same effect. I think this is a good topic for discussion.

Nigel-Ecma · 2021-07-19

I wonder as well...

I'm currently tending towards the idea of having a section in Appendix A (which is all informative) which suggests where they might be used etc. as part of helping people actually use the grammar but without making any of it normative or including implementations (both of which are satisfied already, just distributed within the text).

On the other hand the inline predicate in `PP_Start` is rather key from an ANTLR perspective... dither, dither ;-)

jskeet · 2021-07-28

Resolved in meeting to remove them all and potentially add a section in the appendix later.

Nigel-Ecma · 2021-07-29

OK, all semantic predicates except the one on `PP_Start` have been removed. The `PP_Start` one is in-line and fundamental – without it or something else tokenising might be incorrect. However we could move it to the note if that is the consensus.
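
Why a check at the start of a directive matters can be sketched in a few lines of Python. The thread does not show what the `PP_Start` predicate contains, so this assumes its job is to enforce the C# rule that a directive's `#` is the first non-white-space character on its line. The function name `find_directives` is hypothetical.

```python
import re

# Assumption: a directive is recognised only when '#' is the first
# non-white-space character on the line.
LINE_START_DIRECTIVE = re.compile(r"[ \t]*#[ \t]*([A-Za-z_]\w*)")


def find_directives(source):
    """Return (line_number, directive_name) pairs for directive lines.

    A '#' that appears later in a line, for example inside a string
    literal, is not treated as the start of a directive.
    """
    found = []
    for number, line in enumerate(source.splitlines(), start=1):
        m = LINE_START_DIRECTIVE.match(line)  # match() anchors at line start
        if m:
            found.append((number, m.group(1)))
    return found


code = '  #define X\ns = "#if X";\n#endif'
print(find_directives(code))
```

Without the line-start restriction, the `#if X` inside the string literal on the second line could be picked up as a directive, which is the kind of incorrect tokenising the comment above warns about.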

MadsTorgersen · 2021-07-08T21:07:03Z

standard/lexical-structure.md

@@ -974,56 +1058,63 @@ A `#undef` may "undefine" a conditional compilation symbol that is not defined.
 The conditional compilation directives are used to conditionally include or exclude portions of a compilation unit.

 ```ANTLR
-Pp_Conditional
-    : Pp_If_Section Pp_Elif_Section* Pp_Else_Section? Pp_Endif
+fragment PP_Conditional


This is the bread and butter of this approach. This is where the change is made to have conditional directives include only the directive itself, not the source text guarded by it, thus forcing tokenization of the intermediate text.

I have to say that I think it is elegant. I wouldn't want to complicate the grammar to achieve that last 0.001% expressiveness; instead we can wave it away in prose.
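
Because each conditional directive is now a token of its own, the grammar no longer ties a `#if` to its `#endif`; the sequencing rule (a `#if`, zero or more `#elif`, at most one `#else`, then a `#endif`, properly nested) has to be checked outside the grammar. A minimal stack-based sketch of such a check, in Python; the function name `check_conditional_groups` and the error messages are hypothetical, and this is not the checker used for this PR.

```python
def check_conditional_groups(directives):
    """Check that directive names form complete, properly nested groups.

    `directives` is a sequence of names without the '#', in source order.
    Returns None when well formed, otherwise a short error message.
    """
    stack = []  # one entry per open #if: True once its #else has been seen
    for position, name in enumerate(directives):
        if name == "if":
            stack.append(False)
        elif name in ("elif", "else", "endif"):
            if not stack:
                return f"#{name} at index {position} has no matching #if"
            if name == "endif":
                stack.pop()
            elif stack[-1]:
                return f"#{name} at index {position} follows #else"
            elif name == "else":
                stack[-1] = True
        # Other directives (#define, #region, ...) are ignored here.
    if stack:
        return f"{len(stack)} #if directive(s) missing #endif"
    return None


print(check_conditional_groups(["if", "elif", "else", "endif"]))
print(check_conditional_groups(["if", "else", "elif", "endif"]))
```

The first call reports no error; the second is rejected because a `#elif` appears after the `#else` of the same group.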

standard/lexical-structure.md

Nigel-Ecma · 2021-07-19T23:40:50Z

Thanks for the review Mads. I’ve added a few comments to the comments to help the discussion along.

…

On 9/07/2021, at 9:18 am, Mads Torgersen wrote:

> @MadsTorgersen approved this pull request.
>
> This is great. I've brought up a few issues for discussion in comments, and would like to address those in the next meeting, but they should not hold up the PR.
>
> In standard/lexical-structure.md:
>
> > @@ -26,23 +26,29 @@ All terminal characters are to be understood as the appropriate Unicode characte
> >
> >  ### 7.2.2 Grammar notation
>
> Great clarifying text here; it helps a lot with setting the context for how to think of the ANTLR grammars throughout the spec.
>
> In standard/lexical-structure.md:
>
> > The lexical processing of a C# compilation unit consists of reducing the file into a sequence of tokens that becomes the input to the syntactic analysis. Line terminators, white space, and comments can serve to separate tokens, and pre-processing directives can cause sections of the compilation unit to be skipped, but otherwise these lexical elements have no impact on the syntactic structure of a C# program. When several lexical grammar productions match a sequence of characters in a compilation unit, the lexical processing always forms the longest possible lexical element.
> >
> > > *Example*: The character sequence `//` is processed as the beginning of a single-line comment because that lexical element is longer than a single `/` token. *end example*
> >
> > +The rules of the lexical grammar are ordered top-to-bottom, with the first matching rule determining the recognised token. Implicit rules ([§7.2.3](lexical-structure.md#723-lexical-grammar)) derived from literal strings in the grammar are ordered before the explicit rules. This ordering simplifies the grammar.
>
> I am concerned about the reliance on ordering. The old spec didn't have that. It introduces an implicit context for grammar rules that prevents them from standing alone, and for instance makes it hard to reliably quote them.
>
> I'd like to suggest that we minimize the reliance on ordering, and where necessary, make a comment as to the exceptions incurred by previous rules.
>
> In standard/lexical-structure.md:
>
> > @@ -846,10 +934,14 @@ Pre-processing directives are not tokens and are not part of the syntactic gramm
> >  The conditional compilation functionality provided by the `#if`, `#elif`, `#else`, and `#endif` directives is controlled through pre-processing expressions ([§7.5.3](lexical-structure.md#753-pre-processing-expressions)) and conditional compilation symbols.
> >
> >  ```ANTLR
> > -Conditional_Symbol
> > -    : Identifier_Or_Keyword { IsNotTrueOrFalse() }?
> > +fragment PP_Conditional_Symbol
>
> I wonder whether the ANTLR semantic predicates are a good idea within the text of the standard. I'd almost rather have an end-of-line comment to the same effect. I think this is a good topic for discussion.
>
> In standard/lexical-structure.md:
>
> > @@ -974,56 +1058,63 @@ A `#undef` may "undefine" a conditional compilation symbol that is not defined.
> >  The conditional compilation directives are used to conditionally include or exclude portions of a compilation unit.
> >
> >  ```ANTLR
> > -Pp_Conditional
> > -    : Pp_If_Section Pp_Elif_Section* Pp_Else_Section? Pp_Endif
> > +fragment PP_Conditional
>
> This is the bread and butter of this approach. This is where the change is made to have conditional directives include only the directive itself, not the source text guarded by it, thus forcing tokenization of the intermediate text.
>
> I have to say that I think it is elegant. I wouldn't want to complicate the grammar to achieve that last 0.001% expressiveness; instead we can wave it away in prose.
>
> In standard/lexical-structure.md:
>
> > -Skipped_Section_Part
> > -    : Skipped_Characters? New_Line
> > -    | Pp_Directive
> > -    ;
> > -
> > -Skipped_Characters
> > -    : Whitespace? Not_Number_Sign Input_Character*
> > -    ;
> > +Conditional compilation directives shall be written as sets consisting of, in order, a `#if` directive, zero or more `#elif` directives, zero or one `#else` directive, and a `#endif` directive. Between the directives are ***conditional sections*** of source code. Each section is controlled by the immediately preceding directive. A conditional section may itself contain nested conditional compilation directives provided these directives form complete sets.
>
> This is where we pay the price of not being able to sequence the conditional directives in grammar. That's fine - the term "set" may be confusing here, because it does not refer to a mathematical set (unordered by definition). Maybe another term, like "group", would be better? (Yes, groups are also a mathematical notion, but less likely to be understood as such in this context!)
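
Returning to the ordering concern quoted in the review above: the combination of "longest possible lexical element" and "first matching rule wins" can be sketched in Python. The rule names and patterns below are hypothetical and far smaller than the C# lexical grammar; the sketch only illustrates how a rule's meaning comes to depend on the rules listed before it.

```python
import re

# Hypothetical rules in priority order (top-to-bottom).
RULES = [
    ("KEYWORD", re.compile(r"int|if")),
    ("IDENTIFIER", re.compile(r"[A-Za-z_]\w*")),
    ("COMMENT", re.compile(r"//[^\n]*")),
    ("SLASH", re.compile(r"/")),
    ("WS", re.compile(r"[ \t\r\n]+")),
]


def tokenize(text):
    """Longest match wins; on a tie the rule listed first wins."""
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Only a strictly longer match replaces an earlier rule's match.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name != "WS":  # white space only separates tokens
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


print(tokenize("if iffy // c"))
```

Read on its own, the `IDENTIFIER` pattern matches `if`; it is only the position of `KEYWORD` above it that excludes keywords. That is the implicit context the review comment describes, and it is why a rule quoted in isolation can mislead.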

Nigel-Ecma requested review from gafter, jskeet and MadsTorgersen June 24, 2021 06:01

Nigel-Ecma self-assigned this Jun 24, 2021

Yes, I goofed in git again :-( Should be fine now.

3560b6a

jskeet added a commit to jskeet/csharpstandard that referenced this pull request Jun 24, 2021

Parse (and format) ANTLR fragment productions in the Word converter

e8bf0dc

This should enable dotnet#339 and dotnet#342 in terms of Word conversion.

jskeet mentioned this pull request Jun 24, 2021

Parse (and format) ANTLR fragment productions in the Word converter #343

Merged

Manual update of grammar.md to match grammar in the other sections.

0d893a1

Nigel-Ecma mentioned this pull request Jun 26, 2021

The ANTLR-isation of the lexical grammar. The C# "lexical" grammar bo… #339

Closed

Various tweaks thanks to @jskeet:

f1d1dd7

- typos
- removal of proper (British) spellings ;-)
- improved phrasing
- straightened all the quotes in lexical-structure.md *only*, checking across the whole Standard I leave to someone else. I hope the Word conversion puts them back in
- etc.

jskeet added a commit that referenced this pull request Jun 29, 2021

Parse (and format) ANTLR fragment productions in the Word converter

7a7ea8c

This should enable #339 and #342 in terms of Word conversion.

gafter self-assigned this Jun 30, 2021

MadsTorgersen approved these changes Jul 8, 2021

Nigel-Ecma marked this pull request as ready for review July 20, 2021 03:44

This was referenced Jul 29, 2021

The Identifier grammar and its reference to keywords #260

Closed

Distinguishing between a keyword and an identifier #259

Closed

Nigel-Ecma added 2 commits July 30, 2021 11:51

Changes after meeting, now ready to merge.

1e670af

Turns out there was one ANTLR *action* as well as the semantic predicates, it is also now removed.

d44a06a

BillWagner merged commit a842605 into dotnet:draft-v6 Jul 31, 2021

RexJaeschke mentioned this pull request Oct 10, 2021

ANTLR: Deciding on how far to go with this #37

Closed

jskeet mentioned this pull request Mar 15, 2022

Do we need to parse the grammar at all in the Word converter? #494

Closed
