ANTLR-isation of the lexical and pre-processor grammars #342
This is a follow on from PR#339 which only ANTLR-ised the lexical grammar:

- All the non-pre-processor changes in PR#339 are included.
- The pre-processor changes in PR#339 made only so the grammar would pass ANTLR are *replaced* by different changes in this PR.
- The grammar in PR#339, while passing ANTLR, does not work unless the pre-processor productions are omitted.
- The original grammar also passed ANTLR but did not work, for more reasons.

The lexical and pre-processor grammars in this PR both pass ANTLR and should work 99.9999% (approximately ;-)) of the time:

- To achieve this, the way the pre-processor is described and its grammar have been modified.
- In this grammar each pre-processor directive is lexed as a single instance of a `PP_Directive` token.
- To build a working ANTLR lexer these tokens can either (a) be `skip`'ed so the parser does not see them or (b) be handled by a semantic action which implements the pre-processing process.
- Both options have been tested and a fully working pre-processor implemented as a semantic action. This is (currently) written in Java and, even if a C# version is written, is not intended for anything other than internal use in testing.
- So what's that 99.9999% approximation? Well, the Standard (currently) allows input skipped over by `#if`/.../`#endif` processing to be lexically invalid. While the Standard describes this, the grammar when run through ANTLR does not. So if the feature has been taken advantage of, ANTLR will report lexical errors.

Note: the combined lexical, pre-processor & syntactic grammar remains unacceptable to ANTLR due to the mutual left recursion. This still needs to be removed by hand as shown in the documents distributed on the mailing list. At least some of the changes required will be submitted in PRs.
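To make option (a) concrete, here is a minimal Python sketch, not the Java implementation mentioned above. It treats each directive line as one `PP_Directive` token and then drops those tokens before a parser would see them. The names `PP_DIRECTIVE`, `lex_lines` and `skip_directives` are hypothetical, and the line-based pattern is a simplifying assumption: it ignores, for example, a `#` at the start of a line inside a delimited comment or verbatim string.

```python
import re

# Assumption: a directive is any line whose first non-blank character is '#'.
PP_DIRECTIVE = re.compile(r"^[ \t]*#.*$")

def lex_lines(source):
    """Classify each line as a single PP_Directive token or as ordinary text."""
    tokens = []
    for line in source.splitlines():
        kind = "PP_Directive" if PP_DIRECTIVE.match(line) else "Text"
        tokens.append((kind, line))
    return tokens

def skip_directives(tokens):
    """Option (a): drop PP_Directive tokens so a parser never sees them."""
    return [t for t in tokens if t[0] != "PP_Directive"]
```

Option (b) would replace `skip_directives` with logic that acts on each directive as it is recognised.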
I'm hoping to fix the Word converter to allow fragments today.
This should enable dotnet#339 and dotnet#342 in terms of Word conversion.
So you can overview the grammar easily I've manually updated the grammar.md file so it should now reflect the grammar in the other sections.
I'm going to wait until we've talked about #339 before I review this - my brain can't take any more ANTLR right now...
- typos
- removal of proper (British) spellings ;-)
- improved phrasing
- straightened all the quotes in lexical-structure.md *only*, checking across the whole Standard I leave to someone else. I hope the Word conversion puts them back in
- etc.
This is great. I've brought up a few issues for discussion in comments, and would like to address those in the next meeting, but they should not hold up the PR.
@@ -26,23 +26,29 @@ All terminal characters are to be understood as the appropriate Unicode characte
### 7.2.2 Grammar notation
Great clarifying text here; it helps a lot with setting the context for how to think of the ANTLR grammars throughout the spec.
@@ -846,10 +934,14 @@ Pre-processing directives are not tokens and are not part of the syntactic gramm
The conditional compilation functionality provided by the `#if`, `#elif`, `#else`, and `#endif` directives is controlled through pre-processing expressions ([§7.5.3](lexical-structure.md#753-pre-processing-expressions)) and conditional compilation symbols.

```ANTLR
-Conditional_Symbol
-    : Identifier_Or_Keyword { IsNotTrueOrFalse() }?
+fragment PP_Conditional_Symbol
I wonder whether the ANTLR semantic predicates are a good idea within the text of the standard. I'd almost rather have an end-of-line comment to the same effect. I think this is a good topic for discussion.
I wonder as well...
I'm currently tending towards the idea of having a section in Appendix A (which is all informative) which suggests where they might be used etc. as part of helping people actually use the grammar but without making any of it normative or including implementations (both of which are satisfied already, just distributed within the text).
On the other hand the inline predicate in `PP_Start` is rather key from an ANTLR perspective... dither, dither ;-)
Resolved in meeting to remove them all and potentially add a section in the appendix later.
OK, all semantic predicates except the one on `PP_START` have been removed. The `PP_Start` one is in-line and fundamental – without it or something else tokenising might be incorrect. However we could move it to the note if that is the consensus.
@@ -974,56 +1058,63 @@ A `#undef` may "undefine" a conditional compilation symbol that is not defined.
The conditional compilation directives are used to conditionally include or exclude portions of a compilation unit.

```ANTLR
-Pp_Conditional
-    : Pp_If_Section Pp_Elif_Section* Pp_Else_Section? Pp_Endif
+fragment PP_Conditional
This is the bread and butter of this approach. This is where the change is made to have conditional directives include only the directive itself, not the source text guarded by it, thus forcing tokenization of the intermediate text.
I have to say that I think it is elegant. I wouldn't want to complicate the grammar to achieve that last 0.001% expressiveness; instead we can wave it away in prose.
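The trade-off described here can be sketched in Python under loud assumptions: conditions are a single symbol name, only `#if`/`#elif`/`#else`/`#endif` are recognised, and the toy `lex_text` tokeniser stands in for the real lexical grammar. All names are hypothetical. With `lex_skipped=True` the guarded text is always tokenised and only then discarded, which is the behaviour this approach forces. With `lex_skipped=False` skipped text is never examined, which is what the Standard's prose allows.

```python
import re

DIRECTIVE = re.compile(r"^\s*#\s*(if|elif|else|endif)\b\s*(.*)$")
TOKEN = re.compile(r"\s*(?:[A-Za-z_]\w*|\d+|[{}();=+\-*/<>!.,\[\]])")  # toy token set

def lex_text(line):
    """Toy lexer: raise ValueError on any character it cannot tokenise."""
    tokens, pos, line = [], 0, line.rstrip()
    while pos < len(line):
        m = TOKEN.match(line, pos)
        if m is None:
            raise ValueError(f"lexical error near column {pos}: {line!r}")
        tokens.append(m.group().strip())
        pos = m.end()
    return tokens

def process(source, defined, lex_skipped=True):
    """Return the tokens of the active text; directives themselves are consumed."""
    stack, out = [], []  # frame: [parent_active, branch_taken, active]
    for line in source.splitlines():
        m = DIRECTIVE.match(line)
        if m is None:
            active = stack[-1][2] if stack else True
            if active:
                out.extend(lex_text(line))
            elif lex_skipped:
                lex_text(line)  # tokenised then discarded; errors still surface
            continue
        kind, symbol = m.group(1), m.group(2).strip()
        if kind == "if":
            parent = stack[-1][2] if stack else True
            taken = parent and symbol in defined
            stack.append([parent, taken, taken])
            continue
        if not stack:
            raise ValueError(f"#{kind} without #if")
        frame = stack[-1]
        if kind == "elif":
            frame[2] = frame[0] and not frame[1] and symbol in defined
            frame[1] = frame[1] or frame[2]
        elif kind == "else":
            frame[2] = frame[0] and not frame[1]
            frame[1] = True
        else:  # endif
            stack.pop()
    return out
```

Given a skipped branch containing `this isn't valid C#`, the `lex_skipped=True` mode raises a lexical error while the other mode does not. That difference is the small gap being waved away in prose.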
Thanks for the review Mads. I’ve added a few comments to the comments to help the discussion along.
… On 9/07/2021, at 9:18 am, Mads Torgersen ***@***.***> wrote:
@MadsTorgersen approved this pull request.
This is great. I've brought up a few issues for discussion in comments, and would like to address those in the next meeting, but they should not hold up the PR.
In standard/lexical-structure.md <#342 (comment)>:
> @@ -26,23 +26,29 @@ All terminal characters are to be understood as the appropriate Unicode characte
### 7.2.2 Grammar notation
Great clarifying text here; it helps a lot with setting the context for how to think of the ANTLR grammars throughout the spec.
In standard/lexical-structure.md <#342 (comment)>:
>
The lexical processing of a C# compilation unit consists of reducing the file into a sequence of tokens that becomes the input to the syntactic analysis. Line terminators, white space, and comments can serve to separate tokens, and pre-processing directives can cause sections of the compilation unit to be skipped, but otherwise these lexical elements have no impact on the syntactic structure of a C# program.
When several lexical grammar productions match a sequence of characters in a compilation unit, the lexical processing always forms the longest possible lexical element.
> *Example*: The character sequence `//` is processed as the beginning of a single-line comment because that lexical element is longer than a single `/` token. *end example*
+The rules of the lexical grammar are ordered top-to-bottom, with the first matching rule determining the recognised token. Implicit rules ([§7.2.3](lexical-structure.md#723-lexical-grammar)) derived from literal strings in the grammar are ordered before the explicit rules. This ordering simplifies the grammar.
I am concerned about the reliance on ordering. The old spec didn't have that. It introduces an implicit context for grammar rules that prevents them from standing alone, and for instance makes it hard to reliably quote them. I'd like to suggest that we minimize the reliance on ordering, and where necessary, make a comment as to the exceptions incurred by previous rules.
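For discussion, a small Python sketch of how longest-match and rule ordering interact in an ANTLR-style lexer: the longest match wins regardless of position in the rule list, and order only decides ties. The rule set and the names `RULES`, `next_token` and `tokenize` are hypothetical and far smaller than the real grammar.

```python
import re

# Ordered top-to-bottom; the literal rule stands in for an implicit token rule.
RULES = [
    ("KEYWORD_IF", re.compile(r"if")),
    ("IDENTIFIER", re.compile(r"[A-Za-z_]\w*")),
    ("SLASH", re.compile(r"/")),
    ("COMMENT", re.compile(r"//[^\n]*")),
    ("WS", re.compile(r"[ \t]+")),
]

def next_token(text, pos):
    """Pick the longest match; on equal length the earlier rule is kept."""
    best = None
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        # Strict '>' means a later rule only wins if it matches more input.
        if m and m.end() > pos and (best is None or m.end() > best[2]):
            best = (name, m.group(), m.end())
    if best is None:
        raise ValueError(f"no rule matches at position {pos}")
    return best

def tokenize(text):
    """Tokenise the whole input, dropping whitespace tokens."""
    pos, tokens = 0, []
    while pos < len(text):
        name, lexeme, pos = next_token(text, pos)
        if name != "WS":
            tokens.append((name, lexeme))
    return tokens
```

Here `if` lexes as `KEYWORD_IF` only because that rule is listed before `IDENTIFIER`, whereas `iffy` and `//` are decided by length alone. The first case is the kind of implicit context the comment above is concerned about.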
In standard/lexical-structure.md <#342 (comment)>:
> @@ -846,10 +934,14 @@ Pre-processing directives are not tokens and are not part of the syntactic gramm
The conditional compilation functionality provided by the `#if`, `#elif`, `#else`, and `#endif` directives is controlled through pre-processing expressions ([§7.5.3](lexical-structure.md#753-pre-processing-expressions)) and conditional compilation symbols.
```ANTLR
-Conditional_Symbol
- : Identifier_Or_Keyword { IsNotTrueOrFalse() }?
+fragment PP_Conditional_Symbol
I wonder whether the ANTLR semantic predicates are a good idea within the text of the standard. I'd almost rather have an end-of-line comment to the same effect. I think this is a good topic for discussion.
In standard/lexical-structure.md <#342 (comment)>:
> @@ -974,56 +1058,63 @@ A `#undef` may "undefine" a conditional compilation symbol that is not defined.
The conditional compilation directives are used to conditionally include or exclude portions of a compilation unit.
```ANTLR
-Pp_Conditional
- : Pp_If_Section Pp_Elif_Section* Pp_Else_Section? Pp_Endif
+fragment PP_Conditional
This is the bread and butter of this approach. This is where the change is made to have conditional directives include only the directive itself, not the source text guarded by it, thus forcing tokenization of the intermediate text.
I have to say that I think it is elegant. I wouldn't want to complicate the grammar to achieve that last 0.001% expressiveness; instead we can wave it away in prose.
In standard/lexical-structure.md <#342 (comment)>:
>
-Skipped_Section_Part
- : Skipped_Characters? New_Line
- | Pp_Directive
- ;
-
-Skipped_Characters
- : Whitespace? Not_Number_Sign Input_Character*
- ;
+Conditional compilation directives shall be written as sets consisting of, in order, a `#if` directive, zero or more `#elif` directives, zero or one `#else` directive, and a `#endif` directive. Between the directives are ***conditional sections*** of source code. Each section is controlled by the immediately preceding directive. A conditional section may itself contain nested conditional compilation directives provided these directives form complete sets.
This is where we pay the price of not being able to sequence the conditional directives in grammar. That's fine - the term "set" may be confusing here, because it does not refer to a mathematical set (unordered by definition). Maybe another term, like "group", would be better? (Yes, groups are also a mathematical notion, but less likely to be understood as such in this context!)
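Since the sequencing now lives in prose rather than in the grammar, the rule could be checked separately. A minimal Python sketch, assuming the directives have already been reduced to their bare names; the function name is hypothetical:

```python
def check_conditional_directives(directives):
    """Check that directive names ('if', 'elif', 'else', 'endif') nest correctly.

    Each #if must be followed, in order, by zero or more #elif, at most one
    #else, and a closing #endif. Returns a list of error strings.
    """
    stack = []   # one entry per open #if: True once its #else has been seen
    errors = []
    for i, d in enumerate(directives):
        if d == "if":
            stack.append(False)
        elif d in ("elif", "else"):
            if not stack:
                errors.append(f"{i}: #{d} without #if")
            elif stack[-1]:
                errors.append(f"{i}: #{d} after #else")
            elif d == "else":
                stack[-1] = True
        elif d == "endif":
            if stack:
                stack.pop()
            else:
                errors.append(f"{i}: #endif without #if")
    errors.extend("unterminated #if" for _ in stack)
    return errors
```

The stack is what handles nesting: an inner `#if` ... `#endif` pair is checked on its own and then popped, so it never affects the enclosing directives.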
…ates, it is also now removed.