ANTLR-isation of the lexical and pre-processor grammars #342

Merged
merged 6 commits into from
Jul 31, 2021

Nigel-Ecma
Contributor

This is a follow-on from PR#339, which only ANTLR-ised the lexical grammar:

  • All the non-pre-processor changes in PR#339 are included.
  • The pre-processor changes in PR#339, made only so the grammar would pass
    ANTLR, are replaced by different changes in this PR.
  • The grammar in PR#339, while passing ANTLR, does not work unless the
    pre-processor productions are omitted.
  • The original grammar also passed ANTLR but did not work, for more reasons.

The lexical and pre-processor grammars in this PR both pass ANTLR and should
work 99.9999% (approximately ;-)) of the time:

  • To achieve this, the way the pre-processor is described, and its grammar,
    have been modified.
  • In this grammar each pre-processor directive is lexed as a single instance
    of a `PP_Directive` token.
  • To build a working ANTLR lexer, these tokens can either (a) be `skip`ped so
    the parser does not see them, or (b) be handled by a semantic action that
    implements the pre-processing process.
  • Both options have been tested, and a fully working pre-processor has been
    implemented as a semantic action. This is (currently) written in Java and,
    even if a C# version is written, is not intended for anything other than
    internal use in testing.
  • So what's that 99.9999% approximation? Well, the Standard (currently) allows
    input skipped over by `#if`/.../`#endif` processing to be lexically invalid.
    While the Standard describes this, the grammar when run through ANTLR does
    not. So if the feature has been taken advantage of, ANTLR will report
    lexical errors.
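The two options, (a) and (b), can be illustrated with a minimal sketch. It is written in Python to make clear that it is not the Java implementation mentioned above, and it is line-based rather than a real ANTLR lexer. The names `lex_lines`, `option_a_skip`, and `option_b_action` are hypothetical, the directive pattern is a simplification of the actual `PP_Directive` rule, and only `#define`, `#if SYMBOL`, `#else`, and `#endif` are handled.

```python
import re

# Simplified stand-in (an assumption) for the PP_Directive rule: a line whose
# first non-whitespace character is '#', followed by a directive name.
DIRECTIVE = re.compile(r"^[ \t]*#[ \t]*(\w+)[ \t]*(.*)$")


def lex_lines(source):
    """Yield ('PP_Directive', text) or ('TEXT', text), one token per line."""
    for line in source.splitlines():
        kind = "PP_Directive" if DIRECTIVE.match(line) else "TEXT"
        yield kind, line


def option_a_skip(source):
    """Option (a): drop directive tokens; no conditional processing happens."""
    return [text for kind, text in lex_lines(source) if kind != "PP_Directive"]


def option_b_action(source, defined=frozenset()):
    """Option (b): act on each directive token as it is lexed."""
    symbols = set(defined)
    active = [True]  # active[-1] says whether text is currently emitted
    out = []
    for kind, text in lex_lines(source):
        if kind == "TEXT":
            if active[-1]:
                out.append(text)
            continue
        name, arg = DIRECTIVE.match(text).groups()
        arg = arg.strip()
        if name == "define" and active[-1]:
            symbols.add(arg)
        elif name == "if":
            active.append(active[-1] and arg in symbols)
        elif name == "else":
            was_taken = active.pop()
            # The else branch is live only if the enclosing text is live
            # and the if branch was not taken.
            active.append(active[-1] and not was_taken)
        elif name == "endif":
            active.pop()
    return out


sample = "#define A\n#if A\nx = 1;\n#else\nx = 2;\n#endif\ny = 3;"
print(option_a_skip(sample))    # both branches survive
print(option_b_action(sample))  # only the taken branch survives
```

With option (a) the parser would see the text of both branches; with option (b) only the branch selected by the directives reaches it.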

Note: the combined lexical, pre-processor & syntactic grammar remains
unacceptable to ANTLR due to mutual left recursion. This still needs
to be removed by hand, as shown in the documents distributed on the mailing
list. At least some of the changes required will be submitted in PRs.

@Nigel-Ecma Nigel-Ecma self-assigned this Jun 24, 2021
@jskeet
Contributor

jskeet commented Jun 24, 2021

I'm hoping to fix the Word converter to allow fragments today.

jskeet added a commit to jskeet/csharpstandard that referenced this pull request Jun 24, 2021
@Nigel-Ecma
Contributor Author

So that you can get an overview of the grammar easily, I've manually updated the grammar.md file; it should now reflect the grammar in the other sections.

@jskeet
Contributor

jskeet commented Jun 25, 2021

I'm going to wait until we've talked about #339 before I review this - my brain can't take any more ANTLR right now...

- typos
- removal of proper (British) spellings ;-)
- improved phrasing
- straightened all the quotes in lexical-structure.md *only*; checking across the whole Standard I leave to someone else. I hope the Word conversion puts them back in.
- etc.
jskeet added a commit that referenced this pull request Jun 29, 2021
This should enable #339 and #342 in terms of Word conversion.
@gafter gafter self-assigned this Jun 30, 2021
Contributor

@MadsTorgersen MadsTorgersen left a comment

This is great. I've brought up a few issues for discussion in comments, and would like to address those in the next meeting, but they should not hold up the PR.

@@ -26,23 +26,29 @@ All terminal characters are to be understood as the appropriate Unicode characte

### 7.2.2 Grammar notation
Contributor

Great clarifying text here; it helps a lot with setting the context for how to think of the ANTLR grammars throughout the spec.

@@ -846,10 +934,14 @@ Pre-processing directives are not tokens and are not part of the syntactic gramm
The conditional compilation functionality provided by the `#if`, `#elif`, `#else`, and `#endif` directives is controlled through pre-processing expressions ([§7.5.3](lexical-structure.md#753-pre-processing-expressions)) and conditional compilation symbols.

```ANTLR
Conditional_Symbol
: Identifier_Or_Keyword { IsNotTrueOrFalse() }?
fragment PP_Conditional_Symbol
```
Contributor

I wonder whether the ANTLR semantic predicates are a good idea within the text of the standard. I'd almost rather have an end-of-line comment to the same effect. I think this is a good topic for discussion.

Contributor Author

I wonder as well...

I'm currently tending towards the idea of having a section in Appendix A (which is all informative) which suggests where they might be used etc. as part of helping people actually use the grammar but without making any of it normative or including implementations (both of which are satisfied already, just distributed within the text).

On the other hand the inline predicate in PP_Start is rather key from an ANTLR perspective... dither, dither ;-)

Contributor

Resolved in meeting to remove them all and potentially add a section in the appendix later.

Contributor Author

OK, all semantic predicates except the one on PP_Start have been removed. The PP_Start one is in-line and fundamental: without it, or something in its place, tokenising might be incorrect. However, we could move it to the note if that is the consensus.
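
For readers unfamiliar with the predicate discussed in this thread, here is a minimal sketch of the check that `IsNotTrueOrFalse()` stands for: a conditional symbol is any identifier or keyword other than `true` or `false`. The Python function names are hypothetical, and the identifier pattern is an ASCII-only simplification of `Identifier_Or_Keyword`, so this is an illustration and not the grammar's definition.

```python
import re

# ASCII-only simplification (an assumption); the Standard's
# Identifier_Or_Keyword also admits Unicode letters and escapes.
IDENTIFIER_OR_KEYWORD = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_not_true_or_false(text):
    """The condition the semantic predicate expresses."""
    return text not in ("true", "false")


def is_conditional_symbol(text):
    """Identifier-or-keyword shape, gated by the predicate."""
    return (IDENTIFIER_OR_KEYWORD.fullmatch(text) is not None
            and is_not_true_or_false(text))


for candidate in ("DEBUG", "class", "true", "false", "1abc"):
    print(candidate, is_conditional_symbol(candidate))
```

Keywords such as `class` are accepted as conditional symbols; only `true` and `false` are excluded, which is what an end-of-line comment or a note in the appendix would need to convey in place of the predicate.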

@@ -974,56 +1058,63 @@ A `#undef` may "undefine" a conditional compilation symbol that is not defined.
The conditional compilation directives are used to conditionally include or exclude portions of a compilation unit.

```ANTLR
Pp_Conditional
: Pp_If_Section Pp_Elif_Section* Pp_Else_Section? Pp_Endif
fragment PP_Conditional
```
Contributor

This is the bread and butter of this approach. This is where the change is made to have conditional directives include only the directive itself, not the source text guarded by it, thus forcing tokenization of the intermediate text.

I have to say that I think it is elegant. I wouldn't want to complicate the grammar to achieve that last 0.001% expressiveness; instead we can wave it away in prose.
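
The consequence described here, that text between conditional directives is always tokenised, can be seen in a small sketch. It assumes a toy token set (identifiers, integers, a few punctuators) and a hypothetical `lexical_errors` helper; it is not the ANTLR-generated lexer, only an illustration of why a lexically invalid character in a branch that would be skipped is still reported.

```python
import re

# Toy token set (an assumption): whitespace, identifiers, integers, punctuators.
TOKEN = re.compile(r"\s+|[A-Za-z_]\w*|\d+|[=;+\-*/(){}]")
# A directive line is treated as one token and never inspected further.
DIRECTIVE = re.compile(r"^[ \t]*#")


def lexical_errors(source):
    """Return (line_number, offending_char) for every unlexable character.

    Every non-directive line is lexed, whether or not a pre-processor
    would later discard it.
    """
    errors = []
    for number, line in enumerate(source.splitlines(), start=1):
        if DIRECTIVE.match(line):
            continue
        pos = 0
        while pos < len(line):
            match = TOKEN.match(line, pos)
            if match:
                pos = match.end()
            else:
                errors.append((number, line[pos]))
                pos += 1
    return errors


sample = "#if NEVER_DEFINED\nx = 1 ` 2;\n#endif\ny = 3;"
print(lexical_errors(sample))
```

The backquote on line 2 sits in a branch that conditional compilation would exclude, yet it is still reported. That is the small remainder which the prose, not the grammar, has to account for.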

@Nigel-Ecma
Contributor Author

Nigel-Ecma commented Jul 19, 2021 via email

5 participants