Skip to content

ANTLR: Deciding on how far to go with this #37

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
RexJaeschke opened this issue Nov 20, 2020 · 41 comments · Fixed by #233
Closed

ANTLR: Deciding on how far to go with this #37

RexJaeschke opened this issue Nov 20, 2020 · 41 comments · Fixed by #233
Assignees
Labels
status: dependency This issue or PR depends on another (see comments) type: process This issue relates to processes
Milestone

Comments

@RexJaeschke
Copy link
Contributor

RexJaeschke commented Nov 20, 2020

On 2020-10-03, I circulated the following paper: ANTLR and the C# Spec.docx. In that, I discussed using the ANTLR grammar notation and the non-trivial amount of work I thought it would take to get the grammar to validate using the ANTLR toolkit.

Neal has reviewed that paper and his feedback follows in a comment below, with a proposal on how we might go forward. Once we discuss (and probably agree with) his suggestion, I'll update this comment by adding the remaining questions and to keep track of our decisions on those questions .

Here are the ANTLR-related questions and their status:

Proposal # Status Description
1 Yes it should Decide whether our grammar should be complete enough to validate (3 Rex's paper). Note that validation here means "get through the ANTLR tool without error," which doesn't necessarily mean that any parser generated will be complete/correct. This is especially true for semantic predicates.
2 Resolved, PRs #208, #209 Change case of lexer rule names (5.1 Rex's paper)
3 Resolved, PR #261 Expanding lexer rule alternatives of the form <...> (3 Rex's paper)
4 Resolved, PR #225 Limited use of token names (5.3 Rex's paper)
5 In-progress Resolve Mutual Left-Recursion issue (15 Rex's paper)
6 Resolved, PR #233 How to handle the unsafe extensions
7 Resolved Marking lexer helper rules with fragment
8 Resolved Ultimately, revise 7.2, "Grammars" in lexical-structure.md to say what we do/don't do w.r.t full grammar validation

When you add a comment to this issue, please limit it to a single Proposal # and say which Proposal # that is.

Any decisions we make that require changes to the grammar notation will not only have to be made to the v5 grammar we're starting with in v6, but also possibly to the v6 (and v7) PRs in waiting.

@RexJaeschke RexJaeschke added meeting: discuss This issue should be discussed at the next TC49-TG2 meeting meeting: proposal There is an informal proposal in the issue, worth discussing in a meeting labels Nov 20, 2020
@RexJaeschke
Copy link
Contributor Author

[From Neal Gafter in private mail to Rex on 2020-11-18, after having reviewed Rex's paper.]

We should not attempt to produce a grammar that will validate using the ANTLR toolkit. There are some parts of the syntax that cannot be neatly explained in the grammar, and would result in a grammar that definitely does not validate (i.e. it is ambiguous). https://github.com/ECMA-TC49-TG2/conversion-to-markdown/blob/v6-draft/lexical-structure.md#725-grammar-ambiguities is a good example. Interpolated strings are another (separating the syntactic and lexical parts might make things less clear). See also https://github.com/ECMA-TC49-TG2/conversion-to-markdown/blob/v6-draft/expressions.md#1288-cast-expressions. I think we should use the formalism of ANTLR to document our syntax with as much rigor as we have historically used.

If we wanted to do this [validate, that is], we probably would use parser modes for async (different set of keywords inside an async method), LINQ (different set of keywords inside the query), and possibly interpolated strings.

As for unsafe, it can be written

class_modifier
    : 'unsafe'
    ;

because (if I'm not mistaken) you are allowed to have multiple rules for the same nonterminal, and they are merged. But the surrounding test should make it clear that this is in addition to other alternatives specified elsewhere.

We specify associativity and precedence by spelling out the structure of the expression grammar explicitly. We don't need to specify precedence or associativity using ANTLR features.

[Rex drafted the following text.]

Proposal 1: We should use ANTLR notation with common conventions. However, we should not spend any effort on getting the grammar to validate with the ANTLR toolkit.

Assuming we go this route, we'll need to add some appropriate text to 7.2, Grammars.

@RexJaeschke
Copy link
Contributor Author

If we buy into Proposal 1 (#37 (comment)), we avoid the following (the numbers in parens refer to sections in Rex's paper):

  • Having to convert all descriptive alternatives of the form '<...>' (3), such as the following:
Not_Slash_Or_Asterisk
    : '<Any Unicode character except / or *>'
    ;

Letter_Character
    : '<A Unicode Character of classes Lu, Ll, Lt, Lm, Lo, or Nl>'
    | '<A Unicode_Escape_Sequence representing a Char of classes Lu, Ll, …>'
    ;
  • We can ignore the fragment rule name prefix (6)
  • We can ignore separating the lexer and parser grammar into different files (9)
  • We can ignore lexical modes (12)
  • We don't need to handle ambiguities in the grammar notation itself (13)
  • We can probably leave the handling of unsafe extensions as they are (14)
  • We can ignore ANTLR's limits on recursive rules (15)
  • We can ignore using explicit operator-associativity notation (17)

@RexJaeschke
Copy link
Contributor Author

RexJaeschke commented Nov 20, 2020

Proposal 2: Change the spelling of lexer rule names to use a leading uppercase letter on each word and an underscore as a word separator, as in Not_Slash_Or_Asterisk and Letter_Character.

The ANTLR convention is that a lexer rule name begins with an uppercase letter, and may be followed by uppercase and/or lowercase letters, digits, and underscores.

We do not currently begin lexer rule names with an uppercase letter, and we should. However, I think that using all-caps is distracting. And as we're used to having a dash separator in v5 and prior (and now underscore), it seems useful to continue having separators. The remaining question then is how to use case. Whatever the decisions made, it's a matter of running a set of mechanical conversions on the existing spec grammar fragments and rule names in the narrative to make these changes.

If we adopt this change, it will involve a lot of edits to the Lexical structure clause and a few in other clauses that refer to lexer rules. We can't simply use a find-and-replace-text tool as we have some lexer rule names that are commonly used English words in the spec (such as whitespace, comment, identifier, and keyword). So, we'll want to convert all lexer rule names inside a grammar block and between italic fences, as in rule_name, but not elsewhere.

Recommendation: We change to starting lexer rule names with an uppercase letter. As to whether we up-case the first letter in subsequent words is debatable. Here are the obvious styles:
Integer_literal // first word leading cap
Integer_Literal // all words leading cap [Rex's first choice]
IntegerLiteral // all words leading cap, no separators

Before locking in a decision, we should look at parser rule names, and be consistent or deliberately different. Perhaps the simplest rule to follow here is “all lowercase except the initial letter of each word.” That is the style I use throughout this paper.

Resolved: During the telcon on 2021-01-13, we chose Recommendation 2, Integer_Literal, with all words having a leading uppercase letter, and words separated by underscore.

@RexJaeschke RexJaeschke self-assigned this Nov 20, 2020
@Nigel-Ecma
Copy link
Contributor

Nigel-Ecma commented Nov 21, 2020 via email

@Nigel-Ecma
Copy link
Contributor

Nigel-Ecma commented Nov 21, 2020

After the previous comment I glanced quickly at the ANTLR docs thinking it surely must handle Unicode character classes and it does, see https://github.com/antlr/antlr4/blob/master/doc/lexer-rules.md.

@Nigel-Ecma
Copy link
Contributor

Nigel-Ecma commented Nov 22, 2020

I've taken a brief look at ANTLR and I beleive the rules Rex quotes above can be written:

Not_Slash_Or_Asterisk
    : ~[*/] // Any Unicode character except / or *
    ;

Letter_Character
    : [\p{Lu}\p{Ll}\p{Lt}\p{Lm}\p{Lo}\p{Nl}] // A Unicode Character of classes Lu, Ll, Lt, Lm, Lo, or Nl
    | Unicode_Escape_Sequence { IsLetterCharacter() }? // as previous but using an escape sequence
    ;

The last rule includes a semantic predicate, { IsLetterCharacter() }? this a way context-sensitive or other semantic checks can be included in the syntactical grammar. If the function IsLetterCharacter() returns false the alternative is "unviable". Semantic predicates can be included in both parser and lexer rules. The predicates themselves can be defined in the same file as the grammar; are written in Java, C++, Javascript or any other language ANTLR supports; and have access to the source being parsed/lexed.

The above doesn't tell us that all the content-sensitive/semantic needs of the C# grammar can be supported, but I think we should find out – I have a certain unease in publishing a language Standard with a known invalid/incomplete grammar.

@jskeet
Copy link
Contributor

jskeet commented Dec 16, 2020

I'm definitely fine with making it clear that the grammar doesn't validate. If there's no way we can make it validate, for reasons expressed before (e.g. interpolated string literals and the grammar ambiguities in 7.2.5) I would be interested to know how feasible it would be to make it "nearly validate" - is there anything we can do to ensure that we don't create invalid rules by mistake? If ANTLR can list the ways in which it's invalid, and if that's a small-but-sensible list, we could perform validation within a GitHub action and compare the result against the expected set of failures.

@jskeet
Copy link
Contributor

jskeet commented Dec 16, 2020

Resolutions from Dec 16, 2020 meeting:

  • We should follow not just the syntax but conventions of ANTLR, and that can be a medium-term goal
  • We will not immediately try to make the grammar completely valid, but that's something we might do in the future (and may well be a good task for a contributor - we should reach out to Terence Parr once the first bullet is done)
  • We will add a GitHub Action to extract and validate the grammar, with a known set of existing issues

@RexJaeschke
Copy link
Contributor Author

During the Dec 2020 teleconference, I took an action item to identify the ANTLR convention changes we could/should use. As a result, I have updated the first entry in this issue (#37 (comment)). My recommendations are shown with TBD in the table's Status column.

@jskeet jskeet added type: process This issue relates to processes and removed meeting: discuss This issue should be discussed at the next TC49-TG2 meeting meeting: proposal There is an informal proposal in the issue, worth discussing in a meeting labels Jan 7, 2021
@jskeet
Copy link
Contributor

jskeet commented Jan 8, 2021

@RexJaeschke: I've just removed the stale meeting labels... do we want to come back to this in the January meeting, in which case I'll reapply them?

@RexJaeschke
Copy link
Contributor Author

@jskeet: I'd very much like us to decide on Proposal 2 (#37 (comment)) on the next call. Assuming we agree to make this change, I can then get started on implementing it in the base spec as well as in the V6 PRs that are impacted..

@jskeet jskeet added the meeting: discuss This issue should be discussed at the next TC49-TG2 meeting label Jan 9, 2021
@jskeet
Copy link
Contributor

jskeet commented Jan 9, 2021

@RexJaeschke: Thanks - I've added the discuss label. Hopefully if we limit the scope to just that topic, it won't take much time.

@jskeet
Copy link
Contributor

jskeet commented Jan 13, 2021

We have decided to accept proposal 2, going with the Integer_Literal style (so PascalCasing but with underscores as well).

@RexJaeschke
Copy link
Contributor Author

Getting Started with the ANTLR Toolkit

For anyone who wants to play with the ANTLR toolkit, I have created the following document: ANTLR Grammar Validation.docx.

@RexJaeschke
Copy link
Contributor Author

Proposal 4: Make limited use of token names

In 5.3 of my paper (mentioned at the very top of this issue), I discussed the possibility of using using some extra lexer rules, but as synonyms for individual tokens. I now argue in favor of us going that route. Specifically, by adding the following:

DEFAULT  : 'default' ;
NULL     : 'null' ;
TRUE     : 'true' ;
FALSE    : 'false' ;
ASTERISK : '*' ;
SLASH    : '/' ;

[Note the naming convention: Such lexer rules are all-caps to distinguish them from "ordinary" lexer rules, which start with a single cap on each word.]

the multiple places where the expanded literal is used are replaced with the name,. For example:

Boolean_Literal
    : TRUE
    | FALSE
    ;
Keyword
    ...
    | TRUE
    ...
    ;
Pp_Primary_Expression
    : TRUE
    | FALSE
    ...
    ;

@RexJaeschke
Copy link
Contributor Author

Proposal 5: Resolving Mutual Left-Recursion issue

As I described in Item 15 of my paper, this issue is a major blocking factor both for the lexer and syntactic grammar. In the next month or so, it would be good to have someone assigned to looking at this.

@RexJaeschke
Copy link
Contributor Author

Proposal 6: Handling the unsafe extensions

Currently, we define most of these extensions in unsafe-code.md using the following approach:

class_modifier
    : …
    | 'unsafe'
    ;

where that rule is defined in classes.md, as

class_modifier
    : 'new'
    | 'public'
    | 'protected'
    | 'internal'
    | 'private'
    | 'abstract'
    | 'sealed'
    | 'static'
    ;

As such, we cannot simply include in the grammar we input to the ANTLR toolkit the grammar from unsafe-code.md in a verbatim manner.

BTW, while most extensions involve the addition of an alternative line to an existing rule, several require changes to existing lines. And the extension grammar modifies rules in a number of md files, not just classes.md.

The challenge here is how to deal with this, long term, so we can extract two separate grammars, one core, the other, extended.

I propose that we merge the unsafe grammar extensions directly into the corresponding places in the grammar blocks of the early chapters

Some thoughts to consider:

  • Close to 100% of the readers of our spec likely are using an implementation that supports the unsafe extensions.
  • We add //-style ANTLR comments to the unsafe extensions, so they are easily discernable in the grammar blocks
  • The grammar extraction tool need not be modified; it will now extract the unsafe extensions as part of the ordinary grammar blocks.
  • In Chapter 23, we remove the grammar extensions, and where we reference rules in earlier chapters, we add links to them.
  • This completely eliminates the need to address Issue Dependencies on Chapter 23, "Unsafe code," in earlier chapters #211.

@RexJaeschke RexJaeschke added this to the C# 6 milestone Jan 26, 2021
@Nigel-Ecma
Copy link
Contributor

Proposal 4: Probably ambivalent on this one myself, for punctuation it often makes sense, don't know how often true/false occur in the grammar offhand but there is no objection really to any key/reserved word getting its own lexical rule.

Proposal 5: No comment ;-) That said I'll probably look into this sometime...

Proposal 6: My first thought was "I wonder how hard hacking ANTLR is to support continuation productions..." :-)

At the moment the language is in terms of "unsafe extensions", making the proposed change might change it more to "unsafe support is optional" and if so maybe the proposed grammar presentation suggest wording changes should be considered as part of it.

C# effectively has two grammars, one with unsafe productions and one without, and that is how it is currently presented. When checking the grammar do not both grammars need to be checked independently?

At the moment we have to combine the "continuation" productions to get the grammar with optional parts, easily enough done by searching for ellipses, if we present a single grammar complete with optional parts ripping the latter out to check the grammar without optional parts might end up being more work?

On balance at the moment I'd keep the continuation productions, but its not an unchangeable view!

@jskeet
Copy link
Contributor

jskeet commented Feb 10, 2021

Meeting 2021-02-10:

  • Proposal 4: accepted
  • Proposal 5: a member of the C# compiler team is going to help; Jon has sent an invitation
  • Proposal 6: discussed in more detail... proposal of:
    • Explicitly calling out the change to "conformant compilers must know the whole grammar, and give an appropriate error when unsafe code is present if they don't support it"
    • One grammar
    • When something referring to unsafe aspects of the grammar, we use a new production that we refer to from earlier chapters, but is defined in clause 23
    • Rex to try this

@sharwell
Copy link
Contributor

sharwell commented Feb 11, 2021

Regarding proposal 5: my fork of ANTLR supports indirect left recursion elimination.

I need to review the other proposals above to see if I have other feedback on them.

@RexJaeschke
Copy link
Contributor Author

Proposal 3: Dealing with <…>-style Lexer Rule Alternatives

@sharwell , @Nigel-Ecma please review this.

(See Nigel’s comment re this at #37 (comment).) Also see Issue #230.

Item 1: Easy Fixes

We have about a dozen occurrences of alternatives consisting of one, two, or “any but one” character. The following recommendations show the current syntax followed by the proposed syntax:

Recommendation 1A: For printable characters, use that character, possibly with a comment if multiple glyphs might look similar, as follows:

'<_ the underscore character (U+005F)>'
'_'    // the underscore character (U+005F)

Recommendation 1B: For non-printable characters having an ANTLR escape, optionally with comment, use that escape, as follows:

'<Line feed character (U+000A)>'
'\n'    // Line feed

Recommendation 1C: For all other non-printable characters, use a Unicode sequence with comment, as follows:

'<Paragraph separator character (U+2029)>'
'\u2029'    // Paragraph separator

BTW, ANTLR has extended its support for Unicode escapes; \uXXXX has been augmented by \u{XXXXXX}.

Recommendation 1D: For “any but” characters, use an ANTLR set, possibly with comment, as follows:

'<any character except ">'
~["]    // any character except "

Item 2: Simple but Inelegant Fixes

A number of alternatives reference other rules, including New_Line_Character, SLASH, or ASTERISK. For example, in the case of

'<Any Unicode character except a New_Line_Character>'

the ideal replacement would seem to be

~New_Line_Character

Unfortunately, the operand of the not-operator (~) is considered to be a sub-rule, and sub rules are not permitted to contain rule names (such as New_Line_Character), so our only recourse is to use the expanded rule.

Recommendation 2: Use one of the following approaches instead:

~[\r\n\u0085\u2028\u2029]   // anything but New_Line_Character
~('\r' | '\n' | '\u0085' | '\u2028' | '\u2029')   // anything but New_Line_Character

Item 3: Unicode Characters in Identifiers

A number of alternatives refer to Unicode class rules. (As reported by Nigel) ANTLR now has direct support for these. The following recommendations show the current syntax followed by the proposed syntax:

'<Any character with Unicode class Zs>'
[\p{Zs}]   // Any character with Unicode class Zs

Zs is the short name for the long name Space_Separator (which may be used instead).

Note: A \p{…} sequence can only appear in a set, even though this instance involves a set of 1!

FYI, uppercase \P negates the selection.

Similarly,

'<A Unicode character of classes Mn or Mc>'
[\p{Mn}\p{Mc}]   // category Mark, subcategories non-spacing and spacing combining

The following introduces general categories, which are groupings of related subcategories; the final alternative is the shortest:

'<A Unicode character of classes Lu, Ll, Lt, Lm, Lo, or Nl>'
[\p{Lu}\p{Ll}\p{Lt}\p{Lm}\p{Lo}\p{Nl}]   //
[\p{L}\p{Nl}]   // category Letter, all subcategories; Category Number, subcategory letter

Recommendation 3: Use the shortest possible set of \p{…} sequences with Unicode short names, possibly with a comment, as shown above.

Item 4: Identifiers Containing Unicode_Escape_Sequence

Here, we’re dealing with the subtle difference between things like '\\u005F' and '\u005F', the first being a 6-character string, the second being a single character.

Let’s begin with a simple example of the general problem; how can we rewrite the following?:

Underscore_Character
    …
    | '<A Unicode_Escape_Sequence representing the character U+005F>'
;

A rewrite is

Underscore_Character
    …
    | '\\u005F' // underscore string
    | '\\u005f' // underscore string
    ;

We need to allow for either case for the hex digits a-f. Of course, if an allowed letter, decimal digit, or separator contains multiple hex digits a-f, we’d need four or more alternatives, which gets unwieldy. Fortunately, the Underscore_Character rule is only used in one place, by Identifier_Start_Character. (Underscores in other positions are handled as part of Connecting_Character.)

Recommendation 4A: Show both spellings of the hex sequence for this case.

Here is the more general case: How to deal with the following?

'<A Unicode_Escape_Sequence representing a character of classes Lu, Ll, Lt, Lm, Lo, or Nl>'

To address this, Nigel proposed using the following

Letter_Character
    …
    | Unicode_Escape_Sequence { IsLetterCharacter() }? //
    ;

The last rule includes a semantic predicate, { IsLetterCharacter() }? this a way context-sensitive or other semantic checks can be included in the syntactical grammar. If the function IsLetterCharacter() returns false the alternative is "unviable". Semantic predicates can be included in both parser and lexer rules. The predicates themselves can be defined in the same file as the grammar; are written in Java, C++, Javascript or any other language ANTLR supports; and have access to the source being parsed/lexed.

After a brief read about semantic predicates, and some testing, I found that Nigel’s approach validates.

This approach would also need corresponding semantic predicate alternatives, like the following:

Combining_Character
    : …
    | Unicode_Escape_Sequence { IsCombiningCharacter() }? 
    ;

Decimal_Digit_Character
    : …
    | Unicode_Escape_Sequence { IsDecimalDigitCharacter() }? 
    ;

Connecting_Character
    : …
    | Unicode_Escape_Sequence { IsConnectingCharacter() }? 
    ;

Formatting_Character
    : …
    | Unicode_Escape_Sequence { IsFormattingCharacter() }? 
    ;

While this approach validates, we’d probably have to add a general note up front re grammars as well as one in the identifier lexer rule section, saying what these things are. Although we don’t have a goal of actually using ANTLR to create a lexer or parser, for testing purposes, I do find it useful to do so, at least on a subset of the language. However, to do that, I’d have to add (at least skeletal) definitions for these methods.

I see two ways to go here:

  1. Leave it as is (and not explain how one might deal with this in an ANTLR-generated lexer, but up-front do explain the use of <...>, in general)
  2. Add semantic predicates and accompanying notes as discussed above

Recommendation 4B: What approach do we want?

Item 5: Other Challenges

We have two alternatives that cannot (easily) be expressed in a grammar:

'<An Identifier_Or_Keyword that is not a Keyword>'
'<Any Identifier_Or_Keyword except true or false>'

My first exposure to grammars was with the C Standard, in which each section having grammar puts that in a Syntax subsection, which is optionally followed by a Constraints subsection. So, I think of each exception in the examples above as being a constraint.

I see three ways to go here:

  1. Leave it as is
  2. Simplify the alternative and state the constraint in the accompanying comment
  3. Simplify the alternative and state the constraint in text below the grammar

These are illustrated below, in order:

'<Any Identifier_Or_Keyword except true or false>'
Identifier_Or_Keyword   // But NOT true or false
Identifier_Or_Keyword   // see constraint below

Constraint: The Identifier_Or_Keyword in Conditional_Symbol shall not be true or false.

Recommendation 5: What approach do we want?

@Nigel-Ecma
Copy link
Contributor

Nigel-Ecma commented Mar 8, 2021

Following comments on #225 & #230:

Recommendation 1B: skip using ANTLR escapes and go straight to Unicode

Recommendation 1D: allow sets or alternates as in Recommendation 2, the subjective choice based on readability

Recommendation 2: drop the ANTLR escapes but otherwise allow either

Recommendation 3: Yes use the short names, I think they are in more common usage

Recommendation 4A: Either both spellings or follow 4B and use semantic predicates – choose subjectively

Recommendation 4B: Semantic predicates, so we have a grammar which works. We can provide C# code for the predicates as I understand ANTLR can output C# (I haven't confirmed that and was wondering whether adding Java code for them would be a good look or not ;-))

Recommendation 5: I'd say semantic predicates again. Wherever we use a semantic predicate we can name it and/or add a comment to indicate its purpose - { ButNotTrueOrFalse() }? or { IsIdentifierOrKeywordOtherThanTrueAndFalse() }? maybe

@jskeet
Copy link
Contributor

jskeet commented Mar 10, 2021

(Note to myself as much as anyone else.) Presumably in 4A we could use [fF] to avoid show two entirely separate options?

Other than that note, I agree with Nigel. Presumably for recommendation 5, we really need to use something more than textual constraints if we want the grammar to validate correctly?

@sharwell
Copy link
Contributor

Where can I easily get the jar file for your optimized fork? I'm happy to plug that in.

It's on Maven Central here:
https://search.maven.org/artifact/com.tunnelvisionlabs/antlr4/4.9.0/jar

It looks like the direct download link is:
https://repo1.maven.org/maven2/com/tunnelvisionlabs/antlr4/4.9.0/antlr4-4.9.0-complete.jar

@sharwell
Copy link
Contributor

Item 1-3 seem fine for me.

4A: '\\u005F' is easy, but there is a combinatorics problem for each additional letter. I recommend using '\\u005' [fF] instead.

4B: A semantic predicate is the least ugly option here.

5: At this point we're talking more about parser rules than lexer rules.

An Identifier_Or_Keyword that is not a Keyword

I would recommend rewriting this as:

availableIdentifier
  : Identifier
  | contextualKeyword
  ;

contextualKeyword
  : // itemized list of all keyword-like tokens that are not keywords
  ;

Any Identifier_Or_Keyword except true or false

Handling this depends on whether we treat pre-processing as part of the normal lexer pass or as a separate pass before the lexer. If it's all one pass, we could use this:

conditionalKeyword
  : 'true' | 'false'
  ;

keyword
  : conditionalKeyword
  | // all the keywords except true and false
  ;

@RexJaeschke
Copy link
Contributor Author

RexJaeschke commented Mar 12, 2021

Re Proposal 6 (unsafe grammar), on this week's call we agreed that "unsafe type" should be "pointer" type. With that change, Rex's proposal was accepted, and he'll turn that into a PR.

@RexJaeschke
Copy link
Contributor Author

Regarding Proposal 3, "<...>" (#37 (comment)), I've taken into account input from Nigel (#37 (comment)) and Sam (#37 (comment)).

The handling of the Unicode sequences in identifiers results in the following grammar, with current alternatives to be replaced commented out and followed by their replacement(s):

Letter_Character
//    : '<A Unicode character of classes Lu, Ll, Lt, Lm, Lo, or Nl>'
//    | '<A Unicode_Escape_Sequence representing a character of classes Lu, Ll, Lt, Lm, Lo, or Nl>'

    : [\p{L}\p{Nl}]     // category Letter, all subcategories; category Number, subcategory letter
    | Unicode_Escape_Sequence { IsLetterCharacter() }?
    ;

Combining_Character
//    : '<A Unicode character of classes Mn or Mc>'
//    | '<A Unicode_Escape_Sequence representing a character of classes Mn or Mc>'

    : [\p{Mn}\p{Mc}]    // category Mark, subcategories non-spacing and spacing combining
    | Unicode_Escape_Sequence { IsCombiningCharacter() }?
    ;

Decimal_Digit_Character
//    : '<A Unicode character of the class Nd>'
//    | '<A Unicode_Escape_Sequence representing a character of the class Nd>'

    : [\p{Nd}]      // category Number, subcategory decimal digit
    | Unicode_Escape_Sequence { IsDecimalDigitCharacter() }?
    ;

Connecting_Character
//    : '<A Unicode character of the class Pc>'
//    | '<A Unicode_Escape_Sequence representing a character of the class Pc>'

    : [\p{Pc}]      // category Punctuation, subcategory connector
    | Unicode_Escape_Sequence { IsConnectingCharacter() }?
    ;

Formatting_Character
//    : '<A Unicode character of the class Cf>'
//    | '<A Unicode_Escape_Sequence representing a character of the class Cf>'

    : [\p{Cf}]      // category Other, subcategory format
    | Unicode_Escape_Sequence { IsFormattingCharacter() }?
    ;

While the definitions of the Is* predicate methods need to be in the grammar file input to ANTLR when validation occurs (as part of Bill's actions), I see a way to smuggle them in without having to include them in the spec.

This leaves the final two uses of the `'<...>' notation

Available_Identifier
//    : '<An Identifier_Or_Keyword that is not a Keyword>'

    : Identifier_Or_Keyword { IsNotAKeyword() }?
    ;

Conditional_Symbol
//    : '<Any Identifier_Or_Keyword except true or false>'

    : Identifier_Or_Keyword { IsNotTrueOrFalse() }?
    ;

Nigel and I support the use of semantic predicates here as well, which I have shown above. Sam proposed some rearrangement of the grammar instead. @sharwell, is the semantic predicate approach OK with you?

How does this all look to you, @Nigel-Ecma?

@RexJaeschke
Copy link
Contributor Author

Proposal 5: Resolving Mutual Left-Recursion issue

@MadsTorgersen, @sharwell, @Nigel-Ecma

There are three lexer rules that have this problem, all in the preprocessor grammar, w.r.t Pp_Expressions involving ||, &&, ==, and !=.

On a whim, I rewrote those rules, as follows, with each commented-out line being replaced by the one following, as follows:

Pp_Or_Expression
    : Pp_And_Expression

//  | Pp_Or_Expression Whitespace? '||' Whitespace? Pp_And_Expression
    | Whitespace? Pp_And_Expression '||' Pp_Or_Expression Whitespace?
    ;
    
Pp_And_Expression
    : Pp_Equality_Expression

//  | Pp_And_Expression Whitespace? '&&' Whitespace? Pp_Equality_Expression
    | Whitespace? Pp_Equality_Expression '&&' Pp_And_Expression Whitespace?
    ;

Pp_Equality_Expression
    : Pp_Unary_Expression

//  | Pp_Equality_Expression Whitespace? '==' Whitespace? Pp_Unary_Expression
    | Whitespace? Pp_Unary_Expression '==' Pp_Equality_Expression Whitespace? 
    
//  | Pp_Equality_Expression Whitespace? '!=' Whitespace? Pp_Unary_Expression
    | Whitespace? Pp_Unary_Expression '!=' Pp_Equality_Expression Whitespace?  
    ;

All I did was simply swap the two operands over, and it validated! But are the new forms equivalent/correct?

@RexJaeschke RexJaeschke added the meeting: discuss This issue should be discussed at the next TC49-TG2 meeting label Apr 15, 2021
@Nigel-Ecma
Copy link
Contributor

Nigel-Ecma commented Apr 16, 2021 via email

@RexJaeschke RexJaeschke removed the meeting: discuss This issue should be discussed at the next TC49-TG2 meeting label May 5, 2021
@RexJaeschke
Copy link
Contributor Author

This Issue was closed automatically when the unsafe grammar PR was merged, but we still have the left-recursive situation to deal with, so am re-opening.

@RexJaeschke RexJaeschke reopened this May 13, 2021
@Nigel-Ecma
Copy link
Contributor

Nigel-Ecma commented Jun 1, 2021

Issues #259 & #260 raise questions which are dependent on this issue and investigations into the grammar it has triggered. Those questions should be considered in any resolution of this one.

@Nigel-Ecma Nigel-Ecma added the status: dependency This issue or PR depends on another (see comments) label Jun 1, 2021
@RexJaeschke
Copy link
Contributor Author

Proposal 7: Marking lexer helper rules with fragment

Previously, it seemed that applying fragment to a lexer rule might be useful, but not necessary. However, as Nigel discovered, it really should be used on all lexer rules that help other rules. That is, any lexer rule that does not result in a complete token should be marked fragment. For example:

Decimal_Integer_Literal     // forms a complete token; is not a helper rule
    : Decimal_Digit+ Integer_Type_Suffix?
    ;
    
fragment
Decimal_Digit         // helper rule; should have fragment prefix
    : '0'..'9'
    ;
    
fragment
Integer_Type_Suffix         // helper rule; should have fragment prefix
    : 'U' | 'u' | 'L' | 'l' | 'UL' | 'Ul' | 'uL' | 'ul' | 'LU' | 'Lu' | 'lU' | 'lu'
    ; 

As such, Nigel will add this prefix, as appropriate, as part of his efforts.

@RexJaeschke
Copy link
Contributor Author

Proposal 8: Updating text in lexical-structure.md to state our intent w.r.t using ANTLR notation vs. having an ANTLR-ready grammar.

Nigel is handling this as part of his grammar overhaul PRs.

@RexJaeschke
Copy link
Contributor Author

RexJaeschke commented Oct 7, 2021

@Nigel-Ecma I'm inclined to close this Issue, unless you want to keep it open as a hook for things you are still working on.

@Nigel-Ecma
Copy link
Contributor

Nigel-Ecma commented Oct 8, 2021 via email

@RexJaeschke
Copy link
Contributor Author

Almost all of the outstanding tasks were handled by Nigel in various PRs, including #339, #342, and #351. Closing this Issue with Nigel's consent.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status: dependency This issue or PR depends on another (see comments) type: process This issue relates to processes
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants