ANTLR-isation of parse/syntactic grammar #351

Merged: 6 commits, Sep 27, 2021
9 changes: 7 additions & 2 deletions standard/enums.md
@@ -26,7 +26,12 @@ enum_declaration
;

enum_base
: ':' struct_type
: ':' integral_type
Comment (Contributor, Author):

These changes narrow down enum_base to avoid bad stuff happening. Text below has a small addition to describe the two alternatives.

Comment (Contributor):

That seems fine

| ':' integral_type_name
;

integral_type_name
: type_name // restricted to one of System.{SByte,Byte,Int16,UInt16,Int32,UInt32,Int64,UInt64}
;

enum_body
@@ -35,7 +40,7 @@ enum_body
;
```

Each enum type has a corresponding integral type called the ***underlying type*** of the enum type. This underlying type shall be able to represent all the enumerator values defined in the enumeration. If the *enum_base* is present, it explicitly declares the underlying type. The underlying type shall be one of the *integral types* ([§9.3.6](types.md#936-integral-types)) other than `char`.
Each enum type has a corresponding integral type called the ***underlying type*** of the enum type. This underlying type shall be able to represent all the enumerator values defined in the enumeration. If the *enum_base* is present, it explicitly declares the underlying type. The underlying type shall be one of the *integral types* ([§9.3.6](types.md#936-integral-types)) other than `char`; specified either by keyword (*integral_type*), or by one of the full type names that the integral types alias ([§9.3.5](types.md#935-simple-types)) (*integral_type_name*).

> *Note*: Neither `char` nor `System.Char` can be used as an underlying type. *end note*
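
For illustration only (this snippet is not part of the change), here is a minimal sketch of the two forms of *enum_base* that the revised paragraph describes. The names `Color`, `Permissions`, and `EnumBaseDemo` are hypothetical.

```csharp
using System;

// Underlying type given by keyword (integral_type).
enum Color : byte { Red, Green, Blue }

// Underlying type given by full type name (integral_type_name).
enum Permissions : System.UInt16 { None = 0, Read = 1, Write = 2 }

// Not permitted: neither `char` nor `System.Char` may be an underlying type.
// enum Bad : char { X }

static class EnumBaseDemo
{
    static void Main()
    {
        Console.WriteLine(Enum.GetUnderlyingType(typeof(Color)));        // System.Byte
        Console.WriteLine(Enum.GetUnderlyingType(typeof(Permissions)));  // System.UInt16
    }
}
```

Both declarations name one of the permitted integral types; the grammar differs only in whether the type is written as a keyword or as a `System` type name.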

9 changes: 5 additions & 4 deletions standard/expressions.md
@@ -1130,6 +1130,8 @@ primary_no_array_creation_expression
;
```

> *Note*: These grammar rules are not ANTLR-ready as they are part of a set of mutually left-recursive rules (`primary_expression`, `primary_no_array_creation_expression`, `member_access`, `invocation_expression`, `element_access`, `post_increment_expression`, `post_decrement_expression`, `pointer_member_access` and `pointer_element_access`) which ANTLR does not handle. Standard techniques can be used to transform the grammar to remove the mutual left-recursion. This has not been done as not all parsing strategies require it (e.g. an LALR parser would not) and doing so would obfuscate the structure and description.
Comment (Contributor, Author):

This is the one set of mutually left-recursive rules being left in place. To factor it out, the rules in the list can be inlined into primary_no_array_creation_expression (see the Parse_CSharp.g4 file sent out). If you have an opinion on removing or leaving the MLR, please comment; if you would like to propose an alternative grammar, please do so!

Comment (Contributor):

I agree with your comment in the change that a refactoring would obscure the structure.

Comment (Member):

I agree as well.


*pointer_member_access* ([§23.6.3](unsafe-code.md#2363-pointer-member-access)) and *pointer_element_access* ([§23.6.4](unsafe-code.md#2364-pointer-element-access)) are only available in unsafe code ([§23](unsafe-code.md#23-unsafe-code)).

Primary expressions are divided between *array_creation_expression*s and *primary_no_array_creation_expression*s. Treating *array_creation_expression* in this way, rather than listing it along with the other simple expression forms, enables the grammar to disallow potentially confusing code such as
@@ -2330,14 +2332,13 @@ nameof_expression
;

named_entity
: simple_name
| named_entity_target '.' identifier type_argument_list?
: named_entity_target ('.' identifier type_argument_list?)*
Comment (Contributor, Author):

Trivial MLR removal

;

named_entity_target
: 'this'
: simple_name
| 'this'
| 'base'
| named_entity
| predefined_type
| qualified_alias_member
;
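
For illustration only (not part of the change), here is a minimal sketch of `nameof` operands that fit the rewritten *named_entity* rule: a *named_entity_target* followed by zero or more `'.' identifier type_argument_list?` parts. The class `Person` and its members are hypothetical.

```csharp
using System;
using System.Collections.Generic;

class Person
{
    public string Name = "";

    public void Show()
    {
        int local = 0;
        Console.WriteLine(nameof(local));            // simple_name, no '.' parts
        Console.WriteLine(nameof(this.Name));        // 'this' '.' identifier
        Console.WriteLine(nameof(int.MaxValue));     // predefined_type '.' identifier
        Console.WriteLine(nameof(List<int>.Count));  // simple_name (with type arguments) '.' identifier
    }

    static void Main() => new Person().Show();
}
```

In each case `nameof` yields the final identifier of its operand, so the sketch prints `local`, `Name`, `MaxValue`, and `Count`.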
58 changes: 28 additions & 30 deletions standard/grammar.md
@@ -508,7 +508,7 @@ namespace_name
type_name
: namespace_or_type_name
;

namespace_or_type_name
: identifier type_argument_list?
| namespace_or_type_name '.' identifier type_argument_list?
@@ -566,14 +566,18 @@ delegate_type

// Source: §9.3.1 General
value_type
: non_nullable_value_type
| nullable_value_type
;

non_nullable_value_type
: struct_type
| enum_type
;

struct_type
: type_name
| simple_type
| nullable_value_type
;

simple_type
@@ -604,18 +608,14 @@ floating_point_type
| 'double'
;

nullable_value_type
: non_nullable_value_type '?'
;

non_nullable_value_type
: type
;

enum_type
: type_name
;

nullable_value_type
: non_nullable_value_type '?'
;

// Source: §9.4.2 Type arguments
type_argument_list
: '<' type_arguments '>'
@@ -850,7 +850,6 @@ comma
: ','
;


// Source: §12.7.13 The sizeof operator
sizeof_expression
: 'sizeof' '(' unmanaged_type ')'
@@ -874,16 +873,15 @@ default_value_expression
nameof_expression
: 'nameof' '(' named_entity ')'
;

named_entity
: simple_name
| named_entity_target '.' identifier type_argument_list?
: named_entity_target ('.' identifier type_argument_list?)*
;

named_entity_target
: 'this'
: simple_name
| 'this'
| 'base'
| named_entity
| predefined_type
| qualified_alias_member
;
@@ -1387,7 +1385,7 @@ catch_clause
exception_specifier
: '(' type identifier? ')'
;

finally_clause
: 'finally' block
;
@@ -1532,7 +1530,7 @@ type_parameter_constraints_clauses
: type_parameter_constraints_clause
| type_parameter_constraints_clauses type_parameter_constraints_clause
;

type_parameter_constraints_clause
: 'where' type_parameter ':' type_parameter_constraints
;
@@ -1715,7 +1713,7 @@ property_modifier
| 'extern'
| unsafe_modifier // unsafe code support
;

property_body
: '{' accessor_declarations '}' property_initializer?
| '=>' expression ';'
@@ -1861,7 +1859,6 @@ operator_body
| ';'
;


// Source: §15.11.1 General
constructor_declaration
: attributes? constructor_modifier* constructor_declarator constructor_body
@@ -1972,7 +1969,7 @@ array_initializer
variable_initializer_list
: variable_initializer (',' variable_initializer)*
;

variable_initializer
: expression
| array_initializer
@@ -1998,13 +1995,11 @@ variant_type_parameter_list
: '<' variant_type_parameters '>'
;

// Source: §18.2.3.1 General
variant_type_parameters
: attributes? variance_annotation? type_parameter
| variant_type_parameters ',' attributes? variance_annotation? type_parameter
;

// Source: §18.2.3.1 General
variance_annotation
: 'in'
| 'out'
@@ -2038,7 +2033,6 @@ interface_property_declaration
: attributes? 'new'? type identifier '{' interface_accessors '}'
;

// Source: §18.4.3 Interface properties
interface_accessors
: attributes? 'get' ';'
| attributes? 'set' ';'
@@ -2062,7 +2056,12 @@ enum_declaration
;

enum_base
: ':' struct_type
: ':' integral_type
| ':' integral_type_name
;

integral_type_name
: type_name // restricted to one of System.{SByte,Byte,Int16,UInt16,Int32,UInt32,Int64,UInt64}
;

enum_body
@@ -2084,7 +2083,6 @@ enum_member_declarations
: enum_member_declaration (',' enum_member_declaration)*
;

// Source: §19.4 Enum members
enum_member_declaration
: attributes? identifier ('=' constant_expression)?
;
@@ -2093,7 +2091,7 @@ enum_member_declaration
delegate_declaration
: attributes? delegate_modifier* 'delegate' return_type identifier variant_type_parameter_list? '(' formal_parameter_list? ')' type_parameter_constraints_clause* ';'
;

delegate_modifier
: 'new'
| 'public'
@@ -2193,8 +2191,8 @@ unsafe_statement

// Source: §23.3 Pointer types
pointer_type
: unmanaged_type '*'
| 'void' '*'
: value_type '*'+
| 'void' '*'+
;

// Source: §23.6.2 Pointer indirection
6 changes: 2 additions & 4 deletions standard/lexical-structure.md
@@ -28,7 +28,7 @@ All terminal characters are to be understood as the appropriate Unicode characte

The lexical and syntactic grammars are presented in the ANTLR grammar tool's Extended Backus-Naur form.

While the ANTLR notation is used this Standard does not present a complete ANTLR-ready "reference grammar" for C#; writing a lexer and parser, either by hand or using a tool such as ANTLR, is outside the scope of a language specification. With that qualification, this Standard attempts to minimize the gap between the specified grammar and that required to build a lexer and parser in ANTLR with the notable exception of the preprocessor ([§7.5](lexical-structure.md#75-pre-processing-directives) which requires more substantial work to fit into the ANTLR model.
While the ANTLR notation is used this Standard does not present a complete ANTLR-ready "reference grammar" for C#; writing a lexer and parser, either by hand or using a tool such as ANTLR, is outside the scope of a language specification. With that qualification, this Standard attempts to minimize the gap between the specified grammar and that required to build a lexer and parser in ANTLR.

ANTLR distinguishes between lexical and syntactic, termed parser by ANTLR, grammars in its notation by starting lexical rules with an initial uppercase letter and parser rules with an initial lowercase letter.

@@ -40,8 +40,6 @@ The lexical grammar of C# is presented in [§7.3](lexical-structure.md#73-lexica

Many of the terminal symbols of the syntactic grammar are not defined explicitly as tokens in the lexical grammar. Rather advantage is taken of the ANTLR behavior that literal strings in the grammar are extracted as implicit lexical tokens; this allows keywords, operators, etc. to be represented in the grammar by their literal representation rather than a token name.

The same behavior is also used to simplify the lexical grammar, see [§7.3.1](lexical-structure.md#731-general).

Every compilation unit in a C# program shall conform to the *input* production of the lexical grammar ([§7.3.1](lexical-structure.md#731-general)).

### 7.2.4 Syntactic grammar
@@ -1071,7 +1069,7 @@ fragment PP_Endif
;
```

Conditional compilation directives shall be written as groups consisting of, in order, a `#if` directive, zero or more `#elif` directives, zero or one `#else` directive, and a `#endif` directive. Between the directives are ***conditional sections*** of source code. Each section is controlled by the immediately preceding directive. A conditional section may itself contain nested conditional compilation directives provided these directives form complete groups.
Conditional compilation directives shall be written in groups consisting of, in order, a `#if` directive, zero or more `#elif` directives, zero or one `#else` directive, and a `#endif` directive. Between the directives are ***conditional sections*** of source code. Each section is controlled by the immediately preceding directive. A conditional section may itself contain nested conditional compilation directives provided these directives form complete groups.

> *Example*: The following example illustrates how conditional compilation directives can nest:
> ```csharp
21 changes: 11 additions & 10 deletions standard/types.md
@@ -139,14 +139,18 @@ A value type is either a struct type or an enumeration type. C# provides a set o

```ANTLR
value_type
: non_nullable_value_type
Comment (Contributor, Author):

The rules here formed an MLR set; just a bit of re-org fixed that.

| nullable_value_type
;

non_nullable_value_type
: struct_type
| enum_type
;

struct_type
: type_name
| simple_type
| nullable_value_type
;

simple_type
@@ -177,17 +181,13 @@ floating_point_type
| 'double'
;

nullable_value_type
: non_nullable_value_type '?'
;

non_nullable_value_type
: type
;

enum_type
: type_name
;

nullable_value_type
: non_nullable_value_type '?'
;
```

Unlike a variable of a reference type, a variable of a value type can contain the value `null` only if the value type is a nullable value type [§9.3.11](types.md#9311-nullable-value-types). For every non-nullable value type there is a corresponding nullable value type denoting the same set of values plus the value `null`.
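
For illustration only (not part of the change), here is a minimal sketch of the distinction this paragraph draws between a non-nullable value type and its corresponding nullable value type. The names `NullableDemo`, `plain`, and `maybe` are hypothetical.

```csharp
using System;

static class NullableDemo
{
    static void Main()
    {
        int plain = 5;        // non_nullable_value_type: cannot hold null
        int? maybe = null;    // nullable_value_type: non_nullable_value_type '?'

        Console.WriteLine(maybe.HasValue);   // False
        maybe = plain;                       // every int value is also an int? value
        Console.WriteLine(maybe.HasValue);   // True
    }
}
```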
@@ -612,7 +612,8 @@ Because of this equivalence, the following holds:

```ANTLR
unmanaged_type
: type
: value_type
| pointer_type // unsafe code support
;
```

4 changes: 2 additions & 2 deletions standard/unsafe-code.md
@@ -99,8 +99,8 @@ A *pointer_type* is written as an *unmanaged_type* ([§9.8](types.md#98-unmanage

```ANTLR
pointer_type
: unmanaged_type '*'
| 'void' '*'
: value_type ('*')+
| 'void' ('*')+
;
```
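
For illustration only (not part of the change), here is a minimal sketch of declarations that the `('*')+` repetition in the revised *pointer_type* rule covers, including multiple levels of indirection. It assumes unsafe code is enabled for the compilation; the names `PointerDemo`, `value`, `p`, `pp`, `raw`, and `rawRef` are hypothetical.

```csharp
// Assumes the project is compiled with unsafe code enabled.
static class PointerDemo
{
    static unsafe void Main()
    {
        int value = 42;
        int* p = &value;       // value_type followed by one '*'
        int** pp = &p;         // value_type followed by two '*'
        void* raw = p;         // 'void' followed by one '*'
        void** rawRef = &raw;  // 'void' followed by two '*'

        System.Console.WriteLine(**pp);  // dereferences twice to reach `value`
    }
}
```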
