
Commit 11fba02

[RFC] Clarify and restrict unicode support
This proposal alters the parser grammar to be more specific about which Unicode characters are allowed as source, restricts the characters interpreted as white space or line breaks, and adds a non-normative note clarifying how line breaks relate to line numbers in error reporting.
1 parent 8516a85 commit 11fba02

3 files changed: 78 additions and 30 deletions

spec/Appendix A -- Notation Conventions.md

Lines changed: 8 additions & 5 deletions
@@ -166,13 +166,16 @@ Example_param :
 This specification describes the semantic value of many grammar productions in
 the form of a list of algorithmic steps.
 
-For example, this describes how a parser should interpret a Unicode escape
-sequence which appears in a string literal:
+For example, this describes how a parser should interpret a string literal:
 
-EscapedUnicode :: u /[0-9A-Fa-f]{4}/
+StringValue :: `""`
 
-* Let {codePoint} be the number represented by the four-digit hexadecimal sequence.
-* The string value is the Unicode character represented by {codePoint}.
+* Return an empty Unicode character sequence.
+
+StringValue :: `"` StringCharacter+ `"`
+
+* Return the Unicode character sequence of all {StringCharacter}
+Unicode character values.
 
 
 ## Algorithms

spec/Appendix B -- Grammar Summary.md

Lines changed: 9 additions & 11 deletions
@@ -1,31 +1,29 @@
 # B. Appendix: Grammar Summary
 
-SourceCharacter :: "Any Unicode code point"
+SourceCharacter :: /[\u0009\u000A\u000D\u0020-\uFFFF]/
 
 
 ## Ignored Tokens
 
 Ignored ::
+- UnicodeBOM
 - WhiteSpace
 - LineTerminator
 - Comment
 - Comma
 
+UnicodeBOM :: "Byte Order Mark (U+FEFF)"
+
 WhiteSpace ::
 - "Horizontal Tab (U+0009)"
-- "Vertical Tab (U+000B)"
-- "Form Feed (U+000C)"
 - "Space (U+0020)"
-- "No-break Space (U+00A0)"
 
 LineTerminator ::
 - "New Line (U+000A)"
-- "Carriage Return (U+000D)"
-- "Line Separator (U+2028)"
-- "Paragraph Separator (U+2029)"
+- "Carriage Return (U+000D)" [ lookahead ! "New Line (U+000A)" ]
+- "Carriage Return (U+000D)" "New Line (U+000A)"
 
-Comment ::
-- `#` CommentChar*
+Comment :: `#` CommentChar*
 
 CommentChar :: SourceCharacter but not LineTerminator
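
As an illustration of the narrowed {SourceCharacter} production above (not part of the commit), the following minimal Python sketch checks a document against the new character class. The function name `first_invalid_offset` is hypothetical. It also assumes that each Python code point counts as one source character; an implementation working on UTF-16 code units would treat characters above U+FFFF as surrogate pairs and could reach a different result for them.

```python
import re

# Character class copied from the new SourceCharacter production.
SOURCE_CHARACTER = re.compile(r"[\u0009\u000A\u000D\u0020-\uFFFF]")


def first_invalid_offset(source: str):
    """Return the offset of the first character outside SourceCharacter,
    or None if every character is allowed (hypothetical helper)."""
    for offset, ch in enumerate(source):
        if not SOURCE_CHARACTER.fullmatch(ch):
            return offset
    return None
```

Under this sketch, a Vertical Tab (U+000B) or Form Feed (U+000C) is rejected as source text, while the Byte Order Mark (U+FEFF) still falls inside the allowed range.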

@@ -76,10 +74,10 @@ StringValue ::
 
 StringCharacter ::
 - SourceCharacter but not `"` or \ or LineTerminator
-- \ EscapedUnicode
+- \u EscapedUnicode
 - \ EscapedCharacter
 
-EscapedUnicode :: u /[0-9A-Fa-f]{4}/
+EscapedUnicode :: /[0-9A-Fa-f]{4}/
 
 EscapedCharacter :: one of `"` \ `/` b f n r t

spec/Section 2 -- Language.md

Lines changed: 61 additions & 14 deletions
@@ -13,45 +13,61 @@ double-colon `::`).
 
 ## Source Text
 
-SourceCharacter :: "Any Unicode character"
+SourceCharacter :: /[\u0009\u000A\u000D\u0020-\uFFFF]/
 
 GraphQL documents are expressed as a sequence of
 [Unicode](http://unicode.org/standard/standard.html) characters. However, with
-few exceptions, most of GraphQL is expressed only in the original ASCII range
-so as to be as widely compatible with as many existing tools, languages, and
-serialization formats as possible. Other than within comments, Non-ASCII Unicode
-characters are only found within {StringValue}.
+few exceptions, most of GraphQL is expressed only in the original non-control
+ASCII range so as to be as widely compatible with as many existing tools,
+languages, and serialization formats as possible and avoid display issues in
+text editors and source control.
+
+
+### Unicode
+
+UnicodeBOM :: "Byte Order Mark (U+FEFF)"
+
+Non-ASCII Unicode characters may freely appear within {StringValue} and
+{Comment} portions of GraphQL.
+
+The "Byte Order Mark" is a special Unicode character which
+may appear at the beginning of a file containing Unicode which programs may use
+to determine the fact that the text stream is Unicode, what endianness the text
+stream is in, and which of several Unicode encodings to interpret.
 
 
 ### White Space
 
 WhiteSpace ::
 - "Horizontal Tab (U+0009)"
-- "Vertical Tab (U+000B)"
-- "Form Feed (U+000C)"
 - "Space (U+0020)"
-- "No-break Space (U+00A0)"
 
 White space is used to improve legibility of source text and act as separation
 between tokens, and any amount of white space may appear before or after any
 token. White space between tokens is not significant to the semantic meaning of
 a GraphQL query document, however white space characters may appear within a
 {String} or {Comment} token.
 
+Note: GraphQL intentionally does not consider Unicode "Zs" category characters
+as white-space, avoiding misinterpretation by text editors and source
+control tools.
 
 ### Line Terminators
 
 LineTerminator ::
 - "New Line (U+000A)"
-- "Carriage Return (U+000D)"
-- "Line Separator (U+2028)"
-- "Paragraph Separator (U+2029)"
+- "Carriage Return (U+000D)" [ lookahead ! "New Line (U+000A)" ]
+- "Carriage Return (U+000D)" "New Line (U+000A)"
 
 Like white space, line terminators are used to improve the legibility of source
 text, any amount may appear before or after any other token and have no
 significance to the semantic meaning of a GraphQL query document. Line
 terminators are not found within any other token.
 
+Note: Any error reporting which provide the line number in the source of the
+offending syntax should use the preceding amount of {LineTerminator} to produce
+the line number.
+
 
 ### Comments
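
The note on error reporting in this hunk can be read as "line number = 1 + number of {LineTerminator} matches before the offending position". A minimal Python sketch of that reading follows (not part of the commit). The name `line_number` and the 1-based numbering are assumptions made for illustration.

```python
import re

# The three LineTerminator alternatives from the new grammar. "\r\n" is
# listed first so a Carriage Return followed by a New Line is consumed as
# a single terminator, mirroring the lookahead restriction.
LINE_TERMINATOR = re.compile(r"\r\n|\r|\n")


def line_number(source: str, offset: int) -> int:
    """1-based line number of the character at `offset` (hypothetical helper)."""
    return 1 + len(LINE_TERMINATOR.findall(source[:offset]))
```

With this counting, a file using Windows-style `\r\n` line endings reports the same line numbers as the same file saved with `\n` endings.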

@@ -101,9 +117,11 @@ defined here in a lexical grammar by patterns of source Unicode characters.
 Tokens are later used as terminal symbols in a GraphQL query document syntactic
 grammars.
 
+
 ### Ignored Tokens
 
 Ignored ::
+- UnicodeBOM
 - WhiteSpace
 - LineTerminator
 - Comment
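
To illustrate the effect of adding {UnicodeBOM} to {Ignored} (not part of the commit), here is a minimal Python sketch of a lexer step that skips ignored tokens. The function name `skip_ignored` is hypothetical, and the inclusion of Comma follows the {Ignored} list in the Appendix B grammar summary.

```python
def skip_ignored(source: str, pos: int = 0) -> int:
    """Return the offset of the first character at or after `pos` that is
    not part of an Ignored token (hypothetical helper)."""
    n = len(source)
    while pos < n:
        ch = source[pos]
        # UnicodeBOM, WhiteSpace (tab, space), LineTerminator parts, Comma.
        if ch in "\uFEFF\t \n\r,":
            pos += 1
        elif ch == "#":
            # Comment: runs up to, but not including, the next LineTerminator.
            while pos < n and source[pos] not in "\r\n":
                pos += 1
        else:
            break
    return pos
```

Because No-break Space (U+00A0) is no longer listed under {WhiteSpace}, this sketch stops at it instead of skipping it.
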
@@ -639,17 +657,46 @@ StringValue ::
 
 StringCharacter ::
 - SourceCharacter but not `"` or \ or LineTerminator
-- \ EscapedUnicode
+- \u EscapedUnicode
 - \ EscapedCharacter
 
-EscapedUnicode :: u /[0-9A-Fa-f]{4}/
+EscapedUnicode :: /[0-9A-Fa-f]{4}/
 
 EscapedCharacter :: one of `"` \ `/` b f n r t
 
-Strings are lists of characters wrapped in double-quotes `"`. (ex.
+Strings are sequences of characters wrapped in double-quotes (`"`). (ex.
 `"Hello World"`). White space and other otherwise-ignored characters are
 significant within a string value.
 
+Note: Unicode characters are allowed within String value literals, however
+GraphQL source must not contain some ASCII control characters so escape
+sequences must be used to represent these characters.
+
+**Semantics**
+
+StringValue :: `""`
+
+* Return an empty Unicode character sequence.
+
+StringValue :: `"` StringCharacter+ `"`
+
+* Return the Unicode character sequence of all {StringCharacter}
+Unicode character values.
+
+StringCharacter :: SourceCharacter but not `"` or \ or LineTerminator
+
+* Return the character value of {SourceCharacter}.
+
+StringCharacter :: \u EscapedUnicode
+
+* Return the character value represented by the UTF16 hexidecimal
+identifier {EscapedUnicode}.
+
+StringCharacter :: \ EscapedCharacter
+
+* Return the character value of {EscapedCharacter}.
+
+
 #### Enum Value
 
 EnumValue : Name but not `true`, `false` or `null`
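
The **Semantics** steps added for {StringValue} can be sketched in Python as follows (not part of the commit). The function name `string_value` is hypothetical. The mapping from {EscapedCharacter} to concrete characters is an assumption that follows the usual JSON convention, since the diff names the escape letters but does not spell out their values.

```python
import re

# Assumed JSON-style meanings for EscapedCharacter (not stated in the diff).
ESCAPED_CHARACTER = {
    '"': '"', "\\": "\\", "/": "/",
    "b": "\b", "f": "\f", "n": "\n", "r": "\r", "t": "\t",
}
HEX4 = re.compile(r"[0-9A-Fa-f]{4}")  # EscapedUnicode


def string_value(literal: str) -> str:
    """Evaluate a double-quoted string literal (hypothetical helper)."""
    if len(literal) < 2 or literal[0] != '"' or literal[-1] != '"':
        raise ValueError("not a double-quoted string literal")
    body, out, i = literal[1:-1], [], 0
    while i < len(body):
        ch = body[i]
        if ch == '"' or ch in "\r\n":
            raise ValueError("unescaped quote or line terminator")
        if ch != "\\":
            out.append(ch)  # StringCharacter :: SourceCharacter
            i += 1
            continue
        nxt = body[i + 1:i + 2]
        if nxt == "u":  # StringCharacter :: \u EscapedUnicode
            digits = body[i + 2:i + 6]
            if not HEX4.fullmatch(digits):
                raise ValueError("\\u needs four hexadecimal digits")
            out.append(chr(int(digits, 16)))
            i += 6
        elif nxt in ESCAPED_CHARACTER:  # StringCharacter :: \ EscapedCharacter
            out.append(ESCAPED_CHARACTER[nxt])
            i += 2
        else:
            raise ValueError("unknown escape sequence")
    return "".join(out)
```

For example, `string_value('""')` yields the empty sequence described by the first rule, and a literal containing the six source characters `\u0041` yields the single character `A`.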
