[RFC] Clarify and restrict unicode support

leebyron · leebyron · commit 11fba0296096 · 2015-09-24T14:45:44.000-07:00
This proposal alters the parser grammar to be more specific about which Unicode characters are allowed as source, restricts which characters are interpreted as white space or line terminators, and clarifies line terminator behavior relative to error reporting with a non-normative note.
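
As a rough illustration of the intended lexer behavior, here is a minimal Python sketch. It is not taken from any reference implementation: the helper names (`first_disallowed`, `line_number_at`, `decode_unicode_escape`) are hypothetical, and the patterns are simply transcribed from the grammar in the diff below. It shows the restricted `SourceCharacter` range, CRLF counted as a single `LineTerminator` for line numbers, and a four-digit `\u` escape mapped to a character value.

```python
import re

# Transcribed from the proposed SourceCharacter production in this diff.
SOURCE_CHARACTER = re.compile(r"[\u0009\u000A\u000D\u0020-\uFFFF]")

# Proposed LineTerminator: CRLF as one unit, a lone CR, or a lone LF.
# Alternation order matters: "\r\n" must be tried before "\r".
LINE_TERMINATOR = re.compile(r"\r\n|\r|\n")

# Proposed EscapedUnicode: exactly four hexadecimal digits.
ESCAPED_UNICODE = re.compile(r"[0-9A-Fa-f]{4}")


def first_disallowed(source):
    """Return the index of the first character outside SourceCharacter, or None."""
    for index, char in enumerate(source):
        if not SOURCE_CHARACTER.fullmatch(char):
            return index
    return None


def line_number_at(source, offset):
    """Return the 1-based line number of `offset` by counting the
    LineTerminators that precede it (hypothetical helper)."""
    return 1 + len(LINE_TERMINATOR.findall(source[:offset]))


def decode_unicode_escape(hex_digits):
    """Return the character value for the digits following a backslash-u."""
    if not ESCAPED_UNICODE.fullmatch(hex_digits):
        raise ValueError("expected exactly four hexadecimal digits")
    return chr(int(hex_digits, 16))
```

With these helpers, a Vertical Tab (U+000B) in source would be reported as disallowed, while a leading Byte Order Mark would pass the `SourceCharacter` check and be left for the lexer to skip as an ignored token.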
diff --git a/spec/Appendix A -- Notation Conventions.md b/spec/Appendix A -- Notation Conventions.md
@@ -166,13 +166,16 @@ Example_param :
 This specification describes the semantic value of many grammar productions in
 the form of a list of algorithmic steps.
 
-For example, this describes how a parser should interpret a Unicode escape
-sequence which appears in a string literal:
+For example, this describes how a parser should interpret a string literal:
 
-EscapedUnicode :: u /[0-9A-Fa-f]{4}/
+StringValue :: `""`
 
-  * Let {codePoint} be the number represented by the four-digit hexadecimal sequence.
-  * The string value is the Unicode character represented by {codePoint}.
+  * Return an empty Unicode character sequence.
+
+StringValue :: `"` StringCharacter+ `"`
+
+  * Return the Unicode character sequence of all {StringCharacter}
+    Unicode character values.
 
 
 ## Algorithms
diff --git a/spec/Appendix B -- Grammar Summary.md b/spec/Appendix B -- Grammar Summary.md
@@ -1,31 +1,29 @@
 # B. Appendix: Grammar Summary
 
-SourceCharacter :: "Any Unicode code point"
+SourceCharacter :: /[\u0009\u000A\u000D\u0020-\uFFFF]/
 
 
 ## Ignored Tokens
 
 Ignored ::
+  - UnicodeBOM
   - WhiteSpace
   - LineTerminator
   - Comment
   - Comma
 
+UnicodeBOM :: "Byte Order Mark (U+FEFF)"
+
 WhiteSpace ::
   - "Horizontal Tab (U+0009)"
-  - "Vertical Tab (U+000B)"
-  - "Form Feed (U+000C)"
   - "Space (U+0020)"
-  - "No-break Space (U+00A0)"
 
 LineTerminator ::
   - "New Line (U+000A)"
-  - "Carriage Return (U+000D)"
-  - "Line Separator (U+2028)"
-  - "Paragraph Separator (U+2029)"
+  - "Carriage Return (U+000D)" [ lookahead ! "New Line (U+000A)" ]
+  - "Carriage Return (U+000D)" "New Line (U+000A)"
 
-Comment ::
-  - `#` CommentChar*
+Comment :: `#` CommentChar*
 
 CommentChar :: SourceCharacter but not LineTerminator
 
@@ -76,10 +74,10 @@ StringValue ::
 
 StringCharacter ::
   - SourceCharacter but not `"` or \ or LineTerminator
-  - \ EscapedUnicode
+  - \u EscapedUnicode
   - \ EscapedCharacter
 
-EscapedUnicode :: u /[0-9A-Fa-f]{4}/
+EscapedUnicode :: /[0-9A-Fa-f]{4}/
 
 EscapedCharacter :: one of `"` \ `/` b f n r t
 
diff --git a/spec/Section 2 -- Language.md b/spec/Section 2 -- Language.md
@@ -13,45 +13,61 @@ double-colon `::`).
 
 ## Source Text
 
-SourceCharacter :: "Any Unicode character"
+SourceCharacter :: /[\u0009\u000A\u000D\u0020-\uFFFF]/
 
 GraphQL documents are expressed as a sequence of
 [Unicode](http://unicode.org/standard/standard.html) characters. However, with
-few exceptions, most of GraphQL is expressed only in the original ASCII range
-so as to be as widely compatible with as many existing tools, languages, and
-serialization formats as possible. Other than within comments, Non-ASCII Unicode
-characters are only found within {StringValue}.
+few exceptions, most of GraphQL is expressed only in the original non-control
+ASCII range so as to be as widely compatible with as many existing tools,
+languages, and serialization formats as possible and avoid display issues in
+text editors and source control.
+
+
+### Unicode
+
+UnicodeBOM :: "Byte Order Mark (U+FEFF)"
+
+Non-ASCII Unicode characters may freely appear within {StringValue} and
+{Comment} portions of GraphQL.
+
+The "Byte Order Mark" is a special Unicode character which may appear at the
+beginning of a file containing Unicode. Programs may use it to determine that
+the text stream is Unicode, what endianness the text stream is in, and which of
+several Unicode encodings to interpret.
 
 
 ### White Space
 
 WhiteSpace ::
   - "Horizontal Tab (U+0009)"
-  - "Vertical Tab (U+000B)"
-  - "Form Feed (U+000C)"
   - "Space (U+0020)"
-  - "No-break Space (U+00A0)"
 
 White space is used to improve legibility of source text and act as separation
 between tokens, and any amount of white space may appear before or after any
 token. White space between tokens is not significant to the semantic meaning of
 a GraphQL query document, however white space characters may appear within a
 {String} or {Comment} token.
 
+Note: GraphQL intentionally does not consider Unicode "Zs" category characters
+as white-space, avoiding misinterpretation by text editors and source
+control tools.
 
 ### Line Terminators
 
 LineTerminator ::
   - "New Line (U+000A)"
-  - "Carriage Return (U+000D)"
-  - "Line Separator (U+2028)"
-  - "Paragraph Separator (U+2029)"
+  - "Carriage Return (U+000D)" [ lookahead ! "New Line (U+000A)" ]
+  - "Carriage Return (U+000D)" "New Line (U+000A)"
 
 Like white space, line terminators are used to improve the legibility of source
 text, any amount may appear before or after any other token and have no
 significance to the semantic meaning of a GraphQL query document. Line
 terminators are not found within any other token.
 
+Note: Any error reporting which provides the line number in the source of the
+offending syntax should use the number of preceding {LineTerminator} to produce
+the line number.
+
 
 ### Comments
 
@@ -101,9 +117,11 @@ defined here in a lexical grammar by patterns of source Unicode characters.
 Tokens are later used as terminal symbols in a GraphQL query document syntactic
 grammars.
 
+
 ### Ignored Tokens
 
 Ignored ::
+  - UnicodeBOM
   - WhiteSpace
   - LineTerminator
   - Comment
@@ -639,17 +657,46 @@ StringValue ::
 
 StringCharacter ::
   - SourceCharacter but not `"` or \ or LineTerminator
-  - \ EscapedUnicode
+  - \u EscapedUnicode
   - \ EscapedCharacter
 
-EscapedUnicode :: u /[0-9A-Fa-f]{4}/
+EscapedUnicode :: /[0-9A-Fa-f]{4}/
 
 EscapedCharacter :: one of `"` \ `/` b f n r t
 
-Strings are lists of characters wrapped in double-quotes `"`. (ex.
+Strings are sequences of characters wrapped in double-quotes (`"`). (ex.
 `"Hello World"`). White space and other otherwise-ignored characters are
 significant within a string value.
 
+Note: Unicode characters are allowed within String value literals; however,
+GraphQL source must not contain some ASCII control characters, so escape
+sequences must be used to represent these characters.
+
+**Semantics**
+
+StringValue :: `""`
+
+  * Return an empty Unicode character sequence.
+
+StringValue :: `"` StringCharacter+ `"`
+
+  * Return the Unicode character sequence of all {StringCharacter}
+    Unicode character values.
+
+StringCharacter :: SourceCharacter but not `"` or \ or LineTerminator
+
+  * Return the character value of {SourceCharacter}.
+
+StringCharacter :: \u EscapedUnicode
+
+  * Return the character value represented by the UTF-16 hexadecimal
+    identifier {EscapedUnicode}.
+
+StringCharacter :: \ EscapedCharacter
+
+  * Return the character value of {EscapedCharacter}.
+
+
 #### Enum Value
 
 EnumValue : Name but not `true`, `false` or `null`