Commit 6f5ad39

aphillips and eemeli authored
Address name and literal equality (#885)
* Address name and literal equality

  This change defines equality as discussed in the 2024-09-09 teleconference in the following ways:
  - It defines _name_ equality as being under NFC
  - It defines _literal_ equality as explicitly **not** under NFC
  - It moves _name_ before _identifier_ in that section of text to avoid a forward definition.

  Note that this deviates from discussion in 2024-09-09's call in that we didn't discuss literals at length. It also doesn't discuss non-name/non-literal values, which I'll point out are limited to ASCII sequences such as keywords.

* Typo fix
* Add a note about not requiring implementations to actually normalize
* Implement changes discussed in 2024-09-16 call.
  - Make _key_ require NFC for uniqueness/comparison
  - Add a note about NFC
  - Make _literal_ **_not_** define equality
  - Make text in _name_ identical to that in _key_ for consistency
* Update formatting.md to include keys in NFC
* Address comments
* Update spec/syntax.md

  Co-authored-by: Eemeli Aro <[email protected]>
* Update spec/syntax.md

  Co-authored-by: Eemeli Aro <[email protected]>

---------

Co-authored-by: Eemeli Aro <[email protected]>
1 parent 95ec6d5 commit 6f5ad39

2 files changed: +53 -16 lines changed

spec/formatting.md

Lines changed: 4 additions & 1 deletion
@@ -502,7 +502,7 @@ Next, using `res`, resolve the preferential order for all message keys:
 1. Let `key` be the `var` key at position `i`.
 1. If `key` is not the catch-all key `'*'`:
 1. Assert that `key` is a _literal_.
-1. Let `ks` be the resolved value of `key`.
+1. Let `ks` be the resolved value of `key` in Unicode Normalization Form C.
 1. Append `ks` as the last element of the list `keys`.
 1. Let `rv` be the resolved value at index `i` of `res`.
 1. Let `matches` be the result of calling the method MatchSelectorKeys(`rv`, `keys`)
@@ -516,6 +516,9 @@ The returned list MAY be empty.
 The most-preferred key is first,
 with each successive key appearing in order by decreasing preference.
 
+The resolved value of each _key_ MUST be in Unicode Normalization Form C ("NFC"),
+even if the _literal_ for the _key_ is not.
+
 If calling MatchSelectorKeys encounters any error,
 a _Bad Selector_ error is emitted
 and an empty list is returned.
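
As a rough illustration of the NFC requirement on resolved key values in spec/formatting.md above, the following Python sketch normalizes a key's literal value before it would be appended to `keys`. The helper name `resolve_key` is hypothetical and does not appear in the specification; only the standard-library call `unicodedata.normalize` is real.

```python
import unicodedata


def resolve_key(literal_value: str) -> str:
    """Return the resolved value of a key literal in NFC.

    Hypothetical helper: the spec only requires that the resolved
    value be in NFC, not that it be produced this way.
    """
    return unicodedata.normalize("NFC", literal_value)


# "café" written two ways: precomposed U+00E9 versus "e" + U+0301.
composed = "caf\u00e9"
decomposed = "cafe\u0301"

# The raw literals differ code point by code point ...
raw_equal = composed == decomposed
# ... but their resolved key values are identical.
resolved_equal = resolve_key(composed) == resolve_key(decomposed)
```

With this approach, two variants whose keys differ only in normalization form resolve to the same value, which is why the syntax change in this commit also defines key equality under NFC.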

spec/syntax.md

Lines changed: 49 additions & 15 deletions
@@ -444,6 +444,12 @@ A _key_ can be either a _literal_ value or the "catch-all" key `*`.
 The **_<dfn>catch-all key</dfn>_** is a special key, represented by `*`,
 that matches all values for a given _selector_.
 
+The value of each _key_ MUST be treated as if it were in
+[Unicode Normalization Form C](https://unicode.org/reports/tr15/) ("NFC").
+Two _keys_ are considered equal if they are canonically equivalent strings,
+that is, if they consist of the same sequence of Unicode code points after
+Unicode Normalization Form C has been applied to both.
+
 ## Expressions
 
 An **_<dfn>expression</dfn>_** is a part of a _message_ that will be determined
@@ -690,6 +696,20 @@ except for U+0000 NULL or the surrogate code points U+D800 through U+DFFF.
 
 All code points are preserved.
 
+> [!IMPORTANT]
+> Most text, including that produced by common keyboards and input methods,
+> is already encoded in the canonical form known as
+> [Unicode Normalization Form C](https://unicode.org/reports/tr15) ("NFC").
+> A few languages, legacy character encoding conversions, or operating environments
+> can result in _literal_ values that are not in this form.
+> Some uses of _literals_ in MessageFormat,
+> notably as the value of _keys_,
+> apply NFC to the _literal_ value during processing or comparison.
+> While there is no requirement that the _literal_ value actually be entered
+> in a normalized form,
+> users are cautioned to employ the same character sequences
+> for equivalent values and, whenever possible, ensure _literals_ are in NFC.
+
 A **_<dfn>quoted literal</dfn>_** begins and ends with U+007C VERTICAL LINE `|`.
 The characters `\` and `|` within a _quoted literal_ MUST be
 escaped as `\\` and `\|`.
@@ -714,21 +734,6 @@ number-literal = ["-"] (%x30 / (%x31-39 *DIGIT)) ["." 1*DIGIT] [%i"e" ["-" / "
 
 ### Names and Identifiers
 
-An **_<dfn>identifier</dfn>_** is a character sequence that
-identifies a _function_, _markup_, or _option_.
-Each _identifier_ consists of a _name_ optionally preceded by
-a _namespace_.
-When present, the _namespace_ is separated from the _name_ by a
-U+003A COLON `:`.
-Built-in _functions_ and their _options_ do not have a _namespace_ identifier.
-
-The _namespace_ `u` (U+0075 LATIN SMALL LETTER U)
-is reserved for future standardization.
-
-_Function_ _identifiers_ are prefixed with `:`.
-_Markup_ _identifiers_ are prefixed with `#` or `/`.
-_Option_ _identifiers_ have no prefix.
-
 A **_<dfn>name</dfn>_** is a character sequence used in an _identifier_
 or as the name for a _variable_
 or the value of an _unquoted literal_.
@@ -740,6 +745,20 @@ when matching _name_ or _identifier_ strings or _unquoted literal_ values.
 
 _Variable_ _names_ are prefixed with `$`.
 
+Two _names_ are considered equal if they are canonically equivalent strings,
+that is, if they consist of the same sequence of Unicode code points after
+[Unicode Normalization Form C](https://unicode.org/reports/tr15/) ("NFC")
+has been applied to both.
+
+> [!NOTE]
+> Implementations are not required to normalize all _names_.
+> Comparisons of _name_ values only need be done "as-if" normalization
+> has occurred.
+> Since most text in the wild is already in NFC
+> and since checking for NFC is fast and efficient,
+> implementations can often substitute checking for actually applying normalization
+> to _name_ values.
+
 Valid content for _names_ is based on <cite>Namespaces in XML 1.0</cite>'s
 [NCName](https://www.w3.org/TR/xml-names/#NT-NCName).
 This is different from XML's [Name](https://www.w3.org/TR/xml/#NT-Name)
@@ -751,6 +770,21 @@ Otherwise, the set of characters allowed in a _name_ is large.
 > Such variables cannot be referenced in a _message_,
 > but are not otherwise errors.
 
+An **_<dfn>identifier</dfn>_** is a character sequence that
+identifies a _function_, _markup_, or _option_.
+Each _identifier_ consists of a _name_ optionally preceded by
+a _namespace_.
+When present, the _namespace_ is separated from the _name_ by a
+U+003A COLON `:`.
+Built-in _functions_ and their _options_ do not have a _namespace_ identifier.
+
+The _namespace_ `u` (U+0075 LATIN SMALL LETTER U)
+is reserved for future standardization.
+
+_Function_ _identifiers_ are prefixed with `:`.
+_Markup_ _identifiers_ are prefixed with `#` or `/`.
+_Option_ _identifiers_ have no prefix.
+
 Examples:
 > A variable:
 >```
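
The "as-if" comparison described in the note on _names_ in spec/syntax.md can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not how any particular implementation works: the function name `names_equal` is hypothetical, and the sketch relies on the standard-library `unicodedata.is_normalized` check (available in Python 3.8 and later) to skip normalization when both inputs are already in NFC.

```python
import unicodedata


def names_equal(a: str, b: str) -> bool:
    """Compare two names as if both had been normalized to NFC.

    Hypothetical helper illustrating the "as-if" rule; the spec does
    not prescribe this particular strategy.
    """
    # Identical code point sequences are equal in every normalization form.
    if a == b:
        return True
    # If both are already in NFC and differ, normalizing cannot make them equal.
    if unicodedata.is_normalized("NFC", a) and unicodedata.is_normalized("NFC", b):
        return False
    # Otherwise fall back to actually normalizing both sides.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Precomposed U+00E9 versus "e" followed by U+0301 COMBINING ACUTE ACCENT.
same_name = names_equal("\u00e9", "e\u0301")
# U+FB01 LATIN SMALL LIGATURE FI is only compatibility-equivalent to "fi",
# so NFC does not unify them.
ligature_name = names_equal("\ufb01", "fi")
```

The fast path reflects the note's observation that most text is already in NFC, so the common case costs a check rather than a full normalization pass.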
