Skip to content

Document the design of quoted literals #477

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 32 commits into from
Nov 27, 2023
Merged
Changes from all commits
Commits
Show all changes
32 commits
Select commit Hold shift + click to select a range
681963f
Document the design of quoted literals
stasm Sep 17, 2023
6fd7cc1
style: Apply Prettier
github-actions[bot] Sep 17, 2023
1fe4a4c
Rename file to match the PR
stasm Sep 17, 2023
10f7354
Add metadata
stasm Sep 17, 2023
d37b893
Address Eemeli's review comments
stasm Sep 18, 2023
49e5f0b
Split "Dual quoting" from "Use quotation marks"
eemeli Sep 18, 2023
0ae3235
Apply suggestions by Richard
stasm Sep 18, 2023
4aea744
style: Apply Prettier
github-actions[bot] Sep 18, 2023
4b84ef9
Use a modified RFC 3339 datetime format
stasm Sep 19, 2023
19b9e83
Make the note a note
stasm Sep 19, 2023
b0dddf0
Add more examples of exotic variant keys
stasm Sep 19, 2023
d5975c4
quotes are not automatically paired
stasm Sep 19, 2023
dafa8b7
Use sembrs
stasm Sep 20, 2023
842ccda
Add a comparison table
stasm Sep 24, 2023
b92eba1
Change angle brackets' r2 from POOR to FAIR
stasm Sep 24, 2023
49dd24d
Use single quotes for the JS example
stasm Sep 24, 2023
bed3efe
Update exploration/0477-quoted-literals.md
stasm Nov 10, 2023
d65481c
Drop the PR number from the filename
stasm Nov 9, 2023
657e0ff
Update to the current syntax
stasm Nov 10, 2023
670bd09
Address some of review comments
stasm Nov 10, 2023
496a409
Add alternative to allow either ', ", or | as delimiters
eemeli Nov 10, 2023
d2a3b44
Clarify r3; add r6
stasm Nov 13, 2023
9495a96
Apply suggestions from code review
stasm Nov 13, 2023
f82ba41
Fix formatting
stasm Nov 13, 2023
89494c5
Factor r6 back into r2; apply @eemeli's suggestion about other reason…
stasm Nov 17, 2023
815b822
Factor r3 into r1; rename other reqs
stasm Nov 17, 2023
de03143
Add r5 about one way of doing things
stasm Nov 17, 2023
3d3798b
Add priorities to the table
stasm Nov 17, 2023
b9c149c
Fix the Zen of Python quote
stasm Nov 17, 2023
1f3f1b9
Apply suggestions from code review
eemeli Nov 20, 2023
e15390a
Use a Markdown table
stasm Nov 27, 2023
709318a
Update exploration/quoted-literals.md
aphillips Nov 27, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
345 changes: 345 additions & 0 deletions exploration/quoted-literals.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,345 @@
# Quoted Literals

Status: **Proposed**

<details>
<summary>Metadata</summary>
<dl>
<dt>Pull Request</dt>
<dd><a href="https://github.com/unicode-org/message-format-wg/pull/477">#477</a></dd>
</dl>
</details>

## Objective

Document the rationale for including quoted literals in MessageFormat
and for delimiting them with the vertical line character, `|`.

## Background

MessageFormat allows both quoted and unquoted literals.
Unquoted literals satisfy many common use-cases for literals:
they are sufficient to represent numbers
and single-word option values and variant keys.
Quoted literals are helpful in exotic use-cases.

In early drafts of the MessageFormat syntax,
quoted literals used to be delimited first with quotation marks (`"foo bar"`),
and then with round parentheses, e.g. `(foo bar)`.
See [#263](https://github.com/unicode-org/message-format-wg/issues/263).

[#414](https://github.com/unicode-org/message-format-wg/pull/414) proposed to revert these changes
and go back to using single and/or double quotes as delimiters.
The propsal was rejected.
This document is an artifact of that rejection.

## Use-Cases

_What use-cases do we see? Ideally, quote concrete examples._

In general, quoted literals are useful for:

1. encoding literals containing whitespace, like literals consisting of multiple words,
1. encoding literals containing exotic characters that do not conform to the `unquoted` production in ABNF.

More specifically:

- Message authors and translators need to be able to use the apostrophe in the message content,
and may want to use the single quote character
to represent it instead of the typograhic (curly) apostrophe.

> ```
> …{|New Year's Eve|}…
> ```

- Message authors may want to use literals to define locale-aware dates as literals in a modified RFC 3339 format:

> ```
> The Unix epoch is defined as {|1970-01-01 00:00:00Z| :datetime}.
> ```

- Message authors may want to use multiple words as values of certain options passed to custom functions and markup elements:

> ```
> {+button title=|Click here!|}Submit{-button}
> ```

> [!NOTE]
> Quoted literals are not evaluated as part of a pattern or option sequence.
> This means that their contents cannot be dynamic.
> ```
> -- The "title" contains the string "{$userName}"
> {+button title=|Goodbye, {$userName}!|}Sign out{-button}
> ```

- Selector function implementers might need to match different string values
such as those present in data values.
These might include keys containing arbitrary text, multiple words,
or other sequences not otherwise permitted in the syntax.

> ```
> {{ match {$count :choice}
> when |<10| {{A handful.}}
> when |11..19| {{Umpteen.}}
> when * {{Lots.}}
> }}
>
> {{ match {$arbitraryString}
> when |can't resolve| {{Can't resolve!}}
> when |11'233.44| {{Locale formatted number}}
> when |New York| {{A multi-word proper name}}
> when * {{Imagine more...}}
> }}
> ```

- Message authors may want to protect untranslatable strings:

> ```
> Visit {|http://www.example.com| @translate=false}.
> ```
>
> See the [expression attributes design proposal](https://github.com/unicode-org/message-format-wg/blob/main/exploration/0002-expression-attributes.md).

- Message authors may want to decorate substrings as being written in a particular language,
different from the message's language,
for the purpose of accessibility, text-to-speech, and semantic correctness.

> ```
> The official native name of the Republic of Poland is {|Rzeczpospolita Polska| @lang=pl}.
> ```
>
> See the [expression attributes design proposal](https://github.com/unicode-org/message-format-wg/blob/main/exploration/0002-expression-attributes.md).

- Developers may want to embed messages with quoted literals in code written in another programming language
which uses single or double quotes to delimit strings.

> ```js
> let message = new MessageFormat('en', 'A message with {|a literal|}.');
> ```

- Developers and localization engineers may want to embed messages with quoted literals in a container format, such as JSON.

> ```json
> {
> "msg": "A message with {|a literal|}."
> }
> ```

## Requirements

_What properties does the solution have to manifest to enable the use-cases above?_

- **[r1; high priority]** Minimize the need to escape characters inside literals.
In particular, choose a delimiter that isn't frequently used in translation content.
Having to escape characters inside literals is inconvenient and error-prone when done by hand,
and it also introduces the backslash `\` into the message as the escape introducer.
When the message is embedded in code or containers, the backslash then needs to be escaped too;
this is how some syntaxes produce the gnarly `\\\`.

By minimizing the need to escape characters,
we also minimze the incentive to _avoid_ escaping by changing translation content,
e.g. by rephrasing content or by using typographic punctuation marks.

- **[r2; medium priority]** Minimize the need to escape characters or change the host format's string delimiters when embedding messages in code or containers.
In particular, choose a delimiter that isn't frequently used as a string delimiter in programming languages and container formats.

This requirement is scored as _medium_, because many storage formats don't use delimiters at all (`.properties`, YAML),
or they are meant to be primarily used by machines (JSON),
and because many programming languages provide a way to delimit _raw strings_,
e.g. via <code>``</code> in JavaScript and `"""` in Python.
Also, messages including e.g. newlines or `\` escapes in their source
will likely need those characters accounted for when dropping them into new host formats.

- **[r3; medium/high priority]** Do not surprise users with syntax that's too exotic.
We expect quoted literals to be rare,
which means fewer opportunities to get used to their syntax and remember it.

- **[r4; low priority]** Be able to pair the opening and the closing delimiter,
to aid parsers recover from syntax errors,
and to leverage IDE's ability to highlight matching pairs of delimiters,
to visually indicate to the user editing a message the bounds of the literal under caret.
However, quoted literals are usually short and already enclosed in a placeholder (which has its own delimiters)
or are outside patterns (when used as variant keys).

<details>
<summary>How can paired delimiters improve parsing recovery?</summary>
If both paired delimiters are made special in the literal,
i.e. both the opening and the closing delimiter require escaping inside the literal to be part of its contents,
then the start of another literal can be an anchor point for a parser to stop parsing and attempt to rewind and recover.

```
There {:is a=|broken literal=|here|}
^ ^
The closing delimiter is missing here.
The syntax error occurs here.
There {:is a=[broken literal=[here]}
^^ ^
The closing delimiter is missing here.
| The parser can recognize a new literal here...
and rewind to here.
```
</details>

- **[r5; low priority]** Do not require users to choose between too many syntax options.
> There should be one — and preferably only one — obvious way to do it.<br>
> —_[The Zen of Python](https://peps.python.org/pep-0020/)_

## Constraints

_What prior decisions and existing conditions limit the possible design?_

- **[c1]** MessageFormat uses the backslash, `\`,
as the escape sequence introducer.

- **[c2]** Straight quotation marks, `'` and `"`,
are common in content across many languages,
even if other Unicode codepoints should be used in well-formatted text.

- **[c3]** Straight quotation marks, `'` and `"`,
are common as string delimiters in many programming languages.

## Proposed Design

_Describe the proposed solution. Consider syntax, formatting, errors, registry, tooling, interchange._

Use the vertical line character, `|`, to delimit quoted strings.
The vertical line is rarely found in text content,
and it has sufficiently good delimitation properties.

> ```
> {The Unix epoch is defined as {|Thu, 01 Jan 1970 00:00:00 GMT| :datetime}.}
> ```

```abnf
literal = quoted / unquoted
quoted = "|" *(quoted-char / quoted-escape) "|"
quoted-char = %x0-5B ; omit \
/ %x5D-7B ; omit |
/ %x7D-D7FF ; omit surrogates
/ %xE000-10FFFF
quoted-escape = backslash ( backslash / "|" )
```

By being both uncommon in text content and uncommon as a string delimiter in other programming languages,
the vertical line sidesteps the "inwards" and "outwards" problems of escaping.

- [r1 GOOD] Writing `"` and `'` in literals doesn't require escaping them via `\`.
This means no extra `\` that need escaping.
- [r2 GOOD] Embedding messages in most code or containers doesn't require escaping the literal delimiters.
- [r3 POOR/FAIR] Vertical lines are not commonly used as string delimiters
and thus can be harder to learn for beginners.
Vertical bars can be used as a separator in [delimiter-separated data formats](http://www.catb.org/~esr/writings/taoup/html/ch05s02.html).
However, typically vertical lines tend to be used as delimiters for *separating* rather than for *enclosing*.
- [r4 POOR] Vertical lines are not automatically paired by parsers nor IDEs.

## Alternatives Considered

_What other solutions are available?_
_How do they compare against the requirements?_
_What other properties they have?_

### [a1] Use quotation marks

Early drafts of the syntax specification used double quotes to delimit literals.
This changed in [#263](https://github.com/unicode-org/message-format-wg/issues/263#issue-1233590015).

- [r1 POOR] Writing `"` and `'` in literals requires escaping them via `\`,
which then needs to be escaped itself in code
which uses `\` as the escape character (which is common).
- [r2 FAIR] Embedding messages in certain programming languages and containers requires escaping the literal delimiters.
Most notably, storing MF2 messages in JSON suffers from this.
In many programming languages, however, alternatives to quotation marks exist,
which could be used to allow unescaped quotes in messages.
See [comment on #263](https://github.com/unicode-org/message-format-wg/issues/263#issuecomment-1430929542).
- [r3 GOOD] Quotation marks are universally recognized as string delimiters.
- [r4 FAIR] Quotation marks are not automatically paired by parsers nor IDEs,
but many text editors provide features to make working with and around quotes easier.

### [a2] Dual quoting

PR [#414](https://github.com/unicode-org/message-format-wg/pull/414) proposes to
allow either single quotes `'` or double quotes `"` as literal delimiters,
a variant of the "Use quotation marks" solution.

- [r1 FAIR] Writing `"` and `'` in literals doesn't require escaping them via `\`,
as long as they do not match the literal's delimiter.
Literals containing both `'` and `"` will need to have at least one of those characters
escaped via `\`, which may itself need escaping in the container format.
- [r2 GOOD] Embedding messages in certain container formats requires escaping the literal delimiters.
If the container format does not itself support dual quoting,
the embedded message's quotes may be adjusted to avoid their escaping.
- [r3 GOOD] Quotation marks are universally recognized as string delimiters.
- [r4 FAIR] Quotation marks cannot be paired by parsers nor IDEs,
but many text editors provide features to make working with and around quotes easier.

### [a3] Use round or angle brackets

- Round parentheses are very uncommon as string delimiters [r2 GOOD],
and thus may be surprising,
especially given the well-established meaning in prose [r4 POOR].
That said, there's prior art in using them for [delimiting strings in PostScript](https://en.wikipedia.org/wiki/PostScript#%22Hello_world%22).
Furthermore, they are relatively common in text, where they'd require escaping [r1 POOR].
- Angle brackets require escaping in XML-based storage formats [r2 FAIR].
- All brackets can be easily paired by parsers and IDEs [r5 GOOD].

### [a4] Change escape introducer

Changing the escape sequence introducer from backslash [c1] to another character
could help partially mitigate the burden of first escaping literal delimiters
and then escaping the escapes themselves [r1].
However, it wouldn't address other requirements and use-cases.

### [a5] Double delimiters to escape them

This is the approach taken by ICU MessageFormat 1.0 for quotes.
It allows literals to contain quotes [r1 GOOD]
at the expense of doubling the amount of escaping required when embedding messages in code [r2 POOR].

### [a6] Accept either `|` or quotes

Allow any of the following as literal delimiters:

- the vertical line character `|`
- single quotes `'`
- double quotes `"`

This approach supports multiple different quoting styles to be used for literals.
This flexibility allows for using a familiar and common style such as `'single'` or `"double"` quotes,
while also allowing for `|pipes|` when the message's contents or embedding would otherwise require additional escaping.
This means that literals could for example prefer `'single quotes'`,
but use `"double 'em"` if necessary,
or `|'pipe' characters|` if the whole message is wrapped in `"quotes"` due to the host format
or if the literal value contains both `'` and `"` quotes.

```abnf
literal = quoted / unquoted
quoted = "|" *(quoted-char / "'" / DQUOTE / quoted-escape) "|"
/ "'" *(quoted-char / DQUOTE / "|" / quoted-escape) "'"
/ DQUOTE *(quoted-char / "'" / "|" / quoted-escape) DQUOTE
quoted-char = %x0-21 ; omit "
/ %x23-26 ; omit '
/ %x28-5B ; omit \
/ %x5D-7B ; omit |
/ %x7D-D7FF ; omit surrogates
/ %xE000-10FFFF
quoted-escape = backslash ( backslash / "|" / "'" / DQUOTE )
```

- [r1 GOOD] Writing any two of `|`, `"` and `'` in literals doesn't require escaping them via `\`.
This means no extra `\` that need escaping.
Message don't have to be modified otherwise before embedding them,
unless they happen to contain conflicting quote delimiters.
Comment on lines +330 to +331
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually a con "in a trenchcoat" :-)

Messages MUST be modified if they "happen to contain conflicting quote delimiters", e.g. those that conflict with the syntax.

My big concern here is: how to do tools up and down the translation stack decide what quotes to use?

There should be a "con" somewhere for having multiple ways of doing the same thing.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's r6, and it's included in the table.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, sorry, typo above: I meant r5, the "one way" requirement.

- [r2 GOOD] Embedding messages in most code or containers doesn't require escaping the literal delimiters.
- [r3 GOOD] Quotation marks are universally recognized as string delimiters.
- [r4 FAIR] Using the same marks for quote-start and quote-end cannot be paired by parsers nor IDEs,
but many text editors provide features to make working with and around quotes easier.

## Comparison table

| | Priority | Proposal | [a1] | [a2] | [a3] | [a4] | [a5] | [a6] |
|-----------------------------|----------|:--------:|:----:|:----:|:----:|:----:|:----:|:----:|
| [r1] escape inside literals | HIGH | ++ | - | + | - | ++ | ++ | ++ |
| [r2] escape when embedding | MED | ++ | + | ++ | +/++ | | - | ++ |
| [r3] no surprises | MED/HIGH | -/+ | ++ | ++ | - | - | + | ++ |
| [r4] pair delimiters | LOW | - | + | + | ++ | | | + |
| [r5] one way | LOW | ++ | ++ | + | ++ | | | - |