Add message parse mode (code vs text) design doc #474

Merged
merged 25 commits into from
Oct 25, 2023
Commits
6a9a2eb
Draft message parse mode (code vs text) design doc
eemeli Sep 13, 2023
2109334
Note potential conflict with unquoted string literals
eemeli Sep 13, 2023
77409a5
Update 0474-text-vs-code.md
aphillips Sep 13, 2023
b4b9654
style: Apply Prettier
github-actions[bot] Sep 13, 2023
7670caf
Update 0474-text-vs-code.md
aphillips Sep 13, 2023
b85c0a2
style: Apply Prettier
github-actions[bot] Sep 13, 2023
c248148
scratch pad use of this design document for purely evil reasons
aphillips Sep 13, 2023
e8c1a8f
style: Apply Prettier
github-actions[bot] Sep 13, 2023
5d3d37c
fixing the evil checklist
aphillips Sep 13, 2023
71bfd8e
style: Apply Prettier
github-actions[bot] Sep 13, 2023
cc20854
Remove checklist to the wiki
aphillips Sep 13, 2023
e48e340
style: Apply Prettier
github-actions[bot] Sep 13, 2023
64a66f2
Tweak requirements and use cases
aphillips Sep 22, 2023
d4757b0
style: Apply Prettier
github-actions[bot] Sep 22, 2023
96e3d7e
Proposing a design
aphillips Sep 22, 2023
ebc8016
style: Apply Prettier
github-actions[bot] Sep 22, 2023
cd0a12e
Typo in `match`
aphillips Sep 22, 2023
9fc96f5
Update exploration/0474-text-vs-code.md
eemeli Sep 23, 2023
6aaf850
Merge branch 'main' into text-vs-code
eemeli Oct 24, 2023
cba607a
Rename exploration/0474-text-vs-code.md -> exploration/text-vs-code.md
eemeli Oct 24, 2023
84879bb
Drop the "highly experimental" section documenting a Slack conversation
eemeli Oct 24, 2023
28fdb04
Update alternatives, dropping explicit syntaxes
eemeli Oct 24, 2023
ff50c23
Apply suggestions from code review
eemeli Oct 24, 2023
f898db6
Add "Start in text, encapsulate code, do not trim"
eemeli Oct 24, 2023
692c466
Refer to authors/developers/translators, not "users"
eemeli Oct 24, 2023
238 changes: 238 additions & 0 deletions exploration/text-vs-code.md
@@ -0,0 +1,238 @@
# Message Parse Mode

Status: **Proposed**

<details>
<summary>Metadata</summary>
<dl>
<dt>Contributors</dt>
<dd>@eemeli</dd>
<dd>@aphillips</dd><!-- Seville and other inserted edits -->
<dt>First proposed</dt>
<dd>2023-09-13</dd>
<dt>Pull Request</dt>
<dd><a href="https://github.com/unicode-org/message-format-wg/pull/474">#474</a></dd>
</dl>
</details>

## Objective

Decide whether text patterns or code statements should be enclosed in MF2.

Member

I don't think this is that clear.

Suggested change
- Decide whether text patterns or code statements should be enclosed in MF2.
+ Decide how to segregate and identify between _pattern_ text and code statements in MF2.
+ This includes whether parsing a message expects to start with _pattern_ text or with code statements.

Collaborator Author

The "how" part would be an extension of what the design doc is currently doing. Is that intentional?

Member

You're right, but this isn't about "whether text patterns or code statements should be enclosed" but rather about which should be enclosed and when. And the design decision is really about the general syntax (text-vs-code and trimmed-vs-untrimmed-vs-quoted)


## Background

Existing message and template formatting languages tend to start in "text" mode,
and require special syntax like `{{` or `{%` to enter "code" mode.
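
As a rough illustration of what "starting in text mode" means, the following minimal sketch tokenizes a template in which everything is text unless it is wrapped in `{{…}}`. The delimiter choice and the `tokenize` helper are assumptions made for this example only; they do not reproduce any particular template engine.

```python
import re

# Hypothetical delimiter choice: `{{…}}` marks "code", everything else is "text".
CODE = re.compile(r"\{\{(.*?)\}\}", re.DOTALL)


def tokenize(template: str) -> list[tuple[str, str]]:
    """Split a template into ("text", …) and ("code", …) tokens."""
    tokens = []
    # With one capturing group, re.split alternates text and captured code.
    for index, piece in enumerate(CODE.split(template)):
        if index % 2 == 1:
            tokens.append(("code", piece.strip()))
        elif piece:
            tokens.append(("text", piece))
    return tokens


print(tokenize("Hello, {{ name }}!"))
# [('text', 'Hello, '), ('code', 'name'), ('text', '!')]
```

A message without any delimiters comes back as a single text token, which is the property that makes plain strings valid templates in these languages.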

ICU MessageFormat and Fluent both support inline selectors
separated from the text using `{…}` for multi-variant messages.
ICU MessageFormat is the only known format that uses `{…}` to also delimit text.

[Mustache templates](https://mustache.github.io/mustache.5.html)
and related languages wrap "code" in `{{…}}`.
In addition to placeholders that are replaced by their interpolated value during formatting,
this also includes conditional blocks using `{{#…}}`/`{{/…}}` wrappers.

[Handlebars](https://handlebarsjs.com/guide/) extends Mustache expressions
with operators such as `{{#if …}}` and `{{#each …}}`,
as well as custom formatting functions that become available as e.g. `{{bold …}}`.

[Jinja templates](https://jinja.palletsprojects.com/en/3.1.x/templates/) separate
`{% statements %}` and `{{ expressions }}` from the base text.
The former may define tests that determine the inclusion of subsequent text blocks in the output.

One cost shared by the message formatting and templating languages mentioned above
is that they need some rule or behaviour that governs how to deal with whitespace at the beginning and end of a pattern,
as statements may be separated from each other by newlines or other constructs for legibility.

Other formats supporting multiple message variants tend to rely on a surrounding resource format to define variants,
such as [Rails internationalization](https://guides.rubyonrails.org/i18n.html#pluralization) in Ruby or YAML
and [Android String Resources](https://developer.android.com/guide/topics/resources/string-resource.html#Plurals) in XML.
These formats rely on the resource format providing clear delineation of the beginning and end of a pattern.

Based on available data,
no more than 0.3% of all messages and no more than 0.1% of messages with variants
contain leading or trailing whitespace.
No more than one third of this whitespace is localizable,
and most commonly it's due to improper segmentation or other internationalization bugs.

## Use-Cases

Most messages in any localization system do not contain any expressions, statements or variants.
These should be expressible as easily as possible.

Many messages include expressions that are meant to be replaced during formatting.
For example, a greeting like "Hello, {$username}!" would be formatted with the variable
`$username` being replaced by an input variable.

In some rare cases, replacement variables might be added (or removed) in one particular
locale versus messages in other locales.

Sometimes, the replacement variables need to be formatted.
For example, formatting a message like `You have {$distance} kilometers to go`
requires that the numeric value `{$distance}` be formatted as a number according to the locale: `You have 1,234 kilometers to go`.

Formatting of replacement variables might also require tailoring.
For example, if the author wants to show fractions of a kilometer in the above example
they might include a `minimumFractionDigits` option to get a result like
`You have 1,234.5 kilometers to go`.
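
The effect of such an option can be sketched as follows. This is only an approximation under stated assumptions: the `format_number` helper is hypothetical, it hard-codes en-US style separators through Python's format mini-language, and its default of at most three fraction digits is an assumption. A real formatter would take these details from locale data.

```python
def format_number(value: float,
                  minimum_fraction_digits: int = 0,
                  maximum_fraction_digits: int = 3) -> str:
    """Format with "," grouping and a bounded number of fraction digits."""
    max_digits = max(minimum_fraction_digits, maximum_fraction_digits)
    text = f"{value:,.{max_digits}f}"
    if max_digits > 0:
        integer_part, fraction = text.split(".")
        # Drop trailing zeros, then pad back up to the requested minimum.
        fraction = fraction.rstrip("0").ljust(minimum_fraction_digits, "0")
        text = f"{integer_part}.{fraction}" if fraction else integer_part
    return text


print(f"You have {format_number(1234)} kilometers to go")
# You have 1,234 kilometers to go
print(f"You have {format_number(1234.5, minimum_fraction_digits=1)} kilometers to go")
# You have 1,234.5 kilometers to go
```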

Some messages need to choose between multiple patterns (called "variants").
For example, this is often related to the handling of numeric values,
in which the pattern used for formatting depends on one of the data values
according to its plural category
(see [CLDR Plural Rules](https://cldr.unicode.org/index/cldr-spec/plural-rules) for more information).
So, in American English, the formatter might need to choose between formatting
`You have 1 kilometer to go` and `You have 2 kilometers to go`.
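
A minimal sketch of this kind of selection is shown below. The `plural_category_en` function is a hypothetical, English-only stand-in for a CLDR plural rule lookup and only handles integers; the `VARIANTS` table and `format_distance` helper are likewise invented for illustration.

```python
def plural_category_en(count: int) -> str:
    """English-only stand-in for a CLDR plural rule lookup (integers only)."""
    return "one" if count == 1 else "other"


# One pattern per plural category; the selector picks between them.
VARIANTS = {
    "one": "You have {distance} kilometer to go",
    "other": "You have {distance} kilometers to go",
}


def format_distance(distance: int) -> str:
    pattern = VARIANTS[plural_category_en(distance)]
    return pattern.format(distance=f"{distance:,}")


print(format_distance(1))  # You have 1 kilometer to go
print(format_distance(2))  # You have 2 kilometers to go
```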

Rarely, messages need to include leading or trailing whitespace due to
e.g. how they will be concatenated with other text,
or as a result of being segmented from some larger volume of text.

---

Developers who are editing a simple message and wish to add an `input` or `local` annotation
to the message do not want to reformat the message extensively.

Developers who have messages that include leading or trailing whitespace
want to ensure that this whitespace is included in the translatable
text portion of the message.
Which whitespace characters are displayed at runtime should not be surprising.

## Requirements

Common things should be easy, uncommon things should be possible.

Developers and translators should be able to read and write the syntax easily in a text editor.

Translators (and their tools) are not software engineers, so we want our syntax
to be as simple, robust, and non-fussy as possible.
Multiple levels of complex nesting should be avoided,
along with any constructs that require an excessive
level of precision on the part of non-technical authors.

As MessageFormat 2 will be at best a secondary language to all its authors and editors,
it should conform to user expectations and require as little learning as possible.

The syntax should avoid footguns,
in particular as it's passed through various tools during formatting.

ASCII-compatible syntax. While support for non-ASCII characters for variable names,
values, literals, options, and the like is important, the syntax itself should
be restricted to ASCII characters. This allows the message to be parsed
visually by humans even when embedded in a syntax that requires escaping.

Whitespace is forgiving.
We _require_ the minimum amount of whitespace and allow
authors to format or change unimportant whitespace as much as they want.
This avoids the need for translators or tools to be super pedantic about
formatting.

## Constraints

Limiting the range of characters that need to be escaped in plain text is important.

The current syntax includes some plain-ASCII keywords:
`input`, `local`, `match`, and `when`.

The current syntax and active proposals include some sigil + name combinations,
such as `:number`, `$var`, `|literal|`, `+bold`, and `@attr`.

The current syntax supports unquoted literal values as operands.

Messages themselves are "simple strings" and must be considered to be a single
line of text. In many containing formats, newlines will be represented as the local
equivalent of `\n`.

## Proposed Design

### Start in text, encapsulate code, trim around statements

Allow for message patterns to not be quoted.

Encapsulate statements with `{…}` or otherwise distinguish them from
the primarily unquoted translatable message contents.

For messages with multiple variants,
separate the variants using `when` statements.

Trim whitespace between and around statements such as `input` and `when`,
but do not otherwise trim any leading or trailing whitespace from a message.
This allows for whitespace such as spaces and newlines to be used outside patterns
to make a message more readable.
Comment on lines +159 to +162
Collaborator

I'm concerned about this resulting in surprising misbehavior... consider a developer making the following change, in which removing all inputs from a message that still starts and ends with a line feed would not be expected to affect whitespace:

logOutMessage = ```
-{%input username}
-
-Log out {$username}?
+Log out?
```;

Collaborator Author

Ah, fair point. The issue here is that the above edit would add a new line to the message's start, yes? One way to avoid this would be to require a message with statements to not have leading whitespace. That might be a reasonable restriction.

Member

@aphillips Oct 25, 2023

I agree that @gibson042's illustration is informative. I would have guessed (if I didn't know anything about MF2) that the "before" of logOutMessage included a newline. With some of the options here, the newline after {%input $username} is consumed, but not the blank line after that. In other options both newlines are consumed unless the user specifically quoted the pattern or whitespace.

Also, note that many formats would encode the example as:

// using our current syntax but without quoting the pattern
var logOutMessage = "{{{#input $username}\n\nLog out {$username}?}}"


Allow for a pattern to be `{{…}}` quoted
such that it preserves its leading and/or trailing whitespace
even when preceded or followed by statements.
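
The trimming rule described in this section can be sketched for single-variant messages as follows. This is an illustration under assumptions, not the specified parser: the `{%…}` statement form is borrowed from the review discussion above, `split_message` is a hypothetical helper, and `when` variants are not handled.

```python
import re

# Hypothetical statement syntax taken from the review discussion: {%input $name}
# The surrounding \s* implements "trim whitespace between and around statements".
STATEMENT = re.compile(r"\s*\{%[^{}]*\}\s*")


def split_message(source: str) -> tuple[list[str], str]:
    """Return (statements, pattern) for a single-variant message."""
    statements = []
    position = 0
    while (match := STATEMENT.match(source, position)) is not None:
        statements.append(match.group().strip())
        position = match.end()
    if not statements:
        # A simple message is never trimmed.
        return [], source
    pattern = source[position:]
    if pattern.startswith("{{") and pattern.endswith("}}"):
        # A quoted pattern keeps its inner whitespace.
        pattern = pattern[2:-2]
    return statements, pattern


print(split_message("{%input $username}\n\nLog out {$username}?"))
# (['{%input $username}'], 'Log out {$username}?')
print(split_message("\nLog out?\n"))
# ([], '\nLog out?\n')
```

The two calls mirror the edit raised in the review thread: with a statement present the newlines are trimmed, while the statement-free version keeps its leading and trailing line feeds.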

## Alternatives Considered

### Start in code, encapsulate text

This approach treats messages as something like a resource format for pattern values.
Keywords are declared directly at the top level of a message,
and patterns are always surrounded by `{{…}}` or some other delimiters.

Whitespace in patterns is never trimmed.

The `{{…}}` are required for all messages,
including ones that only consist of text.
Delimiters of the resource format are required in addition to this,
so messages may appear wrapped as e.g. `"{{…}}"`.
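
To make the double wrapping concrete, this small sketch stores the same simple message in a JSON resource under both approaches. The helper names `as_text_first` and `as_code_first` and the choice of JSON as the host format are assumptions for illustration.

```python
import json


def as_text_first(pattern: str) -> str:
    """Text-first: a simple pattern is already a complete message."""
    return pattern


def as_code_first(pattern: str) -> str:
    """Code-first: every pattern needs its own delimiters."""
    return "{{" + pattern + "}}"


for encode in (as_text_first, as_code_first):
    print(json.dumps({"greeting": encode("Hello, world!")}))
# {"greeting": "Hello, world!"}
# {"greeting": "{{Hello, world!}}"}
```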

This option is not chosen due to adding an excessive
quoting burden on all messages.
Comment on lines +183 to +184
Member

I think we should not include these "This option is not chosen due..." paragraphs. I think it is okay to call out objective or subjective reasons for why we might not choose a given alternative, e.g.

Suggested change
- This option is not chosen due to adding an excessive
- quoting burden on all messages.
+ - This option makes plain text strings invalid as messages.
+ - This option requires additional quoting for simple messages.

Our choice section should deal with the logic of why a given option was chosen.

Collaborator Author

Could you clarify what you mean by "choice section"? I'm not sure that I understand what that is.


### Start in text, encapsulate code, re-encapsulate text within code

As in the proposed design, simple patterns are unquoted.
Patterns in messages with statements, however,
are required to always be surrounded by `{{…}}` or some other delimiters.

This effectively means that some syntax will "enable" code mode for a message,
and that patterns in such a message need delimiters.

This option is not chosen due to adding an excessive
quoting burden on all multi-variant messages,
as well as introducing an unnecessary additional conceptual layer to the syntax.

### Start in text, encapsulate code, trim minimally

This is the same as the proposed design,
but with a different trimming rule:

- Trim all spaces before and between declarations.
- For single-variant messages, trim one newline after the last declaration.
- For multivariant messages,
  trim one space after a `when` statement and
  one newline followed by any spaces before a subsequent `when` statement.

This option is not chosen due to the quoting being too magical.
Even though this allows for all patterns with whitespace to not need quotes,
the cost in complexity is too great.

### Start in text, encapsulate code, trim maximally

This is the same as the proposed design,
but with a different trimming rule:

- Trim all leading and trailing whitespace for each pattern.

Expressing the trimming on patterns rather than statements
means that leading and trailing spaces are also trimmed from simple messages.
This option is not chosen due to this being somewhat surprising,
especially when messages are embedded in host formats that have predefined means
of escaping and/or trimming leading and trailing spaces from a value.
Comment on lines +221 to +225
Member

Trimming simple patterns like this is a bridge too far for me. It should be a separate decision for the "trim XXX" options whether they are trimmed. I can make a plausible argument for why simple patterns should behave differently when trimmed than variant patterns do.


### Start in text, encapsulate code, do not trim

This is the same as the proposed design,
but with two simplifications:

- No whitespace is ever trimmed.
- Quoting a pattern with `{{…}}` is dropped as unnecessary.

With these changes,
all whitespace would need to be explicitly within the "code" part of the syntax,
and patterns could never be separated from statements
without adding whitespace to the pattern.
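
The consequence can be sketched as below, again assuming the hypothetical `{%…}` statement form from the review discussion and a made-up `split_untrimmed` helper. Any line break placed after a statement becomes part of the pattern, so whitespace added for readability has to sit inside the braces instead.

```python
import re

# Hypothetical statement syntax from the review discussion; no whitespace is
# consumed outside the braces.
STATEMENT = re.compile(r"\{%[^{}]*\}")


def split_untrimmed(source: str) -> tuple[list[str], str]:
    """Return (statements, pattern) without trimming anything."""
    statements = []
    position = 0
    while (match := STATEMENT.match(source, position)) is not None:
        statements.append(match.group())
        position = match.end()
    return statements, source[position:]


print(split_untrimmed("{%input $username}\nLog out {$username}?"))
# (['{%input $username}'], '\nLog out {$username}?')
print(split_untrimmed("{%input $username\n}Log out {$username}?"))
# (['{%input $username\n}'], 'Log out {$username}?')
```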