Commit 3daea5a

eemeli, aphillips, and github-actions[bot] authored
Add message parse mode (code vs text) design doc (#474)
* Draft message parse mode (code vs text) design doc
* Note potential conflict with unquoted string literals
* Update 0474-text-vs-code.md
* style: Apply Prettier
* Update 0474-text-vs-code.md
* style: Apply Prettier
* scratch pad use of this design document for purely evil reasons
* style: Apply Prettier
* fixing the evil checklist
* style: Apply Prettier
* Remove checklist to the wiki
* style: Apply Prettier
* Tweak requirements and use cases
* style: Apply Prettier
* Proposing a design ... not expecting us to adopt it, but we need to make progress in deciding the specific issues here.
* style: Apply Prettier
* Typo in `match` ... which is perhaps indicative of an answer to one of the questions about double-bracketing `match`...
* Update exploration/0474-text-vs-code.md
* Rename exploration/0474-text-vs-code.md -> exploration/text-vs-code.md
* Drop the "highly experimental" section documenting a Slack conversation
* Update alternatives, dropping explicit syntaxes
* Apply suggestions from code review
  Co-authored-by: Addison Phillips <[email protected]>
* Add "Start in text, encapsulate code, do not trim"
* Refer to authors/developers/translators, not "users"

---------

Co-authored-by: Addison Phillips <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
1 parent 5065a51 commit 3daea5a

1 file changed

+238
-0
lines changed

exploration/text-vs-code.md

Lines changed: 238 additions & 0 deletions
@@ -0,0 +1,238 @@
1+
# Message Parse Mode
2+
3+
Status: **Proposed**
4+
5+
<details>
6+
<summary>Metadata</summary>
7+
<dl>
8+
<dt>Contributors</dt>
9+
<dd>@eemeli</dd>
10+
<dd>@aphillips</dd><!-- Seville and other inserted edits -->
11+
<dt>First proposed</dt>
12+
<dd>2023-09-13</dd>
13+
<dt>Pull Request</dt>
14+
<dd><a href="https://github.com/unicode-org/message-format-wg/pull/474">#474</a></dd>
15+
</dl>
16+
</details>
17+
18+
## Objective
19+
20+
Decide whether text patterns or code statements should be enclosed in MF2.
21+
22+
## Background
23+
24+
Existing message and template formatting languages tend to start in "text" mode,
25+
and require special syntax like `{{` or `{%` to enter "code" mode.
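The "start in text" convention can be illustrated with a minimal sketch. It assumes Mustache-style `{{…}}` delimiters only; the function name `split_modes` and the `"text"`/`"code"` labels are made up for this illustration and do not come from any of the languages mentioned here.

```python
import re

def split_modes(template: str) -> list[tuple[str, str]]:
    """Split a template into ("text", ...) and ("code", ...) parts.

    Everything is text by default; only `{{...}}` switches to code mode.
    """
    parts: list[tuple[str, str]] = []
    pos = 0
    for match in re.finditer(r"\{\{(.*?)\}\}", template):
        if match.start() > pos:
            # Text before the delimiter is kept verbatim, whitespace included.
            parts.append(("text", template[pos:match.start()]))
        # Whitespace inside the delimiters is not significant in this sketch.
        parts.append(("code", match.group(1).strip()))
        pos = match.end()
    if pos < len(template):
        parts.append(("text", template[pos:]))
    return parts

print(split_modes("Hello, {{ name }}!"))
```

A message with no delimiters at all comes back as a single text part, which is the property that makes this convention cheap for simple messages.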
26+
27+
ICU MessageFormat and Fluent both support inline selectors
28+
separated from the text using `{…}` for multi-variant messages.
29+
ICU MessageFormat is the only known format that uses `{…}` to also delimit text.
30+
31+
[Mustache templates](https://mustache.github.io/mustache.5.html)
32+
and related languages wrap "code" in `{{…}}`.
33+
In addition to placeholders that are replaced by their interpolated value during formatting,
34+
this also includes conditional blocks using `{{#…}}`/`{{/…}}` wrappers.
35+
36+
[Handlebars](https://handlebarsjs.com/guide/) extends Mustache expressions
37+
with operators such as `{{#if …}}` and `{{#each …}}`,
38+
as well as custom formatting functions that become available as e.g. `{{bold …}}`.
39+
40+
[Jinja templates](https://jinja.palletsprojects.com/en/3.1.x/templates/) separate
41+
`{% statements %}` and `{{ expressions }}` from the base text.
42+
The former may define tests that determine the inclusion of subsequent text blocks in the output.
43+
44+
A cost of the message formatting and templating languages mentioned above is that they need to rely on
45+
some rule or behaviour that governs how to deal with whitespace at the beginning and end of a pattern,
46+
as statements may be separated from each other by newlines or other constructs for legibility.
47+
48+
Other formats supporting multiple message variants tend to rely on a surrounding resource format to define variants,
49+
such as [Rails internationalization](https://guides.rubyonrails.org/i18n.html#pluralization) in Ruby or YAML
50+
and [Android String Resources](https://developer.android.com/guide/topics/resources/string-resource.html#Plurals) in XML.
51+
These formats rely on the resource format providing clear delineation of the beginning and end of a pattern.
52+
53+
Based on available data,
54+
no more than 0.3% of all messages and no more than 0.1% of messages with variants
55+
contain leading or trailing whitespace.
56+
No more than one third of this whitespace is localizable,
57+
and most commonly it's due to improper segmentation or other internationalization bugs.
58+
59+
## Use-Cases
60+
61+
Most messages in any localization system do not contain any expressions, statements or variants.
62+
These should be expressible as easily as possible.
63+
64+
Many messages include expressions that are meant to be replaced during formatting.
65+
For example, a greeting like "Hello, {$username}!" would be formatted with the variable
66+
`$username` being replaced by an input variable.
67+
68+
In some rare cases, replacement variables might be added (or removed) in one particular
69+
locale versus messages in other locales.
70+
71+
Sometimes, the replacement variables need to be formatted.
72+
For example, formatting a message like `You have {$distance} kilometers to go`
73+
requires that the numeric value `{$distance}` be formatted as a number according to the locale: `You have 1,234 kilometers to go`.
74+
75+
Formatting of replacement variables might also require tailoring.
76+
For example, if the author wants to show fractions of a kilometer in the above example
77+
they might include a `minimumFractionDigits` option to get a result like
78+
`You have 1,234.5 kilometers to go`.
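As a rough sketch of what such an option does, the Python below emulates a `minimumFractionDigits` setting. It is not locale-aware: it hard-codes English-style grouping and decimal separators, and the default of three maximum fraction digits is an assumption made for this illustration. A real formatter would take these from locale data.

```python
def format_number(value: float,
                  minimum_fraction_digits: int = 0,
                  maximum_fraction_digits: int = 3) -> str:
    """Format with ',' grouping, keeping at least the minimum fraction digits.

    Assumes minimum_fraction_digits <= maximum_fraction_digits.
    """
    text = f"{value:,.{maximum_fraction_digits}f}"
    if "." in text:
        whole, fraction = text.split(".")
        # Drop insignificant trailing zeros, then pad back up to the minimum.
        fraction = fraction.rstrip("0").ljust(minimum_fraction_digits, "0")
        text = f"{whole}.{fraction}" if fraction else whole
    return text

distance = format_number(1234.5, minimum_fraction_digits=1)
print(f"You have {distance} kilometers to go")  # You have 1,234.5 kilometers to go
```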
79+
80+
Some messages need to choose between multiple patterns (called "variants").
81+
For example, this is often related to the handling of numeric values,
82+
in which the pattern used for formatting depends on one of the data values
83+
according to its plural category
84+
(see [CLDR Plural Rules](https://cldr.unicode.org/index/cldr-spec/plural-rules) for more information).
85+
So, in American English, the formatter might need to choose between formatting
86+
`You have 1 kilometer to go` and `You have 2 kilometers to go`.
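A minimal sketch of variant selection follows. It assumes integer input and a simplified English cardinal rule (`one` for exactly 1, `other` otherwise); real implementations take the rules from CLDR data. The names `VARIANTS`, `plural_category_en`, and `format_message` are hypothetical.

```python
# Hypothetical variant table, keyed by plural category.
VARIANTS = {
    "one": "You have {distance} kilometer to go",
    "other": "You have {distance} kilometers to go",
}

def plural_category_en(n: int) -> str:
    """Simplified English cardinal rule for integers only."""
    return "one" if n == 1 else "other"

def format_message(distance: int) -> str:
    # Select the pattern first, then fill in the placeholder.
    pattern = VARIANTS[plural_category_en(distance)]
    return pattern.format(distance=distance)

print(format_message(1))  # You have 1 kilometer to go
print(format_message(2))  # You have 2 kilometers to go
```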
87+
88+
Rarely, messages need to include leading or trailing whitespace due to
89+
e.g. how they will be concatenated with other text,
90+
or as a result of being segmented from some larger volume of text.
91+
92+
---
93+
94+
Developers editing a simple message who wish to add an `input` or `local` annotation
95+
to the message do not wish to reformat the message extensively.
96+
97+
Developers who have messages that include leading or trailing whitespace
98+
want to ensure that this whitespace is included in the translatable
99+
text portion of the message.
100+
Which whitespace characters are displayed at runtime should not be surprising.
101+
102+
## Requirements
103+
104+
Common things should be easy, uncommon things should be possible.
105+
106+
Developers and translators should be able to read and write the syntax easily in a text editor.
107+
108+
Translators (and their tools) are not software engineers, so we want our syntax
109+
to be as simple, robust, and non-fussy as possible.
110+
Multiple levels of complex nesting should be avoided,
111+
along with any constructs that require an excessive
112+
level of precision on the part of non-technical authors.
113+
114+
As MessageFormat 2 will be at best a secondary language to all its authors and editors,
115+
it should conform to user expectations and require as little learning as possible.
116+
117+
The syntax should avoid footguns,
118+
in particular as it's passed through various tools during formatting.
119+
120+
ASCII-compatible syntax. While support for non-ASCII characters for variable names,
121+
values, literals, options, and the like is important, the syntax itself should
122+
be restricted to ASCII characters. This allows the message to be parsed
123+
visually by humans even when embedded in a syntax that requires escaping.
124+
125+
Whitespace is forgiving.
126+
We _require_ the minimum amount of whitespace and allow
127+
authors to format or change unimportant whitespace as much as they want.
128+
This avoids the need for translators or tools to be super pedantic about
129+
formatting.
130+
131+
## Constraints
132+
133+
Limiting the range of characters that need to be escaped in plain text is important.
134+
135+
The current syntax includes some plain-ASCII keywords:
136+
`input`, `local`, `match`, and `when`.
137+
138+
The current syntax and active proposals include some sigil + name combinations,
139+
such as `:number`, `$var`, `|literal|`, `+bold`, and `@attr`.
140+
141+
The current syntax supports unquoted literal values as operands.
142+
143+
Messages themselves are "simple strings" and must be considered to be a single
144+
line of text. In many containing formats, newlines will be represented as the local
145+
equivalent of `\n`.
146+
147+
## Proposed Design
148+
149+
### Start in text, encapsulate code, trim around statements
150+
151+
Allow for message patterns to not be quoted.
152+
153+
Encapsulate statements with `{…}`, or otherwise distinguish them from
154+
the primarily unquoted translatable message contents.
155+
156+
For messages with multiple variants,
157+
separate the variants using `when` statements.
158+
159+
Trim whitespace between and around statements such as `input` and `when`,
160+
but do not otherwise trim any leading or trailing whitespace from a message.
161+
This allows for whitespace such as spaces and newlines to be used outside patterns
162+
to make a message more readable.
163+
164+
Allow for a pattern to be `{{…}}` quoted
165+
such that it preserves its leading and/or trailing whitespace
166+
even when preceded or followed by statements.
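One possible reading of this trimming rule is sketched below. It is not a parser: it assumes the message has already been split into statements and raw pattern segments, and the function name and its two flags are hypothetical. Whether whitespace at the very end of a message counts as being "around statements" is left to the caller through the flags.

```python
def resolve_pattern(segment: str,
                    preceded_by_statement: bool,
                    followed_by_statement: bool) -> str:
    """Apply the proposed trimming rule to one raw pattern segment."""
    stripped = segment.strip()
    adjacent = preceded_by_statement or followed_by_statement
    quoted = (len(stripped) >= 4
              and stripped.startswith("{{") and stripped.endswith("}}"))
    if adjacent and quoted:
        # A quoted pattern keeps the whitespace inside its delimiters.
        return stripped[2:-2]
    if preceded_by_statement:
        segment = segment.lstrip()
    if followed_by_statement:
        segment = segment.rstrip()
    # A simple message (no statements on either side) is returned untouched.
    return segment

print(repr(resolve_pattern(" Hello ", False, False)))         # ' Hello '
print(repr(resolve_pattern("\n  Hello \n", True, True)))      # 'Hello'
print(repr(resolve_pattern("\n  {{ Hello }}\n", True, True))) # ' Hello '
```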
167+
168+
## Alternatives Considered
169+
170+
### Start in code, encapsulate text
171+
172+
This approach treats messages as something like a resource format for pattern values.
173+
Keywords are declared directly at the top level of a message,
174+
and patterns are always surrounded by `{{…}}` or some other delimiters.
175+
176+
Whitespace in patterns is never trimmed.
177+
178+
The `{{…}}` are required for all messages,
179+
including ones that only consist of text.
180+
Delimiters of the resource format are required in addition to this,
181+
so messages may appear wrapped as e.g. `"{{…}}"`.
182+
183+
This option is not chosen due to adding an excessive
184+
quoting burden on all messages.
185+
186+
### Start in text, encapsulate code, re-encapsulate text within code
187+
188+
As in the proposed design, simple patterns are unquoted.
189+
Patterns in messages with statements, however,
190+
are required to always be surrounded by `{{…}}` or some other delimiters.
191+
192+
This effectively means that some syntax will "enable" code mode for a message,
193+
and that patterns in such a message need delimiters.
194+
195+
This option is not chosen due to adding an excessive
196+
quoting burden on all multi-variant messages,
197+
as well as introducing an unnecessary additional conceptual layer to the syntax.
198+
199+
### Start in text, encapsulate code, trim minimally
200+
201+
This is the same as the proposed design,
202+
but with a different trimming rule:
203+
204+
- Trim all spaces before and between declarations.
205+
- For single-variant messages, trim one newline after the last declaration.
206+
- For multivariant messages,
207+
trim one space after a `when` statement and
208+
one newline followed by any spaces before a subsequent `when` statement.
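The multivariant part of this rule might look like the sketch below. It assumes each variant's raw pattern has already been cut out as the text between one `when` statement and the next; the function name is hypothetical.

```python
import re

def trim_minimal_variant(segment: str) -> str:
    """Minimal trimming for one variant pattern in a multivariant message."""
    # Trim exactly one space after the `when` statement, if present.
    if segment.startswith(" "):
        segment = segment[1:]
    # Trim one newline plus any following spaces before the next `when`.
    return re.sub(r"\n *\Z", "", segment, count=1)

print(repr(trim_minimal_variant(" Hello \n  ")))  # 'Hello '
print(repr(trim_minimal_variant("  Hello")))      # ' Hello'
```

Note how a second leading space or a trailing space before the newline survives. This is what lets patterns with whitespace go unquoted, and also what makes the rule harder to predict.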
209+
210+
This option is not chosen due to the quoting being too magical.
211+
Even though this allows for all patterns with whitespace to not need quotes,
212+
the cost in complexity is too great.
213+
214+
### Start in text, encapsulate code, trim maximally
215+
216+
This is the same as the proposed design,
217+
but with a different trimming rule:
218+
219+
- Trim all leading and trailing whitespace for each pattern.
220+
221+
Expressing the trimming on patterns rather than statements
222+
means that leading and trailing spaces are also trimmed from simple messages.
223+
This option is not chosen due to this being somewhat surprising,
224+
especially when messages are embedded in host formats that have predefined means
225+
of escaping and/or trimming leading and trailing spaces from a value.
226+
227+
### Start in text, encapsulate code, do not trim
228+
229+
This is the same as the proposed design,
230+
but with two simplifications:
231+
232+
- No whitespace is ever trimmed.
233+
- Quoting a pattern with `{{…}}` is dropped as unnecessary.
234+
235+
With these changes,
236+
all whitespace would need to be explicitly within the "code" part of the syntax,
237+
and patterns could never be separated from statements
238+
without adding whitespace to the pattern.
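To compare the trimming alternatives side by side, the sketch below reduces each rule to a function of a raw pattern segment and a flag saying whether the segment sits next to a statement. This is a deliberate simplification: the function names are hypothetical, and `{{…}}` quoting and the minimal-trimming rule are left out.

```python
def trim_proposed(segment: str, next_to_statement: bool) -> str:
    # Proposed design: trim only around statements.
    return segment.strip() if next_to_statement else segment

def trim_maximal(segment: str, next_to_statement: bool) -> str:
    # "Trim maximally": every pattern is trimmed, simple messages included.
    return segment.strip()

def trim_none(segment: str, next_to_statement: bool) -> str:
    # "Do not trim": whitespace is always part of the pattern.
    return segment

RULES = {"proposed": trim_proposed, "maximal": trim_maximal, "none": trim_none}

# A simple message with a leading space, and a pattern placed on its own
# line after a statement.
for name, rule in RULES.items():
    print(name, repr(rule(" and more", False)), repr(rule("\n  Hello\n", True)))
```

Under these assumptions only the maximal rule changes the simple message, and only the no-trim rule leaks the line break and indentation into the pattern.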