| 1 | +# Message Parse Mode |
| 2 | + |
| 3 | +Status: **Proposed** |
| 4 | + |
| 5 | +<details> |
| 6 | + <summary>Metadata</summary> |
| 7 | + <dl> |
| 8 | + <dt>Contributors</dt> |
| 9 | + <dd>@eemeli</dd> |
| 10 | + <dd>@aphillips</dd><!-- Seville and other inserted edits --> |
| 11 | + <dt>First proposed</dt> |
| 12 | + <dd>2023-09-13</dd> |
| 13 | + <dt>Pull Request</dt> |
| 14 | + <dd><a href="https://github.com/unicode-org/message-format-wg/pull/474">#474</a></dd> |
| 15 | + </dl> |
| 16 | +</details> |
| 17 | + |
| 18 | +## Objective |
| 19 | + |
| 20 | +Decide whether text patterns or code statements should be enclosed in MF2. |
| 21 | + |
| 22 | +## Background |
| 23 | + |
| 24 | +Existing message and template formatting languages tend to start in "text" mode, |
| 25 | +and require special syntax like `{{` or `{%` to enter "code" mode. |
| 26 | + |
| 27 | +ICU MessageFormat and Fluent both support inline selectors |
| 28 | +separated from the text using `{…}` for multi-variant messages. |
| 29 | +ICU MessageFormat is the only known format that uses `{…}` to also delimit text. |
| 30 | + |
| 31 | +[Mustache templates](https://mustache.github.io/mustache.5.html) |
| 32 | +and related languages wrap "code" in `{{…}}`. |
| 33 | +In addition to placeholders that are replaced by their interpolated value during formatting, |
| 34 | +this also includes conditional blocks using `{{#…}}`/`{{/…}}` wrappers. |
| 35 | + |
| 36 | +[Handlebars](https://handlebarsjs.com/guide/) extends Mustache expressions |
| 37 | +with operators such as `{{#if …}}` and `{{#each …}}`, |
| 38 | +as well as custom formatting functions that become available as e.g. `{{bold …}}`. |
| 39 | + |
| 40 | +[Jinja templates](https://jinja.palletsprojects.com/en/3.1.x/templates/) separate |
| 41 | +`{% statements %}` and `{{ expressions }}` from the base text. |
| 42 | +The former may define tests that determine the inclusion of subsequent text blocks in the output. |
| 43 | + |
| 44 | +One cost shared by the message formatting and templating languages mentioned above |
| 45 | +is that they need some rule or behaviour that governs how to deal with whitespace at the beginning and end of a pattern, |
| 46 | +as statements may be separated from each other by newlines or other constructs for legibility. |
| 47 | + |
| 48 | +Other formats supporting multiple message variants tend to rely on a surrounding resource format to define variants, |
| 49 | +such as [Rails internationalization](https://guides.rubyonrails.org/i18n.html#pluralization) in Ruby or YAML |
| 50 | +and [Android String Resources](https://developer.android.com/guide/topics/resources/string-resource.html#Plurals) in XML. |
| 51 | +These formats rely on the resource format providing clear delineation of the beginning and end of a pattern. |
| 52 | + |
| 53 | +Based on available data, |
| 54 | +no more than 0.3% of all messages and no more than 0.1% of messages with variants |
| 55 | +contain leading or trailing whitespace. |
| 56 | +No more than one third of this whitespace is localizable, |
| 57 | +and most commonly it's due to improper segmentation or other internationalization bugs. |
| 58 | + |
| 59 | +## Use-Cases |
| 60 | + |
| 61 | +Most messages in any localization system do not contain any expressions, statements or variants. |
| 62 | +These should be expressible as easily as possible. |
| 63 | + |
| 64 | +Many messages include expressions that are meant to be replaced during formatting. |
| 65 | +For example, a greeting like "Hello, {$username}!" would be formatted with the variable |
| 66 | +`$username` being replaced by an input variable. |
| 67 | + |
| 68 | +In some rare cases, the message for one particular locale might add (or remove) |
| 69 | +replacement variables compared to the same message in other locales. |
| 70 | + |
| 71 | +Sometimes, the replacement variables need to be formatted. |
| 72 | +For example, formatting a message like `You have {$distance} kilometers to go` |
| 73 | +requires that the numeric value `{$distance}` be formatted as a number according to the locale: `You have 1,234 kilometers to go`. |
| 74 | + |
| 75 | +Formatting of replacement variables might also require tailoring. |
| 76 | +For example, if the author wants to show fractions of a kilometer in the above example |
| 77 | +they might include a `minimumFractionDigits` option to get a result like |
| 78 | +`You have 1,234.5 kilometers to go`. |
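As a rough illustration of the two formatting use cases above, the following sketch uses the standard ECMAScript `Intl.NumberFormat` API rather than MessageFormat 2 itself. The `formatDistance` helper and its template string are hypothetical and exist only for this example.

```typescript
// Illustrative only: a hypothetical helper built on the standard Intl API,
// not on any MessageFormat 2 implementation.
function formatDistance(
  locale: string,
  distance: number,
  minimumFractionDigits: number = 0
): string {
  // Locale-aware number formatting, optionally tailored with an option.
  const formatter = new Intl.NumberFormat(locale, { minimumFractionDigits });
  return "You have " + formatter.format(distance) + " kilometers to go";
}

// Default formatting of the numeric value.
const plainDistance = formatDistance("en-US", 1234);
// Tailored formatting that shows fractions of a kilometer.
const tailoredDistance = formatDistance("en-US", 1234.5, 1);

console.log(plainDistance);
console.log(tailoredDistance);
```

In MessageFormat 2 the option would be attached to the placeholder in the message rather than passed by the calling code; the sketch only shows the effect of the option on the output.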
| 79 | + |
| 80 | +Some messages need to choose between multiple patterns (called "variants"). |
| 81 | +This is often related to the handling of numeric values, |
| 82 | +where the pattern used for formatting depends on |
| 83 | +the plural category of one of the data values |
| 84 | +(see [CLDR Plural Rules](https://cldr.unicode.org/index/cldr-spec/plural-rules) for more information). |
| 85 | +So, in American English, the formatter might need to choose between formatting |
| 86 | +`You have 1 kilometer to go` and `You have 2 kilometers to go`. |
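The selection described above can be sketched with the standard ECMAScript `Intl.PluralRules` API. This is not MessageFormat 2 syntax: the `variants` table, the `{count}` placeholder, and the `selectVariant` helper are assumptions made up for this example.

```typescript
// Hypothetical variant table keyed by CLDR plural category.
// The "{count}" placeholder is an ad hoc convention for this sketch only.
const variants = {
  one: "You have {count} kilometer to go",
  other: "You have {count} kilometers to go",
};

function selectVariant(locale: string, count: number): string {
  // Intl.PluralRules maps a number to its plural category for the locale.
  const category = new Intl.PluralRules(locale).select(count);
  const pattern = category === "one" ? variants.one : variants.other;
  return pattern.replace("{count}", new Intl.NumberFormat(locale).format(count));
}

const singular = selectVariant("en-US", 1);
const plural = selectVariant("en-US", 2);

console.log(singular);
console.log(plural);
```

Only the `one` and `other` categories are handled because those are the cardinal categories used by English; other locales would need more variants.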
| 87 | + |
| 88 | +Rarely, messages need to include leading or trailing whitespace due to |
| 89 | +e.g. how they will be concatenated with other text, |
| 90 | +or as a result of being segmented from some larger volume of text. |
| 91 | + |
| 92 | +--- |
| 93 | + |
| 94 | +Developers who edit a simple message and wish to add an `input` or `local` annotation |
| 95 | +to the message do not want to reformat the message extensively. |
| 96 | + |
| 97 | +Developers who have messages that include leading or trailing whitespace |
| 98 | +want to ensure that this whitespace is included in the translatable |
| 99 | +text portion of the message. |
| 100 | +Which whitespace characters are displayed at runtime should not be surprising. |
| 101 | + |
| 102 | +## Requirements |
| 103 | + |
| 104 | +Common things should be easy, uncommon things should be possible. |
| 105 | + |
| 106 | +Developers and translators should be able to read and write the syntax easily in a text editor. |
| 107 | + |
| 108 | +Translators (and their tools) are not software engineers, so we want our syntax |
| 109 | +to be as simple, robust, and non-fussy as possible. |
| 110 | +Multiple levels of complex nesting should be avoided, |
| 111 | +along with any constructs that require an excessive |
| 112 | +level of precision on the part of non-technical authors. |
| 113 | + |
| 114 | +As MessageFormat 2 will be at best a secondary language to all its authors and editors, |
| 115 | +it should conform to user expectations and require as little learning as possible. |
| 116 | + |
| 117 | +The syntax should avoid footguns, |
| 118 | +in particular as it's passed through various tools during formatting. |
| 119 | + |
| 120 | +ASCII-compatible syntax. While support for non-ASCII characters for variable names, |
| 121 | +values, literals, options, and the like is important, the syntax itself should |
| 122 | +be restricted to ASCII characters. This allows the message to be parsed |
| 123 | +visually by humans even when embedded in a syntax that requires escaping. |
| 124 | + |
| 125 | +Whitespace is forgiving. |
| 126 | +We _require_ the minimum amount of whitespace and allow |
| 127 | +authors to format or change unimportant whitespace as much as they want. |
| 128 | +This avoids the need for translators or tools to be super pedantic about |
| 129 | +formatting. |
| 130 | + |
| 131 | +## Constraints |
| 132 | + |
| 133 | +Limiting the range of characters that need to be escaped in plain text is important. |
| 134 | + |
| 135 | +The current syntax includes some plain-ASCII keywords: |
| 136 | +`input`, `local`, `match`, and `when`. |
| 137 | + |
| 138 | +The current syntax and active proposals include some sigil + name combinations, |
| 139 | +such as `:number`, `$var`, `|literal|`, `+bold`, and `@attr`. |
| 140 | + |
| 141 | +The current syntax supports unquoted literal values as operands. |
| 142 | + |
| 143 | +Messages themselves are "simple strings" and must be considered to be a single |
| 144 | +line of text. In many containing formats, newlines will be represented as the local |
| 145 | +equivalent of `\n`. |
| 146 | + |
| 147 | +## Proposed Design |
| 148 | + |
| 149 | +### Start in text, encapsulate code, trim around statements |
| 150 | + |
| 151 | +Allow for message patterns to not be quoted. |
| 152 | + |
| 153 | +Encapsulate statements with `{…}`, or otherwise distinguish them from |
| 154 | +the primarily unquoted translatable message contents. |
| 155 | + |
| 156 | +For messages with multiple variants, |
| 157 | +separate the variants using `when` statements. |
| 158 | + |
| 159 | +Trim whitespace between and around statements such as `input` and `when`, |
| 160 | +but do not otherwise trim any leading or trailing whitespace from a message. |
| 161 | +This allows for whitespace such as spaces and newlines to be used outside patterns |
| 162 | +to make a message more readable. |
| 163 | + |
| 164 | +Allow for a pattern to be `{{…}}` quoted |
| 165 | +such that it preserves its leading and/or trailing whitespace |
| 166 | +even when preceded or followed by statements. |
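One possible reading of the trimming and quoting rules above is sketched below. It is not taken from any parser: the `PatternContext` type and `resolvePattern` helper are hypothetical, the sketch assumes the pattern text has already been separated from its surrounding statements, and it treats every character matched by the regular expression class `\s` as trimmable whitespace, which may be broader than what the syntax would allow.

```typescript
// Hypothetical description of where a pattern sits within a message.
interface PatternContext {
  precededByStatement: boolean;
  followedByStatement: boolean;
}

// Hypothetical helper: applies "trim around statements" to one raw pattern.
function resolvePattern(raw: string, ctx: PatternContext): string {
  let text = raw;
  // Whitespace is only trimmed on a side that touches a statement.
  if (ctx.precededByStatement) text = text.replace(/^\s+/, "");
  if (ctx.followedByStatement) text = text.replace(/\s+$/, "");
  // A {{…}} quoted pattern keeps all whitespace between its delimiters.
  const quoted =
    text.length >= 4 && text.slice(0, 2) === "{{" && text.slice(-2) === "}}";
  return quoted ? text.slice(2, -2) : text;
}

// A simple message: nothing is trimmed.
const simple = resolvePattern("  Hello  ", {
  precededByStatement: false,
  followedByStatement: false,
});
// A pattern after a statement: only the leading whitespace is trimmed.
const afterStatement = resolvePattern("\n  Hello ", {
  precededByStatement: true,
  followedByStatement: false,
});
// A quoted pattern between statements: inner whitespace is preserved.
const quotedPattern = resolvePattern("\n {{ Hello }}\n", {
  precededByStatement: true,
  followedByStatement: true,
});

console.log(JSON.stringify([simple, afterStatement, quotedPattern]));
```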
| 167 | + |
| 168 | +## Alternatives Considered |
| 169 | + |
| 170 | +### Start in code, encapsulate text |
| 171 | + |
| 172 | +This approach treats messages as something like a resource format for pattern values. |
| 173 | +Keywords are declared directly at the top level of a message, |
| 174 | +and patterns are always surrounded by `{{…}}` or some other delimiters. |
| 175 | + |
| 176 | +Whitespace in patterns is never trimmed. |
| 177 | + |
| 178 | +The `{{…}}` are required for all messages, |
| 179 | +including ones that only consist of text. |
| 180 | +Delimiters of the resource format are required in addition to this, |
| 181 | +so messages may appear wrapped as e.g. `"{{…}}"`. |
| 182 | + |
| 183 | +This option is not chosen due to adding an excessive |
| 184 | +quoting burden on all messages. |
| 185 | + |
| 186 | +### Start in text, encapsulate code, re-encapsulate text within code |
| 187 | + |
| 188 | +As in the proposed design, simple patterns are unquoted. |
| 189 | +Patterns in messages with statements, however, |
| 190 | +are required to always be surrounded by `{{…}}` or some other delimiters. |
| 191 | + |
| 192 | +This effectively means that some syntax will "enable" code mode for a message, |
| 193 | +and that patterns in such a message need delimiters. |
| 194 | + |
| 195 | +This option is not chosen due to adding an excessive |
| 196 | +quoting burden on all multi-variant messages, |
| 197 | +as well as introducing an unnecessary additional conceptual layer to the syntax. |
| 198 | + |
| 199 | +### Start in text, encapsulate code, trim minimally |
| 200 | + |
| 201 | +This is the same as the proposed design, |
| 202 | +but with a different trimming rule: |
| 203 | + |
| 204 | +- Trim all spaces before and between declarations. |
| 205 | +- For single-variant messages, trim one newline after the last declaration. |
| 206 | +- For multivariant messages, |
| 207 | + trim one space after a `when` statement and |
| 208 | + one newline followed by any spaces before a subsequent `when` statement. |
| 209 | + |
| 210 | +This option is not chosen due to the quoting being too magical. |
| 211 | +Even though this allows for all patterns with whitespace to not need quotes, |
| 212 | +the cost in complexity is too great. |
| 213 | + |
| 214 | +### Start in text, encapsulate code, trim maximally |
| 215 | + |
| 216 | +This is the same as the proposed design, |
| 217 | +but with a different trimming rule: |
| 218 | + |
| 219 | +- Trim all leading and trailing whitespace for each pattern. |
| 220 | + |
| 221 | +Expressing the trimming on patterns rather than statements |
| 222 | +means that leading and trailing spaces are also trimmed from simple messages. |
| 223 | +This option is not chosen because such trimming would be somewhat surprising, |
| 224 | +especially when messages are embedded in host formats that have predefined means |
| 225 | +of escaping and/or trimming leading and trailing spaces from a value. |
| 226 | + |
| 227 | +### Start in text, encapsulate code, do not trim |
| 228 | + |
| 229 | +This is the same as the proposed design, |
| 230 | +but with two simplifications: |
| 231 | + |
| 232 | +- No whitespace is ever trimmed. |
| 233 | +- Quoting a pattern with `{{…}}` is dropped as unnecessary. |
| 234 | + |
| 235 | +With these changes, |
| 236 | +all whitespace would need to be explicitly within the "code" part of the syntax, |
| 237 | +and patterns could never be separated from statements |
| 238 | +without adding whitespace to the pattern. |