Delimiting patterns: `[ ... ]` vs `{ ... }` #255

mihnita · 2022-05-11T20:20:45Z

… please use curly braces for those delimiters; square brackets are common enough in real text. Together with using curly braces for placeholders, only {} are special and need escaping.

{Hello world!}
{Hello {$name}!}

mihnita · 2022-05-12T19:38:24Z

Comments migrated from the slides

Options for consideration

Slides comment, Stanisław Małolepszy (@stasm), 11:10 PM Apr 21

(Forked from another comment) @mihnita wrote:

Why [ ] and not { }? Cases A and B (maybe also C and D) would work just fine with { } instead of [ ]

I suggested [ ] because I'd like to make it clear which delimiters are for code, and which are for text. It's always bothered me that ICU MF used { } to mean both of these things.

With [ ], it becomes instantly clear which text is translatable. This is particularly important when the source language is English, because then the source text is written in the same language as the names of functions, options, option values, and CLDR grammar features are.

With { }, I agree it's still possible to make it unambiguous to parsers, but I hypothesize that the cognitive cost for humans is greater than the risk of having to escape [ ].

Slides comment, Mihai Nita (@mihnita), 2:03 PM Apr 21 (edited 2:05 PM Apr 21)

Note for A and B: why [ ] and not { }?
Pros, cons?
Yes, less { } nesting, maybe a bit more readable.
But makes [ ] escapable. And [ ] is more commonly used than { }

Any extra thing with special meaning needs a justification.

stasm · 2022-05-23T16:19:28Z

As I stated in the comment reproduced above, I have a strong preference for using square brackets [...] for delimiting translatable text. Moreover, I think it would be a mistake to specifically use curly braces {...} for this purpose.

From the point of view of discoverability of syntax rules, I'd argue that it's easier to teach users that { means "open a code block" rather than that it means "delimit code or text, depending on the context".

For example, switching to {...} and using our current develop syntax, I find it confusing to imagine how the following message would parse in users' heads:

{$userName :gender}
    masculine {{$userName} added a new photo to his album.}
    feminine {{$userName} added a new photo to her album.}
    _ {{$userName} added a new photo to their album.}

This is made worse by the fact that the CLDR names like masculine, feminine, the names of variables like $userName, and lastly the names of functions like :gender are all most likely going to be written in English, the same language that a majority of translatable source content will be in.

This is why I'm advocating that one of our design goals be to make it as obvious as possible which parts of the message are meant to be translated. I'm proposing that we use square brackets to that effect:

{$userName :gender}
    masculine [{$userName} added a new photo to his album.]
    feminine [{$userName} added a new photo to her album.]
    _ [{$userName} added a new photo to their album.]
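
To illustrate the "obvious which parts are translatable" goal, here is a minimal Python sketch that pulls the translatable spans out of the square-bracket variant of the message above. The regular expression and the `translatable_spans` helper are hypothetical, and the backslash escape rule is an assumption; none of this is taken from the spec or from an existing implementation.

```python
import re

# Hypothetical helper regex: one [...] pattern, allowing backslash escapes inside.
# Assumption: square brackets have no other meaning in the message syntax.
PATTERN_RE = re.compile(r"\[((?:\\.|[^\[\]\\])*)\]")


def translatable_spans(message: str) -> list[str]:
    """Return the text of every square-bracket-delimited pattern."""
    return PATTERN_RE.findall(message)


message = """{$userName :gender}
    masculine [{$userName} added a new photo to his album.]
    feminine [{$userName} added a new photo to her album.]
    _ [{$userName} added a new photo to their album.]"""

for span in translatable_spans(message):
    print(span)
```

With `{...}` used for both patterns and placeholders, a flat scan like this is not enough; the extractor would have to track nesting to tell a pattern from the selector or a placeholder.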

markusicu · 2022-05-31T00:27:53Z

I think that [] are common enough in regular text that having to escape them would be more cumbersome than it's worth.
I don't think that using {} for delimiters of both patterns and placeholders is overly confusing. That's not been a pain point with ICU MessageFormat, for example.
The [] lend themselves much more for enclosing selectors and selection keys.

stasm · 2022-05-31T17:34:33Z

I think that [] are common enough in regular text that having to escape them would be more cumbersome than it's worth.

I've been trying to collect some data on this. I estimate that fewer than 0.5% of messages contain square brackets in English and many other European languages. The percentage goes up for Japanese and Chinese (~1.5%) because these languages use square brackets as regular punctuation marks.

I think it's fair to assume that brackets are on average more common than curly braces ({}), but less common than round parentheses (()). At which point do we decide that they're common enough?

Also, we might want to consider which languages are most common as source languages, as they will be edited by hand more often than others. OTOH, target languages are more likely to be edited via tooling, in which case escaping requirements matter less.
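
For anyone who wants to repeat this kind of measurement on their own corpus, a rough sketch of the counting could look like the following. The `share_containing` helper and the sample messages are made up for illustration; this is not the script behind the estimates above, and the sample says nothing about real-world frequencies.

```python
def share_containing(messages, chars):
    """Fraction of messages that contain at least one of the given characters."""
    if not messages:
        return 0.0
    hits = sum(1 for m in messages if any(c in m for c in chars))
    return hits / len(messages)


# Made-up sample data; substitute a real message corpus here.
sample = ["Save", "Open [beta]", "Close (all)", "Hello {name}"]

for label, chars in [("square", "[]"), ("curly", "{}"), ("round", "()")]:
    print(label, share_containing(sample, chars))
```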

I don't think that using {} for delimiters of both patterns and placeholders is overly confusing. That's not been a pain point with ICU MessageFormat, for example.

I can only offer my personal experience as well as some anecdotal data back from my Mozilla days. I've always found it very confusing that ICU MF used braces for both code and text. This was perhaps made harder for me because ICU MF also didn't use any prefixes for argument names nor functions, which could result in messages like {count, plural, one {One thing.} other {{count, number} things.}}, where every single piece of code is a valid English word, and there are no cues as to what is localizable and what is code.

aphillips · 2022-05-31T18:06:41Z

In my experience, any requirement for escaping in the plain text is a source of error. Translators and other non-technical people are not prone to get it right and tooling only gets us so far.

I agree that the MF keywords provoke overtranslation, and protection of placeables is a constant worry.

I do prefer that in-text placeables look as codelike as possible and, as mentioned, that escapes be kept to an absolute minimum. The outer "quotes" don't matter so much (translators and others mostly won't see them) but using square brackets would require them to be escaped in line. Hence a preference for curly brackets.

macchiati · 2022-05-31T18:26:55Z

I echo "Translators and other non-technical people are not prone to get it right and tooling only gets us so far."

stasm · 2022-06-01T15:49:44Z

The outer "quotes" don't matter so much (translators and others mostly won't see them) but using square brackets would require them to be escaped in line. Hence a preference for curly brackets.

If we assume that translators and others mostly won't see the outer delimiters (which I agree with), it would be helpful from the tooling point of view to be able to also assume that a literal delimiter typed by a translator is meant to be part of the translation, and hence requires escaping when serialized to MF2 syntax.

This is the case for square brackets [] as pattern delimiters, because they have no other syntactical meaning. Does it also hold for the curly braces {}, which can also be used inside patterns to delimit placeholders?

Consider a message [Hello], using [] to delimit patterns:

A translator will see it as Hello.
If they now type Hello [square, the tooling can serialize this as [Hello \[square], because any square bracket typed by a translator is guaranteed to not be a pattern delimiter.

OTOH, consider a message {Hello}, using {} to delimit patterns:

A translator will see it as Hello.
If they now type Hello {curly, this actually looks like an unclosed placeholder. For someone familiar with MF2 syntax (and not relying only on CAT's editing UI) this could be a source of confusion, and could in fact lead to spurious and redundant escaping by hand.
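
The two serialization cases can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and backslash escaping of the special characters is an assumption, since the escaping rule had not been settled at this point in the discussion.

```python
def escape(text: str, special: str) -> str:
    """Backslash-escape every character of `text` that appears in `special`."""
    return "".join("\\" + ch if ch in special else ch for ch in text)


def serialize_square(text: str) -> str:
    # Assumption: with [...] pattern delimiters, brackets, braces and the
    # backslash itself are all special inside a pattern.
    return "[" + escape(text, "[]{}\\") + "]"


def serialize_curly(text: str) -> str:
    # Assumption: with {...} pattern delimiters, only braces and the
    # backslash are special inside a pattern.
    return "{" + escape(text, "{}\\") + "}"


print(serialize_square("Hello [square"))  # [Hello \[square]
print(serialize_curly("Hello {curly"))    # {Hello \{curly}
print(serialize_curly("Hello [square"))   # {Hello [square}
```

In both variants the tool can escape mechanically; the difference is how many characters are special and how the result reads to someone looking at the raw syntax.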

markusicu · 2022-06-07T22:51:55Z

If they now type Hello {curly, this actually looks like an unclosed placeholder.

Only to a developer editing the message string, and they need to learn which characters may be or must be escaped.

A translator normally works in a tool that hides as much syntax as possible, so if they type any syntax-relevant character, the tool must escape it.

For reasons stated here and echoed by Addison and Mark, I favor curly braces to enclose patterns pretty strongly.

I also favor using square brackets for selector lists and variant key lists. If we could agree on syntax that is just as easy to parse and read but without square brackets for those lists, then maybe they could be ok for enclosing patterns. But I don't think that they are better for enclosing patterns than curly braces.

stasm · 2022-06-09T07:46:26Z

For reasons stated here and echoed by Addison and Mark, I favor curly braces to enclose patterns pretty strongly.

I see and acknowledge the agreement forming about using curly braces, but I'd like someone to address the concerns I had about mixing translatable and non-translatable parts of the message by using the same syntax for them.

So far we're in a situation where one group says "we don't think it's overly confusing", to which the other group responds "indeed, it is confusing to us and we've seen others confused as well". How do we get out of this?

I think we should take a step back and:

consider who the primary audience and authors of raw-syntax messages will be,
evaluate how much tooling and CAT tools can help, and in particular, what it cannot do,
decide what the acceptable cost of avoiding vs. requiring escaping is, i.e. what is the "absolute minimum" of escaping that we're not willing to go over, and why.

mihnita · 2022-06-09T17:22:36Z

consider who the primary audience and authors of raw-syntax messages will be

I would say: developers.

I see all the time developers using raw formats (strings and layout files in Android, html in other cases) even when a friendlier WYSIWYG tool is available.

And I include in this "bucket" technical people that contribute translations to a small project
For example if I decide to contribute a Romanian translation to an open source project I might edit raw files.
Because I am still "a developer who speaks Romanian"

I actually did that before, and it breaks down fast.

As a developer I focus mainly on the code: algorithms, the logic, security, correctness, etc.
Adding a string is < 1% of my time.

But translation (even as a dev contributing to open source) is a 100%-time job.
Even if I do it once, for 4 hours, it is all about that.
And you quickly discover that you want some spell-checking, for example. Which you (often) don't get in raw editing.

Similar to "hey, I can write a quick and dirty script in Notepad".
But for anything more serious I need to go to an IDE.

decide what the acceptable cost of avoiding vs. requiring escaping is, i.e. what is the "absolute minimum" of escaping that we're not willing to go over, and why.

I strongly-strongly advocate that translators (when working inside a translation tool) should not have to do any escaping. WYSIWYG.
I should type Say "hi"!, and it is not my job to escape the " as \" or &quot;

mihnita · 2022-06-09T17:25:50Z

Reminder: for a refresher on localization tools and escaping (and other concerns) see the "Localization Concepts" document that I've shared a while ago.

mihnita · 2022-06-09T17:32:49Z

If they now type Hello {curly, this actually looks like an unclosed placeholder.

Translators should not type placeholder names.

The placeholders come from the source (or in some cases a dev can add info saying "btw, these are 5 extra placeholders available to you").
And for the translator they are read only boxes that they can add / move around / remove.

So they choose "insert placeholder" (from a menu or hotkey) they get a list with what is available.
Or the placeholders are there already, populated from the source, and they "translate around" them.

If they type "Hello {curly" then it is a correct string, WYSIWYG, and when exported from the l10n tool it is escaped as "Hello \{curly" (or whatever we choose as escaping rule)
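
A sketch of that editing model, assuming the tool stores a translation as a list of segments: literal text typed by the translator, and read-only placeholders copied from the source. The `Text`, `Placeholder` and `export_pattern` names are hypothetical, and the backslash escape stands in for "whatever we choose as escaping rule".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Text:
    value: str  # literal text typed by the translator (WYSIWYG)


@dataclass(frozen=True)
class Placeholder:
    source: str  # raw placeholder copied from the source message, read-only


def export_pattern(segments) -> str:
    """Serialize editor segments to a {...}-delimited pattern."""
    out = []
    for seg in segments:
        if isinstance(seg, Placeholder):
            out.append(seg.source)  # emitted verbatim, never escaped
        else:
            # Assumed escaping rule: backslash before {, } and \.
            out.append("".join("\\" + c if c in "{}\\" else c for c in seg.value))
    return "{" + "".join(out) + "}"


segments = [Text("Hello {curly, "), Placeholder("{$name}"), Text("!")]
print(export_pattern(segments))  # {Hello \{curly, {$name}!}
```

Because placeholders are separate objects rather than typed text, a brace typed by the translator can never be mistaken for the start of a placeholder on export.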

stasm · 2022-06-24T12:51:11Z

If they type "Hello {curly" then it is a correct string, WYSIWYG, and when exported from the l10n tool it is escaped as "Hello \{curly" (or whatever we choose as escaping rule)

Right, I agree with this. I was just trying to make the point that even if it works, it may look like a broken placeholder to a subset of translators who cared enough to learn that {...} are used for placeholders. Or maybe they use a CAT tool which displays the curly braces around placeholders, even if it makes them not editable? However, I take this argument back: I think we can assume that typing raw curly braces is a rather rare use-case, and we don't need to optimize for it.

stasm · 2022-06-24T14:31:58Z

I thought of one more risk associated with choosing {...} for pattern delimiters.

The current spec only allows placeholder expressions as local variables / named expressions.

$myCount = {$count :number maxFractionDigits=0}

However, we did talk in the past about extending this to allow whole patterns to be bound to local variables. See #149 for reference and the examples in my January proposal. For example:

$title = [Let's go, {$username}!]
[{button title=$title}Continue{/button}]

I acknowledge that such a feature is controversial: it creates risks related to concatenation, but it also allows things that don't currently have good built-in alternatives in MF. During the discussions with the CLDR TC, this idea was removed from the scope of MF 2.0, but I could see us revisiting it in the future.

If we choose the same delimiter for patterns as we do for placeholders, we'll have a syntax conflict:

$myCount = {$count :number maxFractionDigits=0}
$title = {Let's go, {$username}!}

(This may be an argument to rethink the local variable definition syntax, too, BTW).

I realize that this is a hypothetical risk. My point here is that by being more explicit and choosing different delimiters for different things, we'd make the MF syntax more resilient in the future, and more flexible for future extensions. In fact, I now realize that the same is true for my arguments in #286. We can find a minimal set of syntax productions that parse and work well for our current scope, but we also increase the risk of not being able to extend the syntax in the future by doing so.

stasm · 2022-06-24T15:21:03Z

Arguments against [...] as pattern delimiters:

More common than {...}, ergo, need to escape more frequently. In particular, a problem in Japanese and Chinese.
Two character sets now need escaping, curly braces (b/c of placeholders) and square brackets. This is harder for the users to remember and to get right.

Arguments in favor of [...] as pattern delimiters:

Establishes a visual separation between what's translatable (stuff in square brackets) and what's not (stuff in curly braces), which may help some users (developers authoring translations) learn the syntax.
More resilient to future extensions of the syntax, if any (see my comment above).

May be more friendly to recovering from syntax errors, because it's easier to deduce where the error happened. Consider:

  one [aaa {1} few [bbb] * [ccc]
              ^ A missing closing ].
                   ^ The opening [ here starts a new pattern; a parser can start recovery.

  one {aaa {1} few {bbb} * {ccc}
              ^ A missing closing }.
                   ^ This looks like a regular placeholder to the parser.
                                    ^ No closing }; the entire message is junk.
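
The difference can be sketched with two toy scanners in Python. Both are simplifications made for illustration: they ignore escapes, they are not a real MF2 parser, and the function names are hypothetical. The first assumes patterns use [...] and cannot nest, so a new [ while a pattern is still open marks a recovery point; the second only tracks brace depth, which is all a {...}-only syntax gives it.

```python
def unclosed_square_patterns(message: str) -> list[int]:
    """Indexes where a '[' opens a pattern while the previous one is still open."""
    errors = []
    in_pattern = False
    for i, ch in enumerate(message):
        if ch == "[":
            if in_pattern:
                # The previous pattern was never closed; recovery can start here.
                errors.append(i)
            in_pattern = True
        elif ch == "]":
            in_pattern = False
    return errors


def curly_depth_at_end(message: str) -> int:
    """Brace nesting depth left open at the end of the message."""
    depth = 0
    for ch in message:
        if ch == "{":
            depth += 1  # could be a pattern or a placeholder; the scanner cannot tell
        elif ch == "}":
            depth = max(0, depth - 1)
    return depth


print(unclosed_square_patterns("one [aaa {1} few [bbb] * [ccc]"))  # [17]
print(curly_depth_at_end("one {aaa {1} few {bbb} * {ccc}"))        # 1
```

In the curly-brace case the scanner only learns at the very end that something was left open, and has no position to report.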

I understand that not all of these arguments are equal. I also acknowledge that they are all correct and valid, and that we're in fact discussing which consequences we're willing to accept.

echeran · 2022-06-27T04:11:10Z

My extra arguments against [...] as pattern delimiters:

The examples using [...] for patterns don't address delimiting selector definitions and selector value tuple cases. It would be nice to use [...] for delimiting the sequence of selectors and the sequences of value tuples since [...] is common syntax for sequential data.

My comment about the 3rd argument above in favor of [...] is similar to what you acknowledge for the 2nd such argument -- its need is debatable. For the 3rd argument, we still don't have a response that thinks through alternatives & pros / cons for the unstated problem that syntax error recovery is supposed to solve, as the CLDR-TC+ICU-TC response recorded:

Recoverability for bad syntax
a. General consensus among attendees (but not everyone could be here) is that we are better off making sure that the messages are well formed, rather than complicating the parser. Haven’t seen a use case where recoverability within the parser is essential (as opposed to validation/recovery in localization tools). Also we haven’t had a compelling example where a change in syntax would “improve” recoverability.
b. However, not everyone could be there, so leave this open to further investigation for MF2.0.

If we consider the alternative of "validation/recovery in localization tools" that would catch and prevent such errors before they happen at runtime (which is after the fact), we can also consider that we can iterate on error reporting independently of the syntax. For example, in your hypothetical syntax error output in the [...] above, I'm not sure how a parser would know that the token few was indeed a selector tuple and not a part of the previous pattern. Again, this is related to the selector value tuple case delimiter need. And if parsers can do better (and I suspect that would also be possible in the hypothetical {...} case), that can happen over time by people working on implementations regardless of the choice here.

stasm · 2022-06-27T07:19:27Z

My extra arguments against [...] as pattern delimiters:

The examples using [...] for patterns don't address delimiting selector definitions and selector value tuple cases. It would be nice to use [...] for delimiting the sequence of selectors and the sequences of value tuples since [...] is common syntax for sequential data.

This isn't an argument against using [...] as pattern delimiters, but instead an argument for using [...] to delimit variant keys. However, if square brackets work well someplace else, we should use them someplace else, and find a different syntax for variant keys.

Also, arguably, a pattern is sequential data too: it's an ordered sequence of pattern parts.

My comment about the 3rd argument above in favor of [...] is similar to what you acknowledge for the 2nd such argument -- its need is debatable. For the 3rd argument, we still don't have a response that thinks through alternatives & pros / cons for the unstated problem that syntax error recovery is supposed to solve, as the CLDR-TC+ICU-TC response recorded:

Recoverability for bad syntax
a. General consensus among attendees (but not everyone could be here) is that we are better off making sure that the messages are well formed, rather than complicating the parser. Haven’t seen a use case where recoverability within the parser is essential (as opposed to validation/recovery in localization tools). Also we haven’t had a compelling example where a change in syntax would “improve” recoverability.
b. However, not everyone could be there, so leave this open to further investigation for MF2.0.

I'm using recoverability from errors as one potential benefit of having a syntax with good separation of concerns, clear rules, no overloading, and no exceptions. From the note that you highlighted, it sounds like we should talk about this more for the purpose of MF2.0?

For example, in your hypothetical syntax error output in the [...] above, I'm not sure how a parser would know that the token few was indeed a selector tuple and not a part of the previous pattern.

The few variant in my example is probably lost as well, but * [ccc] can be salvaged.

Again, this is related to the selector value tuple case delimiter need.

I think it's only related if we add one more syntax character after the opening {, kind of like #277 was suggesting. Otherwise, even delimited keys will look like either part of the pattern (since the stated goal is to not have to escape their delimiters), or like another placeholder (if we use {}).

And if parsers can do better (and I suspect that would also be possible in the hypothetical {...} case), that can happen over time by people working on implementations regardless of the choice here.

Well, some things will simply not be possible given any kind of syntax, and our choices influence how many such things there are.

stasm · 2022-06-27T18:59:42Z

During the plenary meeting today, I acknowledged that the majority of the WG is in favor of using {...} to delimit patterns. Even though I still think there are merits to using different syntax to denote different things, I also see the benefits of removing the need to escape one additional class of delimiters inside patterns, and only making {} be special.

I think that the discussion about keywords in #286 and the proposed implementation in #287 have made it easier for me to accept {...} as pattern delimiters. In my opinion, the keywords improve the readability of the syntax to a point where it's hopefully easy enough to understand what's going on. In fact, I don't necessarily see the need to use square brackets in the syntax at all, not even to delimit selector and variant keys tuples (#253).

The consensus reached during the meeting was to:

Wait for @aphillips's approval in Use keywords in syntax #287.
Open a PR to change pattern delimiters from [...] to {...}. (I'll do it.)
In order to unblock the tech preview, keep variant keys not delimited (same as right now), on the assumption that (a) multiple selectors are rare, and that (b) we prefer consistency between the syntax for one selector and multiple selectors over creating alternate syntaxes that coexist. This allows us to remove square brackets from the set of syntax characters entirely.

Closes unicode-org#255.

* Use curly braces to delimit patterns. Closes #255.
* Reword "nested" to "inner". Make it clearer that there's no deep nesting possible.

Co-authored-by: Eemeli Aro <[email protected]>

mihnita added the syntax Issues related with syntax or ABNF label May 11, 2022

stasm mentioned this issue May 12, 2022

Add syntax proposal with EBNF #230

Merged

romulocintra added the blocker Blocks the release label May 16, 2022

romulocintra added this to the Technical Preview milestone May 16, 2022

stasm mentioned this issue May 23, 2022

Start in "code" mode exclusively #256

Closed

stasm mentioned this issue Jun 1, 2022

Variants: should variant keys be delimited, too? #253

Closed

eemeli mentioned this issue Jun 22, 2022

Use keywords in syntax #287

Merged

echeran mentioned this issue Jun 23, 2022

Drop restriction on using keywords in the syntax #286

Closed

stasm added a commit to stasm/message-format-wg that referenced this issue Jun 28, 2022

Use curly braces to delimit patterns

8f424cc

Closes unicode-org#255.

stasm mentioned this issue Jun 28, 2022

Use curly braces to delimit patterns #288

Merged

romulocintra closed this as completed Jul 18, 2022

romulocintra mentioned this issue Jul 18, 2022

Escaping: escaping when a message is stored in a general purpose container #236

Closed

echeran mentioned this issue Feb 17, 2023

Add explicit whitespace definitions #344

Merged
