Escaping: escaping when a message is stored in a general purpose container #236

mihnita · 2022-05-11T20:19:00Z

No description provided.

zbraniecki · 2022-05-11T20:50:06Z

What's the problem here?

aphillips · 2022-05-11T21:06:02Z

Is this asking what the escape syntax of MF is?

I presume that the storage format's escaping is removed by that format's reader, e.g. a Java program storing the pattern as a String would interpret \u20ac into a Euro sign before MF ever got to see it.
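A minimal sketch of that point, written in Python with JSON as the storage format (both are assumptions chosen for illustration, and `load_message` is a hypothetical helper name): the container's reader resolves `\u20ac` before any message parser sees the string.

```python
import json

def load_message(serialized: str) -> str:
    """Decode one JSON string value; JSON's own escapes are resolved here."""
    return json.loads(serialized)

# On disk, the pattern contains the six characters \u20ac (a JSON escape).
stored = r'"Price: \u20ac{amount}"'
in_memory = load_message(stored)

# The in-memory pattern holds a real Euro sign, and no backslash is left
# for a message parser to interpret.
assert "\u20ac" in in_memory
assert "\\" not in in_memory
```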

mihnita · 2022-05-12T19:24:38Z

Comments migrated from the slides

Slides comment, Mihai Nita (@mihnita), 11:51 AM Apr 21

Are the escapes in literals identical to the ones inside placeholders?

If yes, why? Saying "yes" can drag in unnecessary implications.

Slides comment, Eemeli Aro (@eemeli), 12:37 PM Apr 21

Could you clarify what you mean by "unnecessary implications"?

Slides comment, Mihai Nita (@mihnita), 12:30 PM Apr 23

Makes escaping more complex.
Not so much in parsing, but for the regular user writing the message.

If the escape rules depend on the current state it means they don't "leak" and create noise outside that state.

Take for example the MF1.
...{exp,date,"MMM d 'at' h:mm a"} ...

The pattern inside the placeholder is passed to DateFormat, which requires the ' around 'at' (otherwise "a" is interpreted as an am/pm field).
And then there are rules on how to escape the ' for MessageFormat.

That requirement to escape the ' inside the placeholder "leaked" outside, requiring doubling the ' in the plain text part.
Which was a continuous PITA.
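To make the "leak" concrete, here is a simplified sketch modelled loosely on the apostrophe rule documented for the JDK's `java.text.MessageFormat` (a single `'` toggles quoting, `''` is a literal apostrophe). It is an illustration in Python, ignores placeholders entirely, and `mf1_text` is a hypothetical name, not an existing API.

```python
def mf1_text(pattern: str) -> str:
    """Resolve MF1-style apostrophes in plain text (simplified model)."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "'":
            if i + 1 < len(pattern) and pattern[i + 1] == "'":
                out.append("'")  # doubled apostrophe -> one literal apostrophe
                i += 2
                continue
            # A lone apostrophe only opens or closes a quoted run and is dropped.
        else:
            out.append(ch)
        i += 1
    return "".join(out)

# Plain text has to double its apostrophes, purely because ' is special
# for quoting elsewhere in the pattern.
assert mf1_text("Don''t stop") == "Don't stop"
assert mf1_text("Don't stop") == "Dont stop"
```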

===

In HTML there is no good reason to escape " and ' in plain text, and there is no good reason to escape < and > in the values of the attributes.

So most browsers will even ignore those rules and do the right and intuitive thing:

<p> Don't "escape" things, <img src="..." alt="No need to escape < or > in here">!</p>

Even escaping & is often unnecessary.

===

It is easier to read / write a message (as a human, not a machine) if the escape rules are limited by scope (in text, in a literal, values of the options, maybe selector keys, etc).

Take a literal:
This {"OK to use { here", ...}...
There is no need to escape { in the literal.

And a {lst, listFormat, start="{" end="}" sep=", "} items.

There is no need to escape { } in the values of the options.

Or maybe in the future a custom function:

{count :ranges
"=1" {a single egg}
"[1, 20)" {a bunch of eggs}
"[20, 100)" {a lot of eggs}
_ {countless eggs}
}

There is no good reason to escape "[" inside the key.

===

"Global" escaping rules produce unnecessary "noise escapes"
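A rough sketch of the difference, in Python. The character sets below are hypothetical and chosen only to illustrate the argument; they are not taken from any spec.

```python
def escape_global(s: str) -> str:
    """One rule set everywhere: escape anything that is special in any scope."""
    special = '\\{}[]"'
    return "".join("\\" + c if c in special else c for c in s)

def escape_scoped(s: str, scope: str) -> str:
    """Escape only what is special in the given scope (hypothetical sets)."""
    special = {"text": "\\{}", "literal": '\\"'}[scope]
    return "".join("\\" + c if c in special else c for c in s)

sample = 'He said "use [1, 20) or {x}"'

noisy = escape_global(sample)          # escapes ", [, {, } and " again
quiet = escape_scoped(sample, "text")  # escapes only { and }

assert noisy.count("\\") == 5
assert quiet.count("\\") == 2
```

With scoped rules, the quotes and the bracket in plain text stay readable as written.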

Slides comment, Eemeli Aro (@eemeli), 2:25 AM Apr 24

Thank you for the clarification. I agree that e.g. our need to escape \" inside a "quoted literal" should by no means force all " to be escaped outside quoted literals.

Conversely, do you see harm in allowing \" escapes outside quoted literals? Meaning that in the text part of a message, " and \" would both be recognised as representations of the same character.

Slides comment, Mihai Nita (@mihnita), 4:09 PM Apr 24

Accepting both \" and " seems sloppy.
Yes, a parser can be tolerant and accept both.
But I don't think it is good design to have that in the syntax.

Slides comment, Mihai Nita (@mihnita), 4:32 PM Apr 24

... Removed first part of the comment, talking about escaping [ ]

...

That's also the reason you don't want to mix escaping conventions, and try (as much as possible) to leave that to the storage.

Imagine a properties file where some strings require an &#x4533; escape, some support &eacute;, some require \u4533, and some \u{4533}.

Depending on what API is used on that string after loading.

mihnita · 2022-05-12T19:34:38Z

Comments migrated from the slides

Slides comment, Mihai Nita (@mihnita), 11:47 AM Apr 21

\uXXXX, \n and \t don't belong here.
They are specific to the "container format"

For the "in memory syntax" they should be already resolved.

Slides comment, Eemeli Aro (@eemeli), 5:50 AM Apr 22

You may be right; dropped them from here. We'll still need at least \u and \U within {"quoted literals"}.

Slides comment, Mihai Nita (@mihnita), 11:01 AM Apr 23

Even there, I don't think so. These are usually resolved by the "storage layer"

When you access the DOM in JS the &#x0913; is gone. \uXXXX is gone at runtime in Java properties, and in C strings. And so on.

This is usually resolved by the compiler (C/C++, Java code) or at load time (HTML, Java properties).
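A small illustration, using Python's standard `html.unescape` as a stand-in for an HTML loader and a Python string literal as a stand-in for compiled C/Java code (both stand-ins are assumptions for the sake of the example):

```python
import html

# Load-time resolution: character references disappear when the
# content is loaded.
stored_html = "&#x0913; &eacute;"
loaded = html.unescape(stored_html)
assert loaded == "\u0913 \u00e9"
assert "&" not in loaded

# Compile-time resolution: the escape in the source literal is a
# single character at runtime.
from_source = "caf\u00e9"
assert len(from_source) == 4
assert "\\" not in from_source
```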

First item in my "MessageFormat syntax: requirements / thoughts [MIH]":
"Don’t mix syntax concerns with serialization storage concerns"

And there is a reason why that's first.

mihnita · 2022-05-16T02:41:13Z

In this bug I've only captured the bullets as documented by Stas after the slides.
So I don't want to mess up the "description" with my opinion.

For my take you can check my "MessageFormat syntax: requirements / thoughts" doc, that I've shared before the syntax was proposed.

Here is a copy-paste for convenience (sorry, it is a bit long)

Don’t mix syntax concerns with serialization storage concerns

What do I mean by this?

Design the syntax that is passed in the in-memory string to the message format parser.

Similar to C/C++/Java handling of \ (in \0303\0251, \xC3\xA9, \x00e9, \n, \t, \\).
Or HTML handling of &eacute; (named entities) / &#xE9; (numeric entities).

When the content is loaded and in memory, these are already gone, replaced by the proper characters.

It means that there should not be any escaping in the plain text elements of MessageFormat except for the character that marks the beginning of a placeholder.

The fact that a Unicode character is \u00E9 or &eacute; or &#xE9; or é is a storage serialization concern.

I should use whatever the convention is for the storage I use (.properties, .xml, strings in code that I extract with gettext)
Same for \n, \r, \t, \\.

Mixing concerns means we end up with a mess where translators are supposed to care about escaping even after the string was extracted from the storage format.

And we end up with double encoding (so we need \\\\ or &amp;#x6541;).
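A sketch of how the backslashes pile up, assuming a hypothetical message syntax that escapes a literal backslash as `\\`, stored in a JSON file (JSON is an assumption here; any container with backslash escapes behaves the same way):

```python
import json

displayed = r"C:\temp"  # what the user should see: one backslash

# Layer 1: the message syntax's own escaping doubles the backslash.
message_text = displayed.replace("\\", "\\\\")
assert message_text.count("\\") == 2

# Layer 2: the storage format escapes each of those again.
stored = json.dumps(message_text)
assert stored.count("\\") == 4

# Reading it back only undoes the storage layer.
assert json.loads(stored) == message_text
```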

Related, new lines converted to space, collapsing spaces to one space, left/right trimming spaces, indents, these are all storage serialization concerns.

eemeli · 2022-05-16T09:22:54Z

The current escape rules in the proposed spec are:

message-format-wg/spec/message.ebnf

Lines 54 to 57 in fe595d5

    
    /* Escape sequences */
    Esc ::= '\'
    TextEscape ::= Esc Esc | Esc '[' | Esc ']' | Esc '{' | Esc '}'
    StringEscape ::= Esc Esc | Esc '"'

Is there something that should be added to or removed from these rules, or could this issue be closed?
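For readers who prefer code to EBNF, here is a minimal Python sketch of how a parser might apply the two rule sets above depending on scope. The function and constant names are hypothetical, and rejecting any escape not listed in the grammar is an assumption of this sketch, not something the quoted rules state.

```python
TEXT_ESCAPABLE = set("\\[]{}")   # TextEscape: Esc followed by \ [ ] { }
STRING_ESCAPABLE = set('\\"')    # StringEscape: Esc followed by \ "

def unescape(s: str, escapable: set) -> str:
    """Resolve backslash escapes allowed in one scope; reject all others."""
    out = []
    i = 0
    while i < len(s):
        ch = s[i]
        if ch == "\\":
            if i + 1 >= len(s) or s[i + 1] not in escapable:
                raise ValueError(f"invalid escape at index {i}")
            out.append(s[i + 1])
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)

assert unescape(r"a \{ b \\ c", TEXT_ESCAPABLE) == "a { b \\ c"
assert unescape(r"say \"hi\"", STRING_ESCAPABLE) == 'say "hi"'
```

Under this reading, `\"` is an error in text and `\{` is an error inside a quoted string, so neither scope's escapes leak into the other.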

romulocintra · 2022-07-18T17:04:03Z

Related issues #255 #276

Consensus: we are OK with the current set of escape sequences.

mihnita added the "syntax" label (Issues related with syntax or ABNF) May 11, 2022

mihnita mentioned this issue May 12, 2022

Escaping: do we need Unicode escape sequences? #234

Closed

romulocintra closed this as completed Jul 18, 2022
