
Escaping: escaping when a message is stored in a general purpose container #236


Closed
mihnita opened this issue May 11, 2022 · 7 comments

Comments

@mihnita
Collaborator

mihnita commented May 11, 2022

No description provided.

@mihnita mihnita added the syntax Issues related with syntax or ABNF label May 11, 2022
@zbraniecki
Member

What's the problem here?

@aphillips
Member

Is this asking what the escape syntax of MF is?

I presume that the storage format's escaping is removed by that format's reader, e.g. a Java program storing the pattern as a String would interpret \u20ac into a Euro sign before MF ever got to see it.
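
The same layering can be sketched in Python. This is only an illustration: it assumes JSON as the storage format rather than the Java String mentioned above, and the key `price` and the message text are made up. It shows that the reader of the storage format resolves `\u20ac` before any message parser would receive the pattern.

```python
import json

# Hypothetical stored message: the file content holds the six characters \u20ac.
stored = '{"price": "Total: \\u20ac{amount}"}'

# The JSON reader resolves the container-level escape while loading.
messages = json.loads(stored)
pattern = messages["price"]

# What a message parser would receive: the Euro sign itself, no backslash.
print(pattern)
print("\\" in pattern)
```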

@mihnita
Collaborator Author

mihnita commented May 12, 2022

Comments migrated from the slides


Slides comment, Mihai Nita (@mihnita), 11:51 AM Apr 21

Are the escapes in literals identical to the ones inside placeholders?

If yes, why? Saying "yes" can drag in unnecessary implications.


Slides comment, Eemeli Aro (@eemeli), 12:37 PM Apr 21

Could you clarify what you mean by "unnecessary implications"?


Slides comment, Mihai Nita (@mihnita), 12:30 PM Apr 23

Makes escaping more complex.
Not so much in parsing, but for the regular user writing the message.

If the escape rules depend on the current state it means they don't "leak" and create noise outside that state.

Take for example the MF1.
...{exp,date,"MMM d 'at' h:mm a"} ...

The pattern inside the placeholder is passed to MessageFormat, which requires the ' around 'at' (otherwise the "a" is interpreted as an am/pm field).
And then there are rules on how to escape the ' for DateFormat.

That requirement to escape the ' inside the placeholder "leaked" outside, requiring doubling the ' in the plain text part.
Which was a continuous PITA.

===

In HTML there is no good reason to escape " and ' in plain text, and there is no good reason to escape < and > in the values of the attributes.

So most browsers will even ignore those rules and do the right and intuitive thing:

<p> Don't "escape" things, <img src="..." alt="No need to escape < or > in here">!</p>

Even escaping & is often unnecessary.

===

It is easier to read / write a message (as a human, not a machine) if the escape rules are limited by scope (in text, in a literal, values of the options, maybe selector keys, etc).

Take a literal:
This {"OK to use { here", ...}...
There is no need to escape { in the literal.

And a {lst, listFormat, start="{" end="}" sep=", "} items.

There is no need to escape { } in the values of the options.

Or maybe in the future a custom function:

{count :ranges
"=1" {a single egg}
"[1, 20)" {a bunch of eggs}
"[20, 100)" {a lot of eggs}
_ {countless eggs}
}

There is no good reason to escape "[" inside the key.

===

"Global" escaping rules produce unnecessary "noise escapes"


Slides comment, Eemeli Aro (@eemeli), 2:25 AM Apr 24

Thank you for the clarification. I agree that e.g. our need to escape \" inside a "quoted literal" should by no means force all " to be escaped outside quoted literals.

Conversely, do you see harm in allowing \" escapes outside quoted literals? Meaning that in the text part of a message, " and \" would both be recognised as representations of the same character.


Slides comment, Mihai Nita (@mihnita), 4:09 PM Apr 24

Accepting both \" and " seems sloppy.
Yes, a parser can be tolerant and accept both.
But I don't think it is good design to have that in the syntax.


Slides comment, Mihai Nita (@mihnita), 4:32 PM Apr 24

... Removed first part of the comment, talking about escaping [ ]

...

That's also the reason you don't want to mix escaping conventions, and try (as much as possible) to leave that to the storage.

Imagine a properties file where some strings require an &#x4533; escape, some support &eacute;, some require \u4533, and some \u{4533}, depending on which API is used on that string after loading.

@mihnita
Collaborator Author

mihnita commented May 12, 2022

Comments migrated from the slides


Slides comment, Mihai Nita (@mihnita), 11:47 AM Apr 21

\uXXXX, \n and \t don't belong here.
They are specific to the "container format"

For the "in memory syntax" they should be already resolved.


Slides comment, Eemeli Aro (@eemeli), 5:50 AM Apr 22

You may be right; dropped them from here. We'll still need at least \u and \U within {"quoted literals"}.


Slides comment, Mihai Nita (@mihnita), 11:01 AM Apr 23

Even there, I don't think so. These are usually resolved by the "storage layer"

When you access the dom in JS the &#2323; is gone. \uXXXX is gone at runtime in Java properties, and in C strings. And so on.

This is usually resolved by the compiler (C/C++, Java code) or at load time (HTML, Java properties).

First item in my "MessageFormat syntax: requirements / thoughts [MIH]":
"Don’t mix syntax concerns with serialization storage concerns"

And there is a reason why that's first.

@mihnita
Collaborator Author

mihnita commented May 16, 2022

In this bug I've only captured the bullets as documented by Stas after the slides.
So I don't want to mess up the "description" with my opinion.


For my take you can check my "MessageFormat syntax: requirements / thoughts" doc, that I've shared before the syntax was proposed.

Here is a copy-paste for convenience (sorry, it is a bit long)


Don’t mix syntax concerns with serialization storage concerns

What do I mean by this?

Design the syntax that is passed in the in-memory string to the message format parser.

Similar to C/C++/Java handling of \ (in \0303\0251, \xC3\xA9, \x00e9, \n, \t, \\).
Or HTML handling of &eacute; (named entities) / &#xE9; (numeric entities).

When the content is loaded and in memory, these are already gone, replaced by the proper characters.

It means that there should not be any escaping in the plain text elements of MessageFormat except for the character that marks the beginning of a placeholder.

The fact that a Unicode character is written as \u00E9 or &#x00E9; or &eacute; is a storage serialization concern.

I should use whatever the convention is for the storage I use (.properties, .xml, strings in code that I extract with gettext)
Same for \n, \r, \t, \\.

Mixing concerns means we end up with a mess where translators are supposed to care about escaping even after the string was extracted from the storage format.

And we end up with double encoding (so we need \\\\ or &amp;#x6541;).
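
A minimal Python sketch of that double encoding, under stated assumptions: the message syntax itself escapes a backslash as two backslashes, and JSON and HTML stand in for the storage formats. The sample strings are invented for illustration.

```python
import html
import json

# Message-level text for the path C:\temp, assuming the message syntax
# requires a literal backslash to be written as two backslashes.
in_memory = r"C:\\temp"

# Storing it in JSON doubles every backslash again: four in the file.
stored_json = json.dumps(in_memory)
print(stored_json)

# Same effect with HTML: a message that carries its own numeric entity
# has its ampersand escaped again by the storage layer.
message_with_entity = "&#x6541;"
stored_html = html.escape(message_with_entity)
print(stored_html)

# Two rounds of decoding are needed to get back to the actual character.
decoded_once = html.unescape(stored_html)
decoded_twice = html.unescape(decoded_once)
```

If the message syntax left these characters to the storage layer, only one round of escaping and decoding would be involved.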

Related, new lines converted to space, collapsing spaces to one space, left/right trimming spaces, indents, these are all storage serialization concerns.

@eemeli
Collaborator

eemeli commented May 16, 2022

The current escape rules in the proposed spec are:

/* Escape sequences */
Esc ::= '\'
TextEscape ::= Esc Esc | Esc '[' | Esc ']' | Esc '{' | Esc '}'
StringEscape ::= Esc Esc | Esc '"'

Is there something that should be added to or removed from these rules, or could this issue be closed?
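
One way to read the rules above is as two scope-specific escape sets. The sketch below is not taken from the spec or any implementation: the helper `unescape` and the two set names are hypothetical, it works on an already isolated text or string fragment rather than a full message, and rejecting any other escape with an error is an assumption.

```python
# Characters that may follow a backslash in each scope, per the rules above.
TEXT_ESCAPABLE = frozenset("\\[]{}")   # TextEscape
STRING_ESCAPABLE = frozenset('\\"')    # StringEscape


def unescape(fragment, escapable):
    """Resolve backslash escapes in one fragment using one scope's rules."""
    out = []
    i = 0
    while i < len(fragment):
        ch = fragment[i]
        if ch == "\\":
            # A backslash must be followed by a character allowed in this scope.
            if i + 1 >= len(fragment) or fragment[i + 1] not in escapable:
                raise ValueError(f"invalid escape at index {i}")
            out.append(fragment[i + 1])
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


text_result = unescape(r"use \{ and \} freely", TEXT_ESCAPABLE)
string_result = unescape(r'say \"hi\"', STRING_ESCAPABLE)
```

With this split, `\"` is accepted only inside a quoted string and `\{` only in text, so neither scope's escapes add noise to the other.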

@romulocintra
Collaborator

Related issues #255 #276

Consensus: we are OK with the current set of escapes.


5 participants