Skip to content

Consider escaping by doubling the special characters #346

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
stasm opened this issue Feb 14, 2023 · 9 comments
Closed

Consider escaping by doubling the special characters #346

stasm opened this issue Feb 14, 2023 · 9 comments
Labels
blocker-candidate The submitter thinks this might be a block for the next release resolve-candidate This issue appears to have been answered or resolved, and may be closed soon. syntax Issues related with syntax or ABNF

Comments

@stasm
Copy link
Collaborator

stasm commented Feb 14, 2023

Text is defined such that backslashes in the text must be escaped, e.g. \u12AB is required to be \\u12AB or \n is required to be \\n. I'm okay with requiring { and } literals to be escaped.

I am nervous that we're recreating the apostrophe disaster in MF1 by requiring every backslash to be escaped. Developers use e.g. \n or other escapes in strings all the time. Making double escapes out of them doesn't seem to be required except to allow for the literal \{ or \} to appear in a string.

I think we might be better off by making the placeholder markers self-escaping, e.g. if you want to have a literal { use {{. { and } are rare enough in real text that this might be much less of an impact than requiring every backslash to be doubled.

Originally posted by @aphillips in #344 (comment)

@eemeli
Copy link
Collaborator

eemeli commented Feb 14, 2023

I think either this isn't a problem, or I'm missing something. Our base assumption is that by the time we're parsing a string with the MF2 syntax grammar, it's already been processed for character escapes by its container.

So for instance something like

new Intl.MessageFormat("{\u12AB}")

would rely on the JS string escape handling to present {ካ} as the message source, which would format as . Adding one more slash

new Intl.MessageFormat("{\\u12AB}")

would try to parse {\u12AB} and result in a syntax error, while

new Intl.MessageFormat("{\\\\u12AB}")

would get us to parse {\\u12AB}, and format that as \u12AB.

In other containers (e.g. .properties files) we would need the same amount of escaping, but in a dedicated MF2 resource syntax we could (and should!) define character sequences like \\ and \{ to be resource-level escapes for \\ and \{ rather than \ and {, so their slashes would not need to get doubled.

@aphillips
Copy link
Member

Maybe character escapes are not the best example. However, there are plenty of micro-syntaxes that use backslash as an escape. File level escapes such as \u might be removed by the deserializer, but some escapes need to reach the MF2 processor and, indeed, survive through that to reach the next level of processing.

It would be entirely reasonable to impose our escape regime just as currently written.

However developers, translators, tools, and such need to be able to easily write strings (in a myriad of programming languages and runtime environments) with a minimum of arcane rules. We don't provide any other escaping mechanism or reserve any other characters besides { and }. If we reserve \, though,developers (and translators and tools...) need to navigate any \ literals consumed post-MF2 (including escaping them for pre-MF2 serialization.

An alternative would be to only reserve escape-escape (\\) inside our constructs. The only need for \\ outside is to allow a literal sequence like:

new Intl.MessageFormat("\\{$foo}\\")

... in which the output looks something like \foo's value\. Hilariously I had to type four backslashes above because GH markdown reserves \ as an escape. If the output of my MF2 were markdown, I'd have to create a resource like: \\\\\\\\{$foo}\\\\\\\\ because:

  • \\\\\\\\ => \\\\ (removed by my storage layer)
  • \\\\ => \\ (removed by MF2)
  • \\ => \ (removed by my templating system, markdown in this case)

Am I being too paranoid?

@aphillips aphillips added syntax Issues related with syntax or ABNF blocker-candidate The submitter thinks this might be a block for the next release labels Feb 14, 2023
@eemeli
Copy link
Collaborator

eemeli commented Feb 14, 2023

A couple of incomplete thoughts:

  • We can get rid of at least the storage-level doubling by defining our own resource format, at least for its users.
  • If we're considering a structured target like markdown, which isn't really a flat string, the "right" approach ought to be built not around string output, but on some flavour of a formatted-parts sequence that could be transformed into markup, if necessary. That transformation would then know to e.g. double \ characters in text blocks to work around the templating system that's the next consumer of the formatted message.
  • We also need to escape ( and ) inside literals.

@aphillips
Copy link
Member

We also need to escape ( and ) inside literals.

Actually, we are only required to escape the closing character )

If we're considering a structured target like markdown, which isn't really a flat string, the "right" approach ought to be built not around string output, but on some flavour of a formatted-parts sequence that could be transformed into markup, if necessary. That transformation would then know to e.g. double \ characters in text blocks to work around the templating system that's the next consumer of the formatted message.

There are so many templating micro-syntaxes and resources serve these in so many different ways. My example of markdown as just because it was handy. I'm trying to be conservative about what we do in order to avoid developers and translators becoming confused. And I would be okay with keeping the status quo if others feel I'm being too paranoid.

@mihnita
Copy link
Collaborator

mihnita commented Feb 21, 2023

I don't even think we require escaping ) in message strings. If we do, we should not.

We require escaping only for "things that exit the current state"
So we should escape } in text, and ) in literals (placeholder names, parameters).
I even remember a long discussion on this, I can look for it.

Current EBNF:

Text ::= (TextChar | TextEscape)+
...
Literal ::= '(' (LiteralChar | LiteralEscape)* ')' /* ws: explicit */
...
Esc ::= '\'
TextEscape ::= Esc Esc | Esc '{' | Esc '}'
LiteralEscape ::= Esc Esc | Esc '(' | Esc ')'

@mihnita
Copy link
Collaborator

mihnita commented Feb 21, 2023

e.g. \u12AB is required to be \u12AB or \n is required to be \n.

They are currently not required to be escaped, not by MF2.
Yes, they might be required by the "previous layer" (code, or serialization):

String msg = "{The \u1234 and \n don't make it to MF2. But to protect curly bracket (\\{) or backslash (\\\\) we need escaping, I agree}";

What MF2 sees is "{The ሴ and
don't make it to MF2. But to protect curly bracket (\{) or backslash (\\) we need escaping, I agree}"


In other words: what Eemeli said :-)

@stasm
Copy link
Collaborator Author

stasm commented Feb 21, 2023

I don't even think we require escaping ) in message strings. If we do, we should not.

I think I remember some discussion about escaping both ( and ) to make it similar to how both { and } are escaped in text. However, perhaps we don't need to rehash this discussion if instead we agree to change the literal delimiter to something else than round parens. See #263.

We require escaping only for "things that exit the current state" So we should escape } in text, and ) in literals (placeholder names, parameters).

We also need to escape { in text. Otherwise it starts a placeholder.

@aphillips
Copy link
Member

The reason for escapes is to allow a syntax meaningful character to appear inside the syntax.

The literal end sigil needs to be escaped because the end sigil might need to appear in a literal, e.g.:

when (you have embedded (parentheticals\)) {You don't want to generate a syntax error}

The literal start sigil cannot appear anywhere as a starter where it is not already delimited by placeholder (in other words, { and }) or by the syntax itself (in other words, a when (literal) statement). This is moot if we adopt using | as both start and end. We can also provide it for "completeness" so that it is not a syntax error to escape the start sigil even though the escape is not necessary.

The other sigils { and } are used in a variety of our productions and both the start and end sigil need to be escapable to allow these characters to appear in text.

Note that { and } are protected inside a literal. Consider:

when ({$foo}) {this matches the string \{$foo\}, not the _value_ foo}

I raised this issue because, when we use backslash as the escape character, we also have to provide a way for it to appear in literals and text. And we know that programming languages and templating languages and all sorts of things use backslashes as escapes (it's why we're doing it).

My experience is that nesting formats are super common and quickly become super confusing. Our syntax will be embedded into different implementers resource format. It will also be used to store strings used by a myriad of developers in many different runtime environments. Each of these will have their own syntactical goo which developers will need to embed into strings.

We will no doubt cause some people pain when, like us, their templating language has chosen to make { and } into sigils. We can't help that. But if we can help avoid the need for developers to add more levels of escaping without greatly compromising our processing, we should look out for the opportunity.

I am willing to accept that we use \, but we need to be mindful of the impact on users.

@aphillips aphillips added the resolve-candidate This issue appears to have been answered or resolved, and may be closed soon. label Jun 16, 2023
@aphillips
Copy link
Member

Closing per 2023-06-19 telecon discussion

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
blocker-candidate The submitter thinks this might be a block for the next release resolve-candidate This issue appears to have been answered or resolved, and may be closed soon. syntax Issues related with syntax or ABNF
Projects
None yet
Development

No branches or pull requests

4 participants