Develop a resource syntax together with the message syntax #265

Closed
eemeli opened this issue May 13, 2022 · 20 comments
Labels
out-of-scope? resolve-candidate This issue appears to have been answered or resolved, and may be closed soon. syntax Issues related with syntax or ABNF

Comments

@eemeli
Collaborator

eemeli commented May 13, 2022

This is forked from #263, which went a bit off-topic but in a constructive way.

Quoting @aphillips from #263 (comment):

The discussion of a syntax without consideration for the serialization form makes me nervous that we'll have to embed our nifty new syntax in an impenetrable layer of syntactic goo from any eventual resource format. Maybe we should reconsider and just "bite the bullet" to define the "source format", which can then be consumed/compiled into a runtime format?

And @zbraniecki from #263 (comment):

I am aligned with you that we should draft the "MF2 Resource" proposal before we freeze MF2 Message Format.

@eemeli eemeli added the syntax Issues related with syntax or ABNF label May 13, 2022
@mihnita
Collaborator

mihnita commented May 16, 2022

I think that if we design the syntax for MF2 to work nicely with the most popular formats out there (.xml / .html, .json, .properties, .strings, .rc, embedded in code / gettext) then we will have little friction with an MF2 Resource format.

Claiming otherwise is a red flag for me.
It means we want to design a resource format that is like nothing else out there.

So, should we have it in mind?
Sure.
But it should not be a blocker, a prerequisite for MF2, or a reason to sacrifice compatibility with other formats.

@markusicu
Member

I feel pretty strongly that this WG, and the CLDR spec portion that it is tasked to produce, should not define a resource file format. This WG should define a data model, a syntax that is usable in many places, and a function registry.

There are lots of resource file formats, and database structures etc., defined by lots of organizations and projects to fit their needs and workflows. We will not replace existing formats, and we should not design something that requires a particular format.

We should focus on message strings that can be carried reasonably easily in a wide variety of such formats and systems.

@macchiati FYI

@zbraniecki
Member

I feel pretty strongly that this WG (...) should not define a resource file format.

We are in agreement. This WG is scoped down to the per-message syntax and API.

The request in this issue is to recognize that a separate group should explore resource level syntax and the insight from that work should impact the design of this WG's work on per-message syntax.

In other words, I advocate against serialization of work where we would finalize per-message syntax and then start looking into resource syntax. I think such an approach would miss an opportunity to inform per-message syntax with the needs of per-resource syntax, and would limit the quality of per-resource syntax in areas such as resilience, readability, recovery, meta information etc.

How much time this WG should give to receive an insight into per-resource one, I'm not sure yet, and I don't want to block on it indefinitely. I hope we can overlap those two workstreams and design reinforced synergy between them.

@eemeli
Collaborator Author

eemeli commented Jun 7, 2022

Would it be appropriate to ask remit from the CLDR-TC to start a separate subgroup to discuss the resource syntax?

@zbraniecki
Member

Would it be appropriate to ask remit from the CLDR-TC to start a separate subgroup to discuss the resource syntax?

I think it would be a good course of action.

I'm reluctant to push it further than advisory because I don't like asking for work that I do not have cycles to commit. I am not at the moment able to commit my time to work on the resource syntax, so I'm merely indicating that serializing those two steps is imho a recipe for a bad design on both sides.
I hope to secure some time to help, but if you have cycles to start it, I'd appreciate the conversation to begin now.

@mihnita
Collaborator

mihnita commented Jun 8, 2022

I would think that if we keep in mind a rich set of existing formats (.properties, .json, .strings, .rc, xml, html, yaml, hard-coded strings & .po (gettext), maybe a few more), we should be fine.

I would find it a bit worrisome if some newly invented l10n format designed for mf2 introduces "revolutionary concepts" that don't already exist in the existing formats.
Note that high-level concepts like groups don't conflict with existing formats: xml, .rc, json, and yaml all support groups and subgroups.

That is independent from low-level concerns like "how do we escape newline"

@zbraniecki
Member

I would think that if we keep in mind a rich set of existing formats (...), we should be fine.
I would find it a bit worrisome if some newly invented l10n format designed for mf2 introduces "revolutionary concepts" that don't already exist in the existing formats.

I disagree with that assessment for two reasons:

  1. Most of the formats you listed are 2-pass container formats - JSON, XML, YAML etc. are not l10n file formats. Similarly you can encode CSS rules in XML, JSON, TOML, YAML etc. but it will not expose any of the problems of handling CSS rules in CSS format.
  2. The formats that are actually 1-pass formats - gettext, .properties, .strings - are very simple compared to what we do in MF2.0. They don't have variants, selectors etc. They have very limited or no support for comments and meta-information. The placeholder handling is nonexistent or very limited. The resilience, readability, and editability of such formats are completely different and cannot be assumed to carry over to MF2.

Note that high-level concepts like groups don't conflict with existing formats: xml, .rc, json, and yaml all support groups and subgroups.

That's for the resource format WG to decide. Depending on how such a resource format decides to handle message storage, meta-information storage, groups, and relations, they may well conflict.

I'm particularly concerned by what I see as the dismissive tone of the message in relation to my belief that this is a significant space to explore, one that should be able to alter per-message syntax and thus should be explored prior to the per-message syntax freeze.

@mihnita
Collaborator

mihnita commented Jun 9, 2022

Then feel free to add other existing formats?

Yes, json / xml / yaml are not l10n file formats. In fact I would argue they are not file formats, they are "meta-formats",
in that they can store things with structure, but the structure is yet to be defined.

So there are in fact l10n formats based on xml and json, if you want to split hairs.


I don't understand what you mean by one pass / 2 pass.
How are xml / json different than .properties? What stops one from doing json in one step?
Just the "laziness" of implementing a json parser from scratch?

are very simple compared to what we do in MF2.0. They don't have variants, selectors etc.

They have nothing to do with that?
Orthogonal concerns?
We should not mix storage format and content.
MF1 has selectors, and can be represented in .properties, gettext, .strings, etc.

... dismissive tone of the message in relation to my belief that ...

Sorry, that was not the intent.
But it would help if you can provide some example of what you envision that is not already seen before, in existing formats.

I find suspect the idea of a storage syntax that would "alter per-message syntax".
It might be great if we design a syntax for ECMAScript. Or for Android. Or for anything else.

But if we design a Unicode format that wants to be universally adopted then I find the idea suspect.

Especially since I don't see an example of what that would look like, other than "we have to wait and see"
That's why I might sound dismissive (but I don't intend to be)
I'm fine (up to a point) to say "let's delay declaring the syntax until we research that other thing"
But when someone asks for budget for research they have to show what the direction of that research is, and what they can hope to uncover that was not known before.

So again: what kind of format you envision that is not already expressible in the existing formats?

@eemeli
Collaborator Author

eemeli commented Jun 9, 2022

So again: what kind of format you envision that is not already expressible in the existing formats?

One important aspect that is not currently expressible in existing resource formats is the explicit association of a comment containing translator-relevant metadata with a message or a group of messages. There are certainly some common practices around this, but those practices are in fact for the most part against the specs of the underlying formats.

@zbraniecki
Member

I don't understand what you mean by one pass / 2 pass.

In a 1-pass file format, the parser parses the syntax of the resource and gets parsed messages directly. In a 2-pass format, it first parses the container format (JSON, XML, TOML, YAML) and then retrieves message strings that another parser parses.

A human interacting with a 2-pass format can introduce errors at either of the two levels.
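
To illustrate the two levels, here is a minimal sketch of a 2-pass reader. It is an illustration only, not MF2 tooling: the `{name}` placeholder syntax, the function names, and the sample resource are all assumptions made for this example.

```python
import json
import re

# Hypothetical placeholder syntax, for illustration only: {name}
PLACEHOLDER = re.compile(r"\{(\w+)\}")


def _literal(text):
    """Return a text part, rejecting stray braces left over after matching."""
    if "{" in text or "}" in text:
        raise ValueError(f"unbalanced brace in message text: {text!r}")
    return ("text", text)


def parse_message(source):
    """Second pass: split one message string into text and placeholder parts."""
    parts = []
    pos = 0
    for match in PLACEHOLDER.finditer(source):
        if match.start() > pos:
            parts.append(_literal(source[pos:match.start()]))
        parts.append(("placeholder", match.group(1)))
        pos = match.end()
    if pos < len(source):
        parts.append(_literal(source[pos:]))
    return parts


def parse_resource_two_pass(raw):
    """First pass parses the JSON container; second pass parses each message."""
    container = json.loads(raw)  # container-level errors surface here
    return {key: parse_message(value) for key, value in container.items()}


parsed = parse_resource_two_pass('{"greeting": "Hello {userName}"}')
```

In this sketch a missing closing quote fails in the first pass with `json.JSONDecodeError`, while a missing closing brace inside the message is valid JSON and only fails in the second pass with a plain `ValueError`.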

We should not mix storage format and content.

What is CSS format? storage or content?

So again: what kind of format you envision that is not already expressible in the existing formats?

The problem I see is not expressiveness. You can encode absolutely anything in JSON and XML.

The problem is how to create a human-readable/editable/writable resource format for MF2.

I believe that a group of 5 MF2 messages, with meta data, variants and multiline content, encoded in JSON will not be readable/editable/writable by a human.
I believe we should create such a resource format, and its syntax considerations should be taken into account when designing the MF2 per-message syntax, so that the resource format can be designed well.

@aphillips
Member

I think my understanding may have evolved somewhat. If we're only creating a pattern string format consumed by the runtime API, then I'm still free to create a resource format over the top of that with rich support for (for example) localization, such as comments and other metadata. This is similar to the existing MessageFormat pattern strings today. Other discussions about escaping and such also lead me to believe that this is what the consensus is arriving at.

I do think that some of our initial discussions/tenets are called into question by this. In particular, I think that the XLIFF binding will be (necessarily) incomplete.

For example, the resource format we had at Amazon has base direction metadata or string- and file-level comments that this spec cannot know about. Still, we were compiling our resource format into the runtime format. This spec would supply that runtime format.

@mihnita
Collaborator

mihnita commented Jun 9, 2022

In a 1-pass file format, the parser parses the syntax of the resource and gets parsed messages directly. In a 2-pass format, it first parses the container format (JSON, XML, TOML, YAML) and then retrieves message strings that another parser parses.

Again, nothing prevents one from writing a parser that does json + messages in one pass, other than saving programming effort.

I don't see any conceptual difference between these 3 formats:

key=The message

vs

<properties>
  <entry key="key">The message</entry>
</properties>

vs

properties: {
   'key': 'The message'
}

The number of passes is an implementation detail.
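
As a rough sketch of that equivalence, the three snippets can be loaded into the same key/value mapping. The loader names are made up for this example, the `.properties` reader is deliberately minimal (no escapes or line continuations), and the JSON source is a strictly valid variant of the third snippet (double quotes, enclosing braces), which is an assumption.

```python
import json
import xml.etree.ElementTree as ET

PROPERTIES_SRC = "key=The message\n"
XML_SRC = '<properties><entry key="key">The message</entry></properties>'
JSON_SRC = '{"properties": {"key": "The message"}}'


def load_properties(src):
    """Minimal key=value reader; skips blank lines and # or ! comment lines."""
    result = {}
    for line in src.splitlines():
        line = line.strip()
        if not line or line[0] in "#!":
            continue
        key, _, value = line.partition("=")
        result[key.strip()] = value.strip()
    return result


def load_xml(src):
    """Read <entry key="...">value</entry> children into a dict."""
    root = ET.fromstring(src)
    return {entry.get("key"): entry.text or "" for entry in root.iter("entry")}


def load_json(src):
    """Read the object stored under the top-level "properties" member."""
    return dict(json.loads(src)["properties"])


from_properties = load_properties(PROPERTIES_SRC)
from_xml = load_xml(XML_SRC)
from_json = load_json(JSON_SRC)
```

All three loaders yield the same mapping for this input. The sketch says nothing about how readable or editable each source form is for a human, which is the separate question raised earlier in the thread.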

What is CSS format? storage or content?

Storage.
Nothing prevents me from inventing an XML- or JSON-based format that represents the exact same content.
The fact that there is no alternative format (yet?) does not mean it can't be done.
In fact I think there were (are?) proposals for binary CSS formats.

@mihnita
Collaborator

mihnita commented Jun 9, 2022

One important aspect that is not currently expressible in existing resource formats is the explicit association of a comment containing translator-relevant metadata with a message or a group of messages. There are certainly some common practices around this, but those practices are in fact for the most part against the specs of the underlying formats.

We can debate if this is "against the specs" of the Java properties:

/* Some comment there
@param userName This is the user name.
*/
msg1 = Hello {userName}

It is very much in the spirit of what Java does in Javadoc.
So I doubt it is "against the spec"
Maybe outside the spec, or "unspecified"

But I will not argue that.
I will instead point to things that are designed exactly for translator-relevant metadata:

Android strings

https://developer.android.com/guide/topics/resources/localization

<string name="countdown">
  <xliff:g id="time" example="5 days">%1$s</xliff:g> until holiday
</string>

The id="time" and example="5 days" attributes are 100% for translation.
They have no runtime role.
In fact they are dropped when compiled to binary form.
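
A minimal sketch of that split between translator-facing and runtime data follows. It is not how the Android build tools are implemented: the function names are invented for this example, and the enclosing `<resources>` element with its `xmlns:xliff` declaration is added here so the fragment parses on its own.

```python
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"

SOURCE = (
    '<resources xmlns:xliff="urn:oasis:names:tc:xliff:document:1.2">'
    '<string name="countdown">'
    '<xliff:g id="time" example="5 days">%1$s</xliff:g> until holiday'
    "</string>"
    "</resources>"
)


def runtime_strings(xml_source):
    """Flatten each <string> to plain text, discarding the xliff:g wrappers."""
    root = ET.fromstring(xml_source)
    return {s.get("name"): "".join(s.itertext()) for s in root.iter("string")}


def translator_hints(xml_source):
    """Collect the attributes of every xliff:g element, keyed by string name."""
    root = ET.fromstring(xml_source)
    g_tag = f"{{{XLIFF_NS}}}g"
    return {
        s.get("name"): [dict(g.attrib) for g in s.iter(g_tag)]
        for s in root.iter("string")
    }


runtime = runtime_strings(SOURCE)
hints = translator_hints(SOURCE)
```

Here `runtime` keeps only the formatted text, while `hints` holds the `id` and `example` attributes that exist purely for translators.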

The ITS W3C standard
https://www.w3.org/TR/its20/

It is designed to work with any XML (and HTML) format.
And it is not at all "against the specs" of the xml format to use namespaces.
It is in fact 100% in the spirit of the format.

@mihnita
Collaborator

mihnita commented Jun 9, 2022

And we (of course) have in-house formats where this kind of meta info for localization is very much standard.
Same as Addison also said about Amazon.

@mihnita
Collaborator

mihnita commented Jun 9, 2022

OK, here is a Google format that is not internal and I can share:

https://github.com/google/app-resource-bundle/wiki/ApplicationResourceBundleSpecification

Take:

  // A message that contains placeholder, referenced by JS code.
  "FOO_123": "Your pending cost is {COST}",
  "@FOO_123": {
    "type": "text",
    "context": "HomePage:MainPanel",
    "description": "balance statement.",
    "source_text": "Your pending cost is {COST}",
    "placeholders": {
       "COST": {
          "example": "$123.45",
          "description": "cost presented with currency symbol"
       }
    },
    "screen": "data:image/png;base64,
iVBORw0KGgoAAAANSUhEUgAAABAAAAAQAQMAAAAlPW0iAAAABlBMVEUAAAD/
//+l2Z/dAAAAM0lEQVR4nGP4/5/h/1+G/58ZDrAz3D/McH8yw83NDDeNGe4U
g9C9zwz3gVLMDA/A6P9/AFGGFyjOXZtQAAAAAElFTkSuQmCC",
    "video": "http://www.youtube.com/ajigliech"
  },

These are 100% meta for localization: description, placeholders / example, placeholders / description, screen, video
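
A consumer can separate the two kinds of entries by the "@" key prefix. The sketch below assumes a trimmed, strictly valid JSON version of the excerpt above (the `//` comment and the screenshot data are dropped); `split_arb` is a name invented for this example, and skipping "@@"-prefixed keys as file-level attributes is an assumption about how a reader might treat them.

```python
import json

ARB_SOURCE = """
{
  "FOO_123": "Your pending cost is {COST}",
  "@FOO_123": {
    "type": "text",
    "context": "HomePage:MainPanel",
    "description": "balance statement.",
    "placeholders": {
      "COST": {
        "example": "$123.45",
        "description": "cost presented with currency symbol"
      }
    }
  }
}
"""


def split_arb(source):
    """Split an ARB-style JSON object into messages and per-message metadata."""
    data = json.loads(source)
    messages, metadata = {}, {}
    for key, value in data.items():
        if key.startswith("@@"):
            continue  # assumed file-level attributes; ignored in this sketch
        if key.startswith("@"):
            metadata[key[1:]] = value  # "@FOO_123" describes "FOO_123"
        else:
            messages[key] = value
    return messages, metadata


messages, metadata = split_arb(ARB_SOURCE)
```

The runtime only needs `messages`; everything in `metadata` is there for translators and tooling.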

@mihnita
Collaborator

mihnita commented Jun 9, 2022

I've checked the "TC Message Format 2.0 Resolution" from 2022-03-31, and there is an entry on this topic:

  1. Message bundles
    a. Not in MF2.0. This is currently done by higher levels (eg XLIFF) in many implementations.
    b. Could consider adding as an optional item in an MF2.X, or as separate Message Bundle spec, or Message Group spec. For discussion in parallel.

@eemeli
Collaborator Author

eemeli commented Sep 23, 2022

A separate working group on message resources is now being bootstrapped.

@aphillips
Member

@eemeli Can I close this?

@aphillips aphillips added resolve-candidate This issue appears to have been answered or resolved, and may be closed soon. out-of-scope? labels Dec 5, 2023
@eemeli
Collaborator Author

eemeli commented Dec 5, 2023

@macchiati Would it be possible to move the resource WG repo under unicode-org in GitHub to more clearly note that this is something we're working on in parallel with but not as a part of the message formatting WG?

@aphillips
Member

@macchiati Pinging to see if we can bring the resource WG somewhere visible. Otherwise, I intend to close this issue.

5 participants