Including resource-level metadata in the syntax #14

Closed
eemeli opened this issue Jul 30, 2023 · 3 comments · Fixed by #15

Comments

@eemeli
Owner

eemeli commented Jul 30, 2023

The result of formatting a message depends on the environment in which the formatting is done. With MF2, at least the following are relevant:

  • The locale of the message
  • The functions that are available as annotations
  • The meanings of any private-use annotations
  • Later, the MF2 version if any reserved annotations are defined

As these attributes are likely to be common to all messages in a single resource, it would probably make sense to include syntax or conventions for their declaration. They might not be needed by the formatting runtime, where their values would be implicit, but they would be invaluable to translators and to automated tools that process messages.

I'm aware of at least the following prior art that may be relevant to consider here:

  • A gettext header entry may include fields such as Language and Plural-Forms that apply to the entire resource. The header entry uses msgid "" to identify itself.
  • The browser extension messages.json files rely on being placed in a well-defined directory structure to identify the locale for their contents.
  • Java ResourceBundle file names encode the locale, so a base Resource.properties would use Resource_de_CH.properties for its de-CH locale.
  • An XLIFF 1.2 <file> element includes at least the source-language attribute, and its other attributes and <header> element may provide significantly more context about the resource.
  • YAML supports the %YAML directive, which defines the YAML version that's used by the document, e.g. %YAML 1.2.

Of the above, browser extensions and ResourceBundles stand out by incorporating their locale information within their file or directory name. I don't think this approach would work well for our purposes, given that not all of the relevant information is easily expressible via a locale identifier.
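
For illustration, here is a minimal Python sketch of the file-name approach, using the Resource_de_CH.properties example above. The function name and the simplified handling (no scripts, variants, or validation of the tag) are assumptions made for this sketch, not Java's actual lookup algorithm. It shows that the file name yields a locale and nothing else, so a schema or version would need to live somewhere else.

```python
from pathlib import PurePath
from typing import Optional


def locale_from_bundle_name(filename: str, base: str) -> Optional[str]:
    """Derive a locale tag from a ResourceBundle-style file name.

    Hypothetical helper: returns None for the base bundle itself.
    """
    stem = PurePath(filename).stem  # drops the ".properties" extension
    if stem == base:
        return None  # the base bundle carries no locale
    prefix = base + "_"
    if not stem.startswith(prefix):
        raise ValueError(f"{filename!r} does not belong to bundle {base!r}")
    # Java-style names separate subtags with "_"; BCP 47 uses "-".
    return stem[len(prefix):].replace("_", "-")


print(locale_from_bundle_name("Resource_de_CH.properties", "Resource"))
```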

I think we should instead do something similar to the other formats, and incorporate metadata into the file using a syntax that's easy to parse (in particular for runtimes that don't care about the metadata), sufficiently expressive, but also extensible for later use cases that are not yet identifiable.

Of the fields I list above, I think the available functions and private-use annotations could be identified together via some "schema", for which we could use an identifier that references an external definition. With that we're left with key-value pairs that ought to each fit into a single line:

  • locale, obviously using a BCP47 identifier
  • schema, defined via URL or some other structured string identifier
  • version, with a numerical string like '2.0' identifying the spec version
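
As a rough sketch of what a tool might do with these three fields, the Python below parses one key-value pair per line. The `key: value` line shape is purely an assumption for illustration, since the encoding is exactly one of the open questions; the class and function names, and the example schema URL, are hypothetical too. Unknown keys are kept rather than rejected, to leave room for later extensions.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional


@dataclass
class ResourceMetadata:
    locale: Optional[str] = None   # BCP 47 identifier
    schema: Optional[str] = None   # URL or other structured identifier
    version: Optional[str] = None  # numerical string such as "2.0"
    extra: Dict[str, str] = field(default_factory=dict)  # fields not yet known


def parse_metadata_lines(lines: Iterable[str]) -> ResourceMetadata:
    """Parse hypothetical single-line "key: value" metadata fields."""
    meta = ResourceMetadata()
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        # Split on the first ":" only, so URL values keep their own colons.
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a key-value line: {raw!r}")
        key, value = key.strip(), value.strip()
        if key in ("locale", "schema", "version"):
            setattr(meta, key, value)
        else:
            meta.extra[key] = value
    return meta


meta = parse_metadata_lines([
    "locale: de-CH",
    "schema: https://example.com/functions",  # placeholder URL
    "version: 2.0",
])
print(meta)
```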

This leaves a couple of open questions that ought to be answered:

  1. How should the information be encoded? Structured data within comments, messages using some predefined keys, or using some new syntax?
  2. Where and how are the schemas identified?
  3. Are there additional fields not yet under consideration that would not fit a simple key-value string shape?

Sidenote: One interesting possibility would be to use something like format instead of version, and to incorporate the content format in the value, via e.g. 'messageformat-2.0'. This would potentially allow for the resource format to also support other message formats.
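
A small Python sketch of how such a combined value could be taken apart, assuming the version is always whatever follows the last hyphen. That splitting rule and the function name are assumptions for illustration only.

```python
from typing import Tuple


def parse_format(value: str) -> Tuple[str, str]:
    """Split a hypothetical format value into (format name, version)."""
    # rpartition splits on the LAST hyphen, so names may contain hyphens.
    name, sep, version = value.rpartition("-")
    if not sep or not name or not version:
        raise ValueError(f"expected '<name>-<version>', got {value!r}")
    return name, version


print(parse_format("messageformat-2.0"))
```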

@eemeli
Owner Author

eemeli commented Aug 1, 2023

Thought of another example of related prior art:

A number of Markdown and other processors such as Pandoc and Jekyll support including a frontmatter section at the top of a document, with YAML contents and separated from the document body by a --- line. This often includes fields like title that pertain to the whole document, and potentially its formatting.

@eemeli
Owner Author

eemeli commented Aug 1, 2023

To expand a bit on the previous, YAML seems like the most common but not universal format for frontmatter with the --- separator; some like Markdoc support pretty much anything. Others detect the format from the contents.

Some systems like Hugo and remark use different indicators for different frontmatter formats. Of these, the only one that appears to have a universal meaning is +++ for TOML.

Others (e.g. 11ty, VuePress) use a keyword after the start indicator for the format, like ---toml.

The --- frontmatter delimiter originates from YAML, where it is the optional directives-end (document-start) marker. I think Pandoc was the first to take up the practice, and they maintain that ... (the YAML document-end marker) should be used to close the block, but as with most other users, --- has become more common as the closing line too.
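
To make the delimiter conventions above concrete, here is a minimal Python sketch that splits a document into frontmatter and body. It recognises --- (optionally followed by a format keyword, as in ---toml), accepts either --- or ... as the closing line for that style, and treats +++ as TOML. Defaulting a bare --- to YAML is an assumption, as is the function name; real processors differ in the details, and this sketch does not parse the frontmatter contents.

```python
import re
from typing import Optional, Tuple


def split_frontmatter(text: str) -> Tuple[Optional[str], str, str]:
    """Return (format, frontmatter, body); format is None if none is found."""
    lines = text.split("\n")
    first = lines[0].rstrip()
    dash = re.fullmatch(r"---([A-Za-z0-9]*)", first)
    if dash:
        fmt = dash.group(1).lower() or "yaml"  # assumed default for bare ---
        closers = ("---", "...")
    elif first == "+++":
        fmt, closers = "toml", ("+++",)
    else:
        return None, "", text
    for i in range(1, len(lines)):
        if lines[i].rstrip() in closers:
            return fmt, "\n".join(lines[1:i]), "\n".join(lines[i + 1:])
    return None, "", text  # no closing delimiter: treat it all as body


print(split_frontmatter("---\ntitle: Hello\n---\nBody\n"))
```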

@eemeli
Owner Author

eemeli commented Aug 2, 2023

In conversation with @zbraniecki elsewhere, he highlighted projectfluent/fluent#139:

In Fluent comments can be attached to one of three levels:

  • Resource
  • Group
  • Message

This proposal was intended to specify resource level metadata, like locale etc. but because of how Fluent approaches comment levels, there's nothing preventing us from doing message or group level meta data (overrides?)

And then regarding the scope of the problem we ought to be looking at here:

My concern about your approach is that you're slicing another thing that I see as continuous - metadata. You separate "resource metadata" as a standalone entity (with a proposed own sigil), while any other metadata would have to be solved separately.

I suggest thinking of any and all semantic meta data in the same way - whether you're annotating a single message, a single variant, a single assignment, a group, a resource.

You can argue that there are types of meta data - for example some may be useful at runtime, other will never be, and separate sigils per that, but I'd suggest expanding the scope of what you're thinking of to also allow for this meta data to be attachable to other levels than just resource.

To which I replied:

Mostly for now I’m trying to identify the scope of the problem(s) around metadata that ought to be solved. This issue is specifically about resource-level metadata because that’s not well supported by the current resource ABNF, and if we agree that it may include fields like locale, it ought not to be shunted into a comment. For metadata attached to smaller scopes, I at least have not yet heard of an argument why or how they could be relevant to a formatting runtime, so I wouldn’t want to conclude at least yet that all metadata fields should or should not be within comments.

The Fluent Semantic Comments proposal relies in part on using different comment prefixes to target the file/group/message level, so with something like that a single message’s metadata field starts with # @ while a resource metadata field would start with ### @. This differentiation by # count does not yet exist in MF2 resources, but in Fluent it allows for a clear distinction between e.g. resource and message metadata. Another relevant difference is that MF2 resources do include explicit sections, so the same “attach to the next thing” logic that works for message comments also applies for section comments. But it doesn’t work at the resource level, so some different approach is required. Right now the ABNF includes this comment on it:

; A first comment in a resource preceding any section-head or entry
; and followed by an empty line attaches to the whole resource.
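
The rule in that ABNF comment can be sketched in Python as follows. The sketch assumes that comment lines start with #, that leading blank lines are ignored, and that reaching the end of the input does not count as "followed by an empty line"; none of these details are confirmed by the ABNF comment itself. The function name and the `key = value` entry in the usage example are hypothetical as well.

```python
from typing import Optional


def resource_comment(source: str) -> Optional[str]:
    """Return the comment that attaches to the whole resource, if any."""
    lines = source.splitlines()
    i = 0
    # Assumption: blank lines before the first comment are ignored.
    while i < len(lines) and not lines[i].strip():
        i += 1
    start = i
    # Collect the first run of consecutive comment lines.
    while i < len(lines) and lines[i].lstrip().startswith("#"):
        i += 1
    if i == start:
        return None  # the first content is not a comment
    # The comment only attaches to the resource if an empty line follows it;
    # otherwise it belongs to the next section-head or entry.
    if i < len(lines) and not lines[i].strip():
        return "\n".join(line.lstrip()[1:].strip() for line in lines[start:i])
    return None


print(resource_comment("# Resource note\n\n# Entry note\nkey = value\n"))
```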

I agree that there is a larger scope including e.g. message-level metadata that needs to be addressed by a solution, but the edges of it are a bit fuzzy and the shape of the solution is not set in stone. For example, while writing this I’ve started to consider whether including a header separator like --- as used by markdown processors at the end of the document’s frontmatter might also make sense for MF2. It would give a syntactical “next thing” that resource-level comments and metadata could target.
