Valid or well-formed language tags? #36

r12a · 2019-09-11T02:34:30Z

Comments arising from self-review at either w3c/json-ld-wg#93 or w3c/pub-manifest#38 (not clear which).

from Matt Garrish:

According to BCP47, a well-formed tag is one that merely conforms to the ABNF. A valid one is well-formed and also uses only registered subtags (plus some other restrictions). https://tools.ietf.org/html/bcp47#section-2.2.9
Are we sure we only want well-formed tags? Our schema looks more attuned to valid.
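
To make "conforms to the ABNF" concrete, here is a minimal sketch of a well-formedness check in Python. It assumes a simplified reading of the BCP 47 `langtag` production and deliberately leaves out grandfathered tags and tags that consist only of a private-use sequence. The names `LANGTAG` and `is_well_formed` are illustrative, not taken from any specification or library.

```python
import re

# Simplified BCP 47 "langtag" production (assumption: grandfathered tags and
# private-use-only tags such as "x-foo" are intentionally not covered).
LANGTAG = re.compile(r"""
    (?:[a-z]{2,3}(?:-[a-z]{3}){0,3}|[a-z]{4,8})   # language, optional extlang
    (?:-[a-z]{4})?                                # script
    (?:-(?:[a-z]{2}|[0-9]{3}))?                   # region
    (?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*      # variants
    (?:-[0-9a-wy-z](?:-[a-z0-9]{2,8})+)*          # extensions
    (?:-x(?:-[a-z0-9]{1,8})+)?                    # private use
""", re.IGNORECASE | re.VERBOSE)


def is_well_formed(tag: str) -> bool:
    """Return True if the tag matches the simplified syntax above."""
    return LANGTAG.fullmatch(tag) is not None


# Syntax only: no subtag is looked up in a registry, so a typo that still
# has the right shape passes.
for tag in ("en", "em", "zh-Hant-TW", "en_US"):
    print(tag, is_well_formed(tag))
```

A check like this accepts both "en" and "em", since it never consults the subtag registry; only a validity check would tell them apart.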

from Ivan Herman:

HTML requires validity:
https://html.spec.whatwg.org/multipage/dom.html#the-lang-and-xml:lang-attributes
The only counter-argument I may have against requiring validity is that I have no idea how it could be properly (and easily) checked by an implementation.

r12a · 2019-09-11T02:43:16Z

This was discussed during the i18n telecon, and @aphillips wrote a proposed edit to clarify why well-formed is usually sufficient. See https://github.com/w3c/bp-i18n-specdev/pull/34/files/105cc74bae89d08312d87736c5bb15b26fc450a8..8ce3958fea79166494296788c7e6162999d4d5fc for the PR, and https://aphillips.github.io/bp-i18n-specdev/#sec_lang_values for the rendered version.

mattgarrish · 2019-09-12T14:02:27Z

I realize user agents probably don’t care about all of the parts of the tag, but there’s no lenience between strict adherence and no checking, and that still makes picking one over the other hard to assess (at least for me!).

If we only say that tags be well-formed, then, as I understand it, I can write this:

"@context": {
     "language": "em"
}

instead of “en”, and it won't result in a warning because it’s well formed.

The problem here is that it leads to subtle bugs. The only indication of a mistake may come when a user agent fails to load a dictionary or preload a tts engine, for example, which may not be realized until a publication has already reached the user.

If we chose strict validity, then every subtag has to be valid, and I agree that in most cases it's not information that the user agent cares about. For us, it's probably also information that isn't going to be specified or checked.

But given the two extremes, it seems more practical to warn users about the language being invalid, even if the rest of the subtags go unexamined. How do we go about this, though?

Is it reasonable to assert well-formedness and also require a valid language as an additional requirement?
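
One way to picture that middle ground is the following Python sketch, which warns when the tag is malformed or when its primary language subtag is unknown, and leaves every other subtag unexamined. Everything here is an assumption for illustration: `KNOWN_LANGUAGES` is a tiny hypothetical stand-in for the language records of the IANA Language Subtag Registry, `SUBTAGS` is a loose shape check that is weaker than the real ABNF, and `language_warnings` is not an API of any existing tool.

```python
import re
from typing import List

# Hypothetical, tiny stand-in for the registry's primary language subtags.
# A real checker would load the full IANA Language Subtag Registry.
KNOWN_LANGUAGES = {"de", "en", "es", "fr", "ja", "zh"}

# Loose shape check only: hyphen-separated runs of 1-8 ASCII alphanumerics.
# This is weaker than the full BCP 47 ABNF.
SUBTAGS = re.compile(r"[A-Za-z0-9]{1,8}(?:-[A-Za-z0-9]{1,8})*")


def language_warnings(tag: str) -> List[str]:
    """Return warnings for a tag; an empty list means nothing to report."""
    if SUBTAGS.fullmatch(tag) is None:
        return [f"'{tag}' is not well formed"]
    # Only the first subtag is checked against the known languages.
    # Private-use ("x-...") and grandfathered tags would need special casing.
    primary = tag.split("-")[0].lower()
    if primary not in KNOWN_LANGUAGES:
        return [f"'{primary}' is not a known primary language subtag"]
    return []


print(language_warnings("en-US"))
print(language_warnings("em"))
```

With this shape of check, "em" produces a warning while "en-US" passes without the region subtag ever being looked up.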

aphillips · 2019-09-13T02:08:46Z

@mattgarrish Thanks for the comment.

Generally, you want to require valid language tags in content, even if your normative requirement on implementations only extends to well-formed checking. Most specifications are second-order consumers of language metadata--they are using data already provided in the document format (HTML @lang, XML xml:lang, or the document format's language fields/attributes).

Generally most specifications are concerned with selecting resources (such as spell check, tokenizers, fonts, etc.) or with matching (selecting which string to show, for example) and don't directly care about the content of the language tag. Invalid-but-well-formed tags just don't match anything and usually fallback schemes provide some behavior that is appropriate.
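
The "don't match anything, then fall back" behavior can be sketched in Python as below. This is a simplified version of the lookup scheme described in RFC 4647 (progressively truncating the tag from the end), not a full implementation of it. The function name `lookup`, the list of available resources, and the default value are all hypothetical.

```python
from typing import List


def lookup(tag: str, available: List[str], default: str) -> str:
    """Pick the best available resource for a tag by truncating it from the end."""
    # Compare case-insensitively, but return the resource's original spelling.
    by_lower = {name.lower(): name for name in available}
    subtags = tag.lower().split("-")
    while subtags:
        candidate = "-".join(subtags)
        if candidate in by_lower:
            return by_lower[candidate]
        subtags.pop()
        # A single-character subtag (extension singleton or "x") is dropped
        # together with the subtag that followed it.
        if subtags and len(subtags[-1]) == 1:
            subtags.pop()
    # Nothing matched, e.g. for a well-formed but unregistered tag.
    return default


# Hypothetical set of spell-check dictionaries a user agent might ship.
dictionaries = ["en", "en-GB", "fr"]
print(lookup("en-GB-oxendict", dictionaries, default="en"))
print(lookup("em", dictionaries, default="en"))
```

An invalid-but-well-formed tag such as "em" simply falls through to the default here, which is the silent behavior being discussed: nothing breaks, but nothing flags the typo either.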

There might be cases where a specification really wants implementation-level checking. In those cases, the result of a tag failing to be valid has to be specified (die? warn? what?). It's also a problem that the registry changes over time, so each implementation is registry-version dependent. The changes over time are small, minor, and mostly "not that interesting", but they do exist and real users may encounter interoperability issues if random (out of date) spec implementations start barfing on their (perfectly valid) language tags.

So I generally agree with you and the edit I'm working on hopefully spells this out better than the current text. Thoughts?

mattgarrish · 2019-09-16T00:53:18Z

Sure, that's true. We inherit language from json-ld, and even the inLanguage property we inherit from schema.org already says to "please" use a tag from bcp47, which I assume means a registered tag. We're not actually defining our own.

I just brought this up with the publishing working group and it appears that no one cares too much about invalid language tags, assuming, as you say, that the end result is no harm. So I guess I'm just the outlier worrying too much. :)

aphillips · 2022-05-12T16:25:50Z

I think this is already dealt with by the current set of recommendations. I am closing this issue, but will file new ones on the section about BCP47, since there's actually text in that section suggesting more work :-(.

dauwhe mentioned this issue Feb 18, 2021

(i18n) What should the RS do if a language value is not well formed? w3c/epub-specs#1508

Closed

mattgarrish mentioned this issue Feb 22, 2021

(i18n) Using a "should" for a valid language tag? w3c/epub-specs#1509

Closed

aphillips closed this as completed May 12, 2022
