Valid or well-formed language tags? #36

Closed
r12a opened this issue Sep 11, 2019 · 5 comments
Comments

@r12a
Contributor

r12a commented Sep 11, 2019

Comments arising from self-review at either w3c/json-ld-wg#93 or w3c/pub-manifest#38 (not clear which).

from Matt Garrish:

According to BCP47, a well-formed tag is only one that conforms to the ABNF. A valid one is one that also has a recognized language tag (plus some other restrictions). https://tools.ietf.org/html/bcp47#section-2.2.9
Sure we only want well-formed tags? Our schema looks more attuned to valid.

from Ivan Herman:

HTML requires validity:
https://html.spec.whatwg.org/multipage/dom.html#the-lang-and-xml:lang-attributes
The only counter-argument I may have for validity is that I have no idea how that could be properly (and easily) checked by an implementation….

@r12a
Contributor Author

r12a commented Sep 11, 2019

This was discussed during the i18n telecon, and @aphillips wrote a proposed edit to clarify why well-formed is usually sufficient. See https://github.com/w3c/bp-i18n-specdev/pull/34/files/105cc74bae89d08312d87736c5bb15b26fc450a8..8ce3958fea79166494296788c7e6162999d4d5fc for the PR, and https://aphillips.github.io/bp-i18n-specdev/#sec_lang_values for the rendered version.

@mattgarrish
Member

I realize user agents probably don’t care about all of the parts of the tag, but there’s no lenience between strict adherence and no checking, and that still makes picking one over the other hard to assess (at least for me!).

If we only say that tags be well-formed, then, as I understand it, I can write this:

"@context": {
     "language": "em"
}

instead of “en”, and it won't result in a warning because it’s well formed.

The problem here is that it leads to subtle bugs. The only indication of a mistake may come when a user agent fails to load a dictionary or preload a tts engine, for example, which may not be realized until a publication has already reached the user.

If we chose strict validity, then every subtag has to be valid, and I agree that in most cases it's not information that the user agent cares about. For us, it's probably also information that isn't going to be specified or checked.

But given the two extremes, it seems more practical to warn users about the language being invalid, even if the rest of the subtags go unexamined. How do we go about this, though?

Is it reasonable to assert well-formedness and also require a valid language as an additional requirement?
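
The combination asked about here (well-formedness plus a valid language) can be sketched roughly as follows. This is only an illustration under stated assumptions: the pattern is a simplified reading of the BCP47 `langtag` ABNF that leaves out grandfathered and private-use-only tags, `check_tag` is a hypothetical helper name, and `KNOWN_LANGUAGES` is a tiny stand-in for the IANA Language Subtag Registry, not the real registry.

```python
import re

# Simplified BCP47 "langtag" pattern. Grandfathered tags and
# private-use-only tags (e.g. "x-custom") are left out of this sketch.
WELL_FORMED = re.compile(
    r"""
    (?:[a-z]{2,3}(?:-[a-z]{3}){0,3}|[a-z]{4,8})   # language, optional extlang
    (?:-[a-z]{4})?                                # script
    (?:-(?:[a-z]{2}|[0-9]{3}))?                   # region
    (?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*      # variants
    (?:-[0-9a-wy-z](?:-[a-z0-9]{2,8})+)*          # extensions
    (?:-x(?:-[a-z0-9]{1,8})+)?                    # private use
    """,
    re.IGNORECASE | re.VERBOSE,
)

# Hypothetical stand-in for the registry: a tiny sample only.
KNOWN_LANGUAGES = {"de", "en", "fr", "ja", "zh"}


def check_tag(tag):
    """Return (well_formed, language_known) for a language tag."""
    if WELL_FORMED.fullmatch(tag) is None:
        return (False, False)
    # Only the primary language subtag is looked up; the rest goes unexamined.
    primary = tag.split("-")[0].lower()
    return (True, primary in KNOWN_LANGUAGES)


print(check_tag("en"))     # (True, True)
print(check_tag("em"))     # (True, False): well-formed, but not a known language
print(check_tag("en_US"))  # (False, False): underscore is not allowed
```

With a check like this, the "em" typo above would be reported, while script, region, and variant subtags are only checked for shape.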

@aphillips
Contributor

@mattgarrish Thanks for the comment.

Generally, you want to require valid language tags in content, even if your normative requirement on implementations only extends to well-formed checking. Most specifications are second-order consumers of language metadata--they are using data already provided in the document format (HTML @lang, XML xml:lang, or the document format's language fields/attributes).

Generally, most specifications are concerned with selecting resources (such as spell checkers, tokenizers, fonts, etc.) or with matching (selecting which string to show, for example) and don't directly care about the content of the language tag. Invalid-but-well-formed tags simply don't match anything, and fallback schemes usually provide some appropriate behavior.
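
As a rough illustration of that fallback behavior, here is a minimal sketch of resource selection by progressive truncation, loosely in the spirit of the lookup scheme in RFC 4647. The function name `lookup`, the dictionary file names, and the default value are all hypothetical.

```python
def lookup(tag, available, default=None):
    """Pick a resource for `tag` by progressively dropping trailing subtags."""
    resources = {name.lower(): value for name, value in available.items()}
    subtags = tag.lower().split("-")
    while subtags:
        candidate = "-".join(subtags)
        if candidate in resources:
            return resources[candidate]
        subtags.pop()
        # A single-character subtag (such as "x" or "u") cannot end a tag,
        # so drop it as well before trying the next candidate.
        if subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return default


# Hypothetical spell-check dictionaries keyed by language tag.
dictionaries = {"en": "english.dic", "en-GB": "british.dic", "fr": "french.dic"}

print(lookup("en-GB-oxendict", dictionaries))     # british.dic
print(lookup("en-US", dictionaries))              # english.dic
print(lookup("em", dictionaries, "default.dic"))  # default.dic
```

The mistyped "em" never raises an error here; it just matches nothing and lands on the default, which is the "no harm" outcome described above.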

There might be cases where a specification really wants implementation-level checking. In those cases, the result of a tag failing to be valid has to be specified (die? warn? what?). It's also a problem that the registry changes over time, so each implementation is registry-version dependent. The changes over time are small, minor, and mostly "not that interesting", but they do exist and real users may encounter interoperability issues if random (out of date) spec implementations start barfing on their (perfectly valid) language tags.
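
A minimal sketch of what "specifying the result" could look like, assuming a hypothetical `check_language` helper with an `on_invalid` policy parameter. The two registry snapshots are made-up placeholders (including the subtag "abc") and do not reflect actual registry contents or history; they only show how two implementations can disagree about the same tag.

```python
import warnings


def check_language(primary, registry_snapshot, on_invalid="warn"):
    """Check a primary language subtag against one snapshot of the registry.

    on_invalid: "error" raises, "warn" emits a warning, anything else is silent.
    """
    if primary.lower() in registry_snapshot:
        return True
    message = f"language subtag {primary!r} not found in registry snapshot"
    if on_invalid == "error":
        raise ValueError(message)
    if on_invalid == "warn":
        warnings.warn(message)
    return False


# Placeholder snapshots: the newer one contains a subtag the older one lacks.
older_snapshot = {"en", "fr"}
newer_snapshot = {"en", "fr", "abc"}

print(check_language("abc", newer_snapshot))                      # True
print(check_language("abc", older_snapshot, on_invalid="ignore")) # False
```

An implementation built against the older snapshot with `on_invalid="error"` would reject a tag that the newer one accepts, which is the interoperability risk described above.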

So I generally agree with you and the edit I'm working on hopefully spells this out better than the current text. Thoughts?

@mattgarrish
Member

Sure, that's true. We inherit language from JSON-LD, and even the inLanguage property we inherit from schema.org already says to "please" use a tag from BCP47, which I assume means a registered tag. We're not actually defining our own.

I just brought this up with the publishing working group and it appears that no one cares too much about invalid language tags, assuming, as you say, that the end result is no harm. So I guess I'm just the outlier worrying too much. :)

@aphillips
Contributor

I think this is already dealt with by the current set of recommendations. I am closing this issue, but will file new ones on the section about BCP47, since there's actually text in that section suggesting more work :-(.

3 participants