Require valid language tags #167
SHOULD instead of MUST?
This issue was discussed in a meeting.
Language tag normalization

Gregg Kellogg: #167
Gregg Kellogg: This is a slightly different issue, but they are quite related. One is that JSON-LD requires that language tags be normalized to lowercase, whereas RDF Concepts says that implementations MAY do that. My implementation does that, and it is important for SPARQL querying; otherwise you have to implement it at query time, which is impractical for most of us.
… That's the genesis of it – the fact that JSON-LD makes it a requirement is a little odd, but it makes it easier to test the results.
… Dave might have a recollection of that.
… My thought is that we pull back on MUST and go to MAY, and add a provision in the testing README that says that, for testing purposes, implementations need to consider that language tag case may vary so they can properly report.
Ivan Herman: The problem I see with the current setup: if I do round-tripping – if I run one of the various algorithms and take the output – then it will all be normalized to lowercase. This is not what the i18n people would expect (e.g., they want zh-Hans).
Gregg Kellogg: It's not just RDF round-tripping; it happens in expansion.
Ivan Herman: Yes, and the problem is that the habit in the i18n community is, for whatever reason, to say "en-UK" in capital letters. There are all kinds of additional rules that users may expect to see and that they use when they create JSON-LD, and that will go away when it round-trips; this may be surprising.
Rob Sanderson: One thing to consider – while the proposal is technically worthwhile, we should consider backwards compatibility with 1.0, since we're changing a requirement and reducing it down to a MAY.
… Any processor that's relying on that requirement may stop functioning.
Gregg Kellogg: That should properly exist in the normalization algorithm – it can sign any RDF, so if you're coming from any other format there is no guarantee coming from other syntaxes.
Dave Longley: Was just going to say it does need to be normalized for signing. Can put it at a different layer, but we can't prohibit it.
… tried to make it clear that normalization can happen and does need to with a concrete syntax
… when you're serializing you need to output lowercase so it's the same binary stream of bytes to sign
Ivan Herman: only when you do a signature
… but if you're just producing Turtle there's no requirement
Gregg Kellogg: The output of the normalization routine is quads, with special handling for bnodes.
Ivan Herman: The question that arose is: even if we agree to do it in an upper layer, is there software that will break?
… other implementations of the signature, for example
Gregg Kellogg: Not part of JSON-LD. If you were to normalize from another serialization it would still exist.
Dave Longley: You are right. Not sure if we're changing from MUST or MUST NOT … that would also have an impact … but as Gregg is saying, the libraries should be normalizing.
Gregg Kellogg: Recommendation is to follow RDF Concepts and go with a MAY.
Rob Sanderson: Any other thoughts?
… General agreement that the canonize libraries should be outputting lowercase anyway, and we can remove this hard restriction.
Gregg Kellogg: It seems like we should resolve to do it. It probably means creating a lot of changes to the test suite.
Proposed resolution: Change requirement to lowercase language tags to be a MAY from MUST (Rob Sanderson)
Ivan Herman: +1
Tim Cole: +1
Rob Sanderson: +1
Pierre-Antoine Champin: +1
Dave Longley: +1
Benjamin Young: +1
Gregg Kellogg: +1
Resolution #4: Change requirement to lowercase language tags to be a MAY from MUST
Rob Sanderson: There was a suggestion to do SHOULD…
Gregg Kellogg: No, I don't think we should do that; this aligns with other RDF serializations.

4.4. Language tag values and validation

Rob Sanderson: #167 (still)
Gregg Kellogg: The other thing – whether language tag values need to be validated.
… Currently the specs say different things. The definition says the language tag MUST be well formed according to BCP47, but the algorithms say implementations must not attempt to fix any invalid IRIs or language tags. I think that's been interpreted as not validating language tags.
Ivan Herman: Why is it there?
Gregg Kellogg: I can see why you shouldn't fix URIs, but not language tags.
Dave Longley: I don't recall either.
Gregg Kellogg: "JSON-LD Processors MUST NOT attempt to correct malformed IRIs or language tags; however, they MAY issue validation warnings. IRIs are not modified other than conversion between relative and absolute IRIs."
Gregg Kellogg: Clearly, to be consistent with saying they must be well formed, the algorithms should be updated, and a version of Ruben's PR should be updated.
Rob Sanderson: I agree we're inconsistent. On the correcting: is the implication that if there was, say, an errant space ("en ") you shouldn't try to guess that it was just supposed to be "en"?
Gregg Kellogg: That's what the current text says. I believe the new text would say that that form of a language tag would be invalid, and that at that step either the algorithm would be aborted or that particular language tag would be ignored.
… I think for it to be an error would mean that the promise would be rejected.
Ivan Herman: My feeling is – is the promise rejected and the whole JSON-LD processing throws up its arms, or is the language tag ignored?
Gregg Kellogg: If we do that, the only way we have to do it is to issue a warning. Rejecting the promise means we abort the whole step.
Rob Sanderson: The question to me … is this more like a syntax error, in which case we should abort because it's rubbish.
Gregg Kellogg: For any other RDF serialization it's an error.
Rob Sanderson: Or is it more like an undefined property name, at which point we'd ignore it.
Pierre-Antoine Champin: I think we're very much in the case where we might break schema.org data. We have no data on whether people out there are using proper language tags, and if old processors accept them we might make Dan Brickley very angry, and rightfully so.
… I would lean towards ignoring them, as with undefined properties.
Gregg Kellogg: +1
Dave Longley: I'm of two minds. If this was a fresh spec without implementations, there's no question we should abort.
Ivan Herman: +1 to dlongley
Benjamin Young: +1 to dlongley
Dave Longley: if we're concerned about existing systems, we should keep it rather than dropping it
Gregg Kellogg: I think that's a compelling argument. We should issue a warning but continue processing. The exception would be for the i18n datatype and compound literal; that would be an opportunity to lock that down. What we currently say is that we can't generate any invalid triples.
… In the case where a language tag didn't match the regex, it would be reasonable to reject one coming from the i18n URI or the compound literal as a hard failure.
… That being the case, for warning purposes and rejection purposes, there are two options of regexes for how to verify that.
Ivan Herman: Yes, Dave's argument is compelling. The only thing we can say without any problems is that the processor MUST from now on issue a warning that this is illegal. But we should not stop processing.
Dave Longley: +1, not testable
Gregg Kellogg: We don't have a way to test for warnings.
… From Pierre-Antoine and Dave both – I don't know how much data is there from schema.org … we could say that if you're in a hard 1.0 mode you let it pass but reject in 1.1, but that flies in the face of what we've been doing.
… We drop triples that are invalid and that's it; that's the current behavior.
Pierre-Antoine Champin: If we did specify a way to turn warnings into errors – and that's often a standard kind of thing – we could force those tests to use this mode and get the error, and that would make it testable.
Gregg Kellogg: I think the way we'd do that is to add an option to treat language tag problems as errors.
… There are other places we generate warnings.
Rob Sanderson: The current behavior, per the issue, is that current processors may issue validation warnings but they don't have to.
Dave Longley: I would only support doing a hard error if we found that existing implementations already did it.
Gregg Kellogg: I think since all expansion tests are also toRDF tests, anything that failed to transform would have been caught.
Proposed resolution: Update the API conformance section to say that invalid language tags SHOULD generate a warning (Rob Sanderson)
Ivan Herman: +1
Rob Sanderson: +1
Benjamin Young: +1
Gregg Kellogg: +1
Tim Cole: +1
Pierre-Antoine Champin: +1
Dave Longley: +1
Resolution #5: Update the API conformance section to say that invalid language tags SHOULD generate a warning
Gregg Kellogg: The question becomes how we validate language tags; there are some different regexes out there.
… RDF might accept something that is not a valid language tag … because its regex is looser.
… In the RDF translation it's not just a warning.
Gregg Kellogg: LANGTAG ::= "@" [a-zA-Z]+ ( "-" [a-zA-Z0-9]+ )* (from Turtle)
Rob Sanderson: +1
Gregg Kellogg: obs-language-tag = primary-subtag *( "-" subtag )
Gregg Kellogg: primary-subtag = 1*8ALPHA
Gregg Kellogg: subtag = 1*8(ALPHA / DIGIT)
Ivan Herman: Turtle is much looser – the regex from BCP47 is way more complicated, multi-line, etc.
Gregg Kellogg: There may be something even more complicated than what's in BCP47.
Pierre-Antoine Champin: I wanted to point out – Ivan, you pointed me at the awful regex, and that one is trying to qualify the different parts of BCP47; it was for capturing the various parts of a language tag. The other one, for validation, is much simpler.
Proposed resolution: Use the less strict RDF regex, not BCP47's, to determine validity (Rob Sanderson)
Rob Sanderson: +1
Pierre-Antoine Champin: +1
Dave Longley: +1
Ivan Herman: +1
Tim Cole: +0
Gregg Kellogg: +0
Gregg Kellogg: Given that, I'd vote for BCP47 … I think the only difference is in specifying it; it's not a real difference.
… It's only incompatible when other syntaxes have accepted something that's incompatible with BCP47.
Benjamin Young: +1
Ivan Herman: I think we would get problems if we weakened our check from before. From the i18n people in particular.
… They checked Turtle. It might be a bug report for Turtle/on the RDF specs.
Gregg Kellogg: Because they can accept things that aren't legal BCP47.
Proposed resolution: Use stricter BCP47 syntax and file an errata for RDF to request an update to its definition to match BCP47 (Rob Sanderson)
Ivan Herman: The i18n review was ineffective … it was much weaker, let alone the issue around direction. This also shows it was weak; I didn't realize that, but that's the way it is. I think it's a bug over there.
Rob Sanderson: +1
Dave Longley: +1
Ivan Herman: +1
Gregg Kellogg: +1
Pierre-Antoine Champin: +1
Tim Cole: +1
Benjamin Young: +1
Resolution #6: Use stricter BCP47 syntax and file an errata for RDF to request an update to its definition to match BCP47
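Resolution #4 only makes lowercasing optional in the JSON-LD algorithms; as Gregg and Dave note above, canonicalization and signing still want byte-identical output. As a rough, hedged illustration of pushing the lowercasing to the concrete-syntax layer, here is a minimal TypeScript sketch; the interface, function name, and option are hypothetical and not part of any spec or library.

```typescript
// Hypothetical sketch: leave the language tag untouched in the JSON-LD
// algorithms (MAY lowercase), and lowercase only when emitting a concrete
// serialization such as N-Quads for canonicalization/signing, so the
// signed byte stream is identical across processors.
interface LiteralTerm {
  value: string;
  language?: string;   // e.g. "en-UK", "zh-Hans"
  datatype?: string;
}

function toNQuadsLiteral(term: LiteralTerm, lowercaseLanguage = true): string {
  const escaped = term.value.replace(/\\/g, "\\\\").replace(/"/g, '\\"');
  if (term.language) {
    const lang = lowercaseLanguage
      ? term.language.toLowerCase()   // "zh-Hans" -> "zh-hans" in the quads
      : term.language;                // preserve the author's casing
    return `"${escaped}"@${lang}`;
  }
  if (term.datatype) {
    return `"${escaped}"^^<${term.datatype}>`;
  }
  return `"${escaped}"`;
}

// toNQuadsLiteral({ value: "colour", language: "en-UK" })        -> "colour"@en-uk
// toNQuadsLiteral({ value: "colour", language: "en-UK" }, false) -> "colour"@en-UK
```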
Currently, JSON-LD doesn't do any language-tag validation. However, [RDF Concepts](https://www.w3.org/TR/rdf11-concepts/#section-Graph-Literal) does require that language tags be well formed according to BCP47:
The other RDF syntaxes use something like the Turtle `LANGTAG` production to validate language tags. I could see us changing from a MAY validate language tags to a MUST be valid, either against that production's pattern, `[a-zA-Z]+ ( "-" [a-zA-Z0-9]+ )*`, or against the stricter `Language-Tag` ABNF from BCP47. This should probably be done in the context processing, expansion, and fromRDF algorithms to be consistent. The RDF Concepts regex is probably the one to use, as we're in that family.
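For illustration only, here is a minimal TypeScript sketch of the looser check along the lines of that pattern; the anchoring and the function name are mine, and a full BCP47 well-formedness check (subtag length limits, extensions, private use, etc.) is stricter than this.

```typescript
// Loose, Turtle/RDF-style language tag check:
//   [a-zA-Z]+ ( "-" [a-zA-Z0-9]+ )*
// Anchored so the whole string must match; "en ", "", and "123" all fail.
const RDF_LANGTAG = /^[a-zA-Z]+(-[a-zA-Z0-9]+)*$/;

function isWellFormedLanguageTag(tag: string): boolean {
  return RDF_LANGTAG.test(tag);
}

// isWellFormedLanguageTag("en")            -> true
// isWellFormedLanguageTag("zh-Hans")       -> true
// isWellFormedLanguageTag("en ")           -> false (errant space)
// isWellFormedLanguageTag("abcdefghijklm") -> true here, though BCP47 is
//   stricter (primary language subtags are at most 8 characters).
```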
We do say that a language tag MUST be well formed:
The Conformance Section of the API document waffles on this:

> JSON-LD Processors MUST NOT attempt to correct malformed IRIs or language tags; however, they MAY issue validation warnings. IRIs are not modified other than conversion between relative and absolute IRIs.
Note the fine distinction between correcting and validating, and that warnings MAY be issued rather than errors MUST be generated.
This should probably change to say that processors MUST generate an error and either abort, or ignore the invalid IRIs or language tags (or base directions). We do require that these be well formed when generating RDF.
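For concreteness, here is a hedged TypeScript sketch of the two behaviors discussed in the minutes: warn and ignore by default, or fail hard when a strict option is set. The option name, callback, and error text are illustrative only; the WG discussed adding some option to escalate language tag warnings to errors but has not specified one.

```typescript
// Hypothetical sketch: handling an invalid language tag encountered
// during expansion/toRDF. Names are illustrative, not a real API.
interface ProcessingOptions {
  strictLanguageTags?: boolean;     // illustrative option name
  warn?: (message: string) => void; // warning callback
}

const RDF_LANGTAG = /^[a-zA-Z]+(-[a-zA-Z0-9]+)*$/; // loose, Turtle-style check

function handleLanguageTag(
  tag: string,
  options: ProcessingOptions = {}
): string | null {
  if (RDF_LANGTAG.test(tag)) {
    return tag;
  }
  if (options.strictLanguageTags) {
    // In the API this would surface as a rejected promise.
    throw new Error(`invalid language tag: ${JSON.stringify(tag)}`);
  }
  // Default: SHOULD generate a warning, then ignore the value so the
  // corresponding triple is simply not generated.
  options.warn?.(`ignoring invalid language tag: ${JSON.stringify(tag)}`);
  return null;
}

// handleLanguageTag("en ", { warn: console.warn })       -> null (warns)
// handleLanguageTag("en ", { strictLanguageTags: true }) -> throws
// handleLanguageTag("zh-Hans")                           -> "zh-Hans"
```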