Description
While writing Section 10 on output, I was implementing the formatting in my library, Manatee.Json. In that library, I don't collect annotations in the same way as described in the spec. As a result the annotation output of a successful validation was not fully explored.
I have recently been working on a new validator that does follow the recommendations of the spec in regard to annotation collection and have reached the same questions that many other implementors have asked me regarding annotations in the output format, to which I naively responded that annotation output is analogous to error output.
I aim to fix this. This issue is a compilation of questions and issues I've experienced trying to reimplement output formatting as well as potential solutions to some of them.
1. Annotations vs Nested Results
With the current output, annotations are to be represented as their own output unit within the annotations
collection, including the valid
property and all JSON pointers indicating where the annotation came from and to what part of the instance it applies. Additionally, any nested results were also to be in this array. This conflates annotations with nested results and the only way to identify which were annotations and which were nested results was via the location properties.
To resolve this, we should split these collections. Nested results should remain as they are, but the property name needs to change.
The proposed fix adds (or hijacks, rather) a single property, annotations
to the output unit. It's only present for valid = true
and when there are annotations (no empty arrays). The value is an array of all annotations for this node.
For instance, properties
requires that it adds the names of any validated properties as annotations. To this end, the annotations
on the properties
node will be an array of these property names.
{
"valid": true,
"keywordLocation": "#/properties",
"instanceLocation": "#",
"annotation": [
"foo",
"bar"
]
}
I'm not sure how this fits in with collecting unknown keywords as annotations since that would effectively require a node for each unknown keyword, but maybe that's fine (the location pointer would indicate the keyword, and the annotation value would be the unknown keyword's value).
This idea also may invert the intent behind dropping combining annotations, but I haven't thought through that completely, yet.
This is also in line with @ssilverman's comment regarding how his implementation handles annotations internally. (My new validator uses a similar mechanism.)
2. Nested Nodes
Having the nested result nodes under annotations
for validated results and errors
for invalidated results was, admittedly, me trying to be clever about things. Practically, it's difficult to deal with.
The proposal here is to have these always under a consistent name, e.g. nested
, results
, etc. I'm not too fussed about what the name is, but it should always be that regardless of the value of valid
.
3. Too Many Secrets Nodes
In trying to pare down the list of nodes for the detailed
and basic
formats, I found that a lot of nodes were kept that may not be necessary.
For example, in an error scenario, is it necessary to list passing subschemas of an allOf
? To a person just trying to figure out what's wrong, these are just noise to be filtered out in order to pinpoint the one subschema that failed.
The algorithm for reducing returned values could be adjusted, and maybe needs to be dynamic based on the scenario. Maybe if a schema is valid, I don't care about any of the details and I just want a ✔️. But for failures I want to know only those nodes that failed, but I want detailed info about them, and maybe in a format that mimicks the schema (as detailed
is intended to do). The current formats (and the algorithms to generate them) do not support this.
(Fake internet points for anyone who gets my reference in the title of this section.)
Metadata
Metadata
Assignees
Type
Projects
Status
Activity
jdesrosiers commentedon Aug 26, 2020
Recently, we worked on formalizing the process for embedding schemas of different drafts than the parent schema. If the output format is changed, how would output for embedded schemas work? Currently, it's not a problem to just use 2019-09 for everything because none of the other drafts specify an output format.
I think it should definitely not change output format mid-validation. The output format of the top level schema should be used for the whole validation. I don't think this is controversial, but it is a detail that we'll have to address as we make changes to the output format.
jdesrosiers commentedon Aug 26, 2020
Not sure if I should be piling on here or creating another issue, but this seems like a good time to bring up something that bothers me about the output format.
Currently
keywordLocation
is described as a required properties (technically a SHOULD, but described as required) whileabsoluteKeywordLocation
is described as optional. I think this is backwards. Since draft 2019-09, the value of a schema with a JSON Pointer that crosses a document boundary is undefined. That means thatkeywordLocation
as defined doesn't properly identity a keyword.I can see the value of this keyword for human debugging, so I'm not proposing throwing it out entirely, but it should at least be optional and not a URI (just a pointer).
In my implementation, I don't include
keywordLocation
in the output. Because it's not a true identifier, it would actually be fairly difficult (awkward and messy at the least) for me to implement.jdesrosiers commentedon Aug 26, 2020
I don't think the output format should trim anything. Nodes that are superfluous in one context might be helpful in another. Producing human readable output should be a process that takes the output as an argument and transforms it into something nice. https://github.com/atlassian/better-ajv-errors is an example of this concept except it uses the ajv proprietary output format rather than the standard one. I'm hoping they will come out with a version that supports the standard output format so I don't have to make my own.
If we trim the output results to improve one way of consuming the output, we might be limiting what can be done for other ways of consuming output. For example, imagine a regex101.com -like online schema validator. It might find some of those "redundant" nodes useful to build visualizations where they are just noise for text-based human reporting.
Therefore, I don't think the spec should be defining an algorithm to reduce returned nodes. We can leave that to third party tools to decide what works in their domain. It's definitely a problem we as implementors need to work on solving, but not at the spec level.
gregsdennis commentedon Aug 26, 2020
This is correct as is. You will always have a location relative to the root schema, which is what
keywordLocation
gives. However, you will only have an absolute location if the root schema defines an absolute URI for$id
. Since$id
is optional in the schema, it makes sense forabsoluteKeywordLocation
to be optional in the output.Additionally,
keywordLocation
gives you how you got to the keyword, including dereferences, which is generally more useful for debugging than just where the keyword is, which is whatabsoluteKeywordLocation
gives you.This is why we're discussing it. A lot of times, you just want the problem. Other times, you'd like the details. We need to isolate the rules around these scenarios. @ssilverman has some logic that he's implemented in his library already that may serve as a starting point.
karenetheridge commentedon Aug 26, 2020
In my implementation, I omit absoluteKeywordLocation from the error/annotation output iff it is identical to keywordLocation (that is, if no $ref was traversed while getting to this location, AND there is no canonical URI provided for the schema) -- I agree that it is helpful to have both pieces of information.
jdesrosiers commentedon Aug 27, 2020
I think you're missing the point. I'm saying that the output format should be like an assembly language. It's a standardized, low level representation of the validation result. Starting from that assembly language, people can build whatever they need that is appropriate for their domain. Since it's impossible for us the predict every way someone might want to use the output format, I think this is the only feasible path.
This sounds like a great argument for why an id should not be optional (not necessarily with
$id
). Otherwise, the output format has no way to properly identify the things it's describing. It requires a human to connect the dots and therefore limits the information I can get from the output programatically. The project I'm working on now would not be possible without an identifier for every schema. The output format should be optimized for machine readability, not human readability. The human readable version can be generated from the machine readable version. (In my implementation I designed the API such that I always have an identifier. The only way to use a schema is to use its identifier. So, I have no issue with leaving outkeywordLocation
because I always have an identifier.)I'd love to see
keywordLocation
optional or even removed, but I'd be happy with changing it to require at least one ofkeywordLocation
orabsoluteKeywordLocation
rather than require justkeywordLocation
.gregsdennis commentedon Aug 27, 2020
That your scenario requires an ID doesn't imply that every scenario will. Relative location has its importance in how you got to the keyword, not just where the keyword is. This is why
keywordLocation
is required.I'm just saying that it needs to be discussed. I'm open to discussion, and some of your points are valid. Let's please keep the tone of this conversation open.
If you have structural suggestions, please share them. Also keep in mind the months of conversational background that led to the current state. There are requirements that we were dealing with that your use case may not have.
This is precisely the approach we took. What is now in the spec evolved from that.
karenetheridge commentedon Aug 28, 2020
Why is
allOf
producing any annotations at all? the spec says: "[These applicator keywords] have no direct impact on annotation collection, although they enable the same annotation keyword to be applied to an instance location with different values. Annotation keywords define their own rules for combining such values." My interpretation of this is that allOf creates no annotations - it merely passes along the annotations from the (passing) subschemas that it contains.PS. @gregdennis in your original problem description above, can you clarify which output format you are discussing?
jdesrosiers commentedon Aug 29, 2020
The point is that the output format needs to be a common denominator for any application that someone might want to use it for. If my use case needs it, others will too. If the output format doesn't require it, my use case and many others won't be compatible with standard validators.
keywordLocation
, on the other hand, can be derived from the standard "detailed" output if everything has an identifier. It can be left out and there are no use cases that would be incompatible with a standard validator input. That's why it can optional.I'm not sure what you mean by that, but if any of what I've written has come across as rude or aggressive or inappropriate in any way, I apologize. It truly was not my intention. I know I can be suborn, but I only persist when I'm not getting my point across. If we understand each other and still don't agree on a resolution, I'm happy to drop it.
I followed along with the conversation and I have the utmost respect for all the effort everyone put in. At the time, I wasn't in a place to contribute to that particular conversation. I'm possibly the first person to try to do something with the output other than reporting validation results. I bring up my use case not to make this about me, but to show an example of where the current output format is insufficient. I think that's valuable feedback, but if it's not helpful, I'm happy to drop it.
gregsdennis commentedon Aug 30, 2020
This question isn't saying that
allOf
would be generating, or even carrying, annotations. It's merely asking whether the passing subschemas of a failedallOf
need to be included in the output. In fact, it explicitly would not return any annotations (generated or passed along) because it failed.I'm asking for an overhaul of the output. I think the structure of the node as well as the tree structure (of each format) of the output should be revisited. Question 1 looks at the node structure, whereas questions 2 & 3 look more at the tree structure.
32 remaining items
gregsdennis commentedon Oct 5, 2021
Working branch is draft-next-output.
jdesrosiers commentedon Oct 5, 2021
Unless I'm missing some contradicting wording somewhere else,
detailed
is not a SHOULD by itself. Implementingflag
orbasic
covers the requirement as well.gregsdennis commentedon Oct 5, 2021
🤔 yeah, I see what you're getting at. I'll do some digging to see where we decided that. Do you have an opinion on this requirement?
gregsdennis commentedon Oct 5, 2021
Looks like the most relevant conversations are:
jdesrosiers commentedon Oct 5, 2021
I don't have a strong opinion. I'd suggest that implementations MUST support
flag
and SHOULD supportbasic
ordetailed
. Anything beyond that is just nice-to-have.gregsdennis commentedon Oct 6, 2021
Given that I'm expanding
detailed
intodetailed-schema
(what we have currently) anddetailed-instance
(an instance-based format), would you say that implementations SHOULD provide just one of them? It seems odd to prefer one over another, whether by the spec or the implementation.I do like requiring
flag
, though.jdesrosiers commentedon Oct 6, 2021
I agree that we probably shouldn't prefer one over the other. So, I'd suggest: MUST support
flag
and SHOULD supportbasic
,detailed-schema
, ordetailed-instance
. I would personally (not in the spec) encourage implementations to support all of the output formats because each is useful in different situations, but as far as minimum requirements, at least one format with error details should be sufficient.karenetheridge commentedon Oct 6, 2021
I haven't had a chance to go over all of this in detail and respond to all the points that I want to respond to -- but I wonder if it might be worthwhile moving to a community discussion so we can have threaded conversations about the various points? Many of them are not directly related, and some require a deep dive to understand motivations and to examine tradeoffs. (We might need an ADR for these, too!)
gregsdennis commentedon Oct 7, 2021
Sounds good. I'll copy this to a discussion, making edits for the conversation @jdesrosiers and I have had so far. We can create threads for each item.
gregsdennis commentedon Oct 7, 2021
We don't have GH discussions on in this repo so I opened one in the Community repo with the General topic: json-schema-org/community#63
gregsdennis commentedon Sep 26, 2022
I think this is basically resolved at this point. Closing for now.