Skip to content

Restructuring of output formatting #973

Closed
@gregsdennis

Description

@gregsdennis

While writing Section 10 on output, I was implementing the formatting in my library, Manatee.Json. In that library, I don't collect annotations in the same way as described in the spec. As a result the annotation output of a successful validation was not fully explored.

I have recently been working on a new validator that does follow the recommendations of the spec in regard to annotation collection and have reached the same questions that many other implementors have asked me regarding annotations in the output format, to which I naively responded that annotation output is analogous to error output.

I aim to fix this. This issue is a compilation of questions and issues I've experienced trying to reimplement output formatting as well as potential solutions to some of them.

1. Annotations vs Nested Results

With the current output, annotations are to be represented as their own output unit within the annotations collection, including the valid property and all JSON pointers indicating where the annotation came from and to what part of the instance it applies. Additionally, any nested results were also to be in this array. This conflates annotations with nested results and the only way to identify which were annotations and which were nested results was via the location properties.

To resolve this, we should split these collections. Nested results should remain as they are, but the property name needs to change.

The proposed fix adds (or hijacks, rather) a single property, annotations to the output unit. It's only present for valid = true and when there are annotations (no empty arrays). The value is an array of all annotations for this node.

For instance, properties requires that it adds the names of any validated properties as annotations. To this end, the annotations on the properties node will be an array of these property names.

{
    "valid": true,
    "keywordLocation": "#/properties",
    "instanceLocation": "#",
    "annotation": [
        "foo",
        "bar"
    ]
}

I'm not sure how this fits in with collecting unknown keywords as annotations since that would effectively require a node for each unknown keyword, but maybe that's fine (the location pointer would indicate the keyword, and the annotation value would be the unknown keyword's value).

This idea also may invert the intent behind dropping combining annotations, but I haven't thought through that completely, yet.

This is also in line with @ssilverman's comment regarding how his implementation handles annotations internally. (My new validator uses a similar mechanism.)

2. Nested Nodes

Having the nested result nodes under annotations for validated results and errors for invalidated results was, admittedly, me trying to be clever about things. Practically, it's difficult to deal with.

The proposal here is to have these always under a consistent name, e.g. nested, results, etc. I'm not too fussed about what the name is, but it should always be that regardless of the value of valid.

3. Too Many Secrets Nodes

In trying to pare down the list of nodes for the detailed and basic formats, I found that a lot of nodes were kept that may not be necessary.

For example, in an error scenario, is it necessary to list passing subschemas of an allOf? To a person just trying to figure out what's wrong, these are just noise to be filtered out in order to pinpoint the one subschema that failed.

The algorithm for reducing returned values could be adjusted, and maybe needs to be dynamic based on the scenario. Maybe if a schema is valid, I don't care about any of the details and I just want a ✔️. But for failures I want to know only those nodes that failed, but I want detailed info about them, and maybe in a format that mimicks the schema (as detailed is intended to do). The current formats (and the algorithms to generate them) do not support this.

(Fake internet points for anyone who gets my reference in the title of this section.)

Activity

added this to the draft-09 milestone on Aug 26, 2020
jdesrosiers

jdesrosiers commented on Aug 26, 2020

@jdesrosiers
Member

Recently, we worked on formalizing the process for embedding schemas of different drafts than the parent schema. If the output format is changed, how would output for embedded schemas work? Currently, it's not a problem to just use 2019-09 for everything because none of the other drafts specify an output format.

I think it should definitely not change output format mid-validation. The output format of the top level schema should be used for the whole validation. I don't think this is controversial, but it is a detail that we'll have to address as we make changes to the output format.

jdesrosiers

jdesrosiers commented on Aug 26, 2020

@jdesrosiers
Member

Not sure if I should be piling on here or creating another issue, but this seems like a good time to bring up something that bothers me about the output format.

Currently keywordLocation is described as a required properties (technically a SHOULD, but described as required) while absoluteKeywordLocation is described as optional. I think this is backwards. Since draft 2019-09, the value of a schema with a JSON Pointer that crosses a document boundary is undefined. That means that keywordLocation as defined doesn't properly identity a keyword.

I can see the value of this keyword for human debugging, so I'm not proposing throwing it out entirely, but it should at least be optional and not a URI (just a pointer).

In my implementation, I don't include keywordLocation in the output. Because it's not a true identifier, it would actually be fairly difficult (awkward and messy at the least) for me to implement.

jdesrosiers

jdesrosiers commented on Aug 26, 2020

@jdesrosiers
Member

In trying to pare down the list of nodes for the detailed and basic formats, I found that a lot of nodes were kept that may not be necessary.

I don't think the output format should trim anything. Nodes that are superfluous in one context might be helpful in another. Producing human readable output should be a process that takes the output as an argument and transforms it into something nice. https://github.com/atlassian/better-ajv-errors is an example of this concept except it uses the ajv proprietary output format rather than the standard one. I'm hoping they will come out with a version that supports the standard output format so I don't have to make my own.

If we trim the output results to improve one way of consuming the output, we might be limiting what can be done for other ways of consuming output. For example, imagine a regex101.com -like online schema validator. It might find some of those "redundant" nodes useful to build visualizations where they are just noise for text-based human reporting.

Therefore, I don't think the spec should be defining an algorithm to reduce returned nodes. We can leave that to third party tools to decide what works in their domain. It's definitely a problem we as implementors need to work on solving, but not at the spec level.

gregsdennis

gregsdennis commented on Aug 26, 2020

@gregsdennis
MemberAuthor

Currently keywordLocation is described as a required properties (technically a SHOULD, but described as required) while absoluteKeywordLocation is described as optional. - @jdesrosiers

This is correct as is. You will always have a location relative to the root schema, which is what keywordLocation gives. However, you will only have an absolute location if the root schema defines an absolute URI for $id. Since $id is optional in the schema, it makes sense for absoluteKeywordLocation to be optional in the output.

Additionally, keywordLocation gives you how you got to the keyword, including dereferences, which is generally more useful for debugging than just where the keyword is, which is what absoluteKeywordLocation gives you.


If we trim the output results to improve one way of consuming the output, we might be limiting what can be done for other ways of consuming output. - @jdesrosiers

This is why we're discussing it. A lot of times, you just want the problem. Other times, you'd like the details. We need to isolate the rules around these scenarios. @ssilverman has some logic that he's implemented in his library already that may serve as a starting point.

karenetheridge

karenetheridge commented on Aug 26, 2020

@karenetheridge
Member

Additionally, keywordLocation gives you how you got to the keyword, including dereferences, which is generally more useful for debugging than just where the keyword is, which is what absoluteKeywordLocation gives you.

In my implementation, I omit absoluteKeywordLocation from the error/annotation output iff it is identical to keywordLocation (that is, if no $ref was traversed while getting to this location, AND there is no canonical URI provided for the schema) -- I agree that it is helpful to have both pieces of information.

jdesrosiers

jdesrosiers commented on Aug 27, 2020

@jdesrosiers
Member

If we trim the output results to improve one way of consuming the output, we might be limiting what can be done for other ways of consuming output.

This is why we're discussing it.

I think you're missing the point. I'm saying that the output format should be like an assembly language. It's a standardized, low level representation of the validation result. Starting from that assembly language, people can build whatever they need that is appropriate for their domain. Since it's impossible for us the predict every way someone might want to use the output format, I think this is the only feasible path.

You will always have a location relative to the root schema, which is what keywordLocation gives. However, you will only have an absolute location if the root schema defines an absolute URI for $id.

This sounds like a great argument for why an id should not be optional (not necessarily with $id). Otherwise, the output format has no way to properly identify the things it's describing. It requires a human to connect the dots and therefore limits the information I can get from the output programatically. The project I'm working on now would not be possible without an identifier for every schema. The output format should be optimized for machine readability, not human readability. The human readable version can be generated from the machine readable version. (In my implementation I designed the API such that I always have an identifier. The only way to use a schema is to use its identifier. So, I have no issue with leaving out keywordLocation because I always have an identifier.)

I'd love to see keywordLocation optional or even removed, but I'd be happy with changing it to require at least one of keywordLocation or absoluteKeywordLocation rather than require just keywordLocation.

gregsdennis

gregsdennis commented on Aug 27, 2020

@gregsdennis
MemberAuthor

The project I'm working on now would not be possible without an identifier for every schema

That your scenario requires an ID doesn't imply that every scenario will. Relative location has its importance in how you got to the keyword, not just where the keyword is. This is why keywordLocation is required.

I think you're missing the point.

I'm just saying that it needs to be discussed. I'm open to discussion, and some of your points are valid. Let's please keep the tone of this conversation open.

If you have structural suggestions, please share them. Also keep in mind the months of conversational background that led to the current state. There are requirements that we were dealing with that your use case may not have.

It's a standardized, low level representation of the validation result.

This is precisely the approach we took. What is now in the spec evolved from that.

karenetheridge

karenetheridge commented on Aug 28, 2020

@karenetheridge
Member

For example, in an error scenario, is it necessary to list passing subschemas of an allOf?

Why is allOf producing any annotations at all? the spec says: "[These applicator keywords] have no direct impact on annotation collection, although they enable the same annotation keyword to be applied to an instance location with different values. Annotation keywords define their own rules for combining such values." My interpretation of this is that allOf creates no annotations - it merely passes along the annotations from the (passing) subschemas that it contains.

PS. @gregdennis in your original problem description above, can you clarify which output format you are discussing?

jdesrosiers

jdesrosiers commented on Aug 29, 2020

@jdesrosiers
Member

That your scenario requires an ID doesn't imply that every scenario will.

The point is that the output format needs to be a common denominator for any application that someone might want to use it for. If my use case needs it, others will too. If the output format doesn't require it, my use case and many others won't be compatible with standard validators. keywordLocation, on the other hand, can be derived from the standard "detailed" output if everything has an identifier. It can be left out and there are no use cases that would be incompatible with a standard validator input. That's why it can optional.

Let's please keep the tone of this conversation open.

I'm not sure what you mean by that, but if any of what I've written has come across as rude or aggressive or inappropriate in any way, I apologize. It truly was not my intention. I know I can be suborn, but I only persist when I'm not getting my point across. If we understand each other and still don't agree on a resolution, I'm happy to drop it.

Also keep in mind the months of conversational background that led to the current state. There are requirements that we were dealing with that your use case may not have.

I followed along with the conversation and I have the utmost respect for all the effort everyone put in. At the time, I wasn't in a place to contribute to that particular conversation. I'm possibly the first person to try to do something with the output other than reporting validation results. I bring up my use case not to make this about me, but to show an example of where the current output format is insufficient. I think that's valuable feedback, but if it's not helpful, I'm happy to drop it.

gregsdennis

gregsdennis commented on Aug 30, 2020

@gregsdennis
MemberAuthor

Why is allOf producing any annotations at all? - @karenetheridge

This question isn't saying that allOf would be generating, or even carrying, annotations. It's merely asking whether the passing subschemas of a failed allOf need to be included in the output. In fact, it explicitly would not return any annotations (generated or passed along) because it failed.

I'm asking for an overhaul of the output. I think the structure of the node as well as the tree structure (of each format) of the output should be revisited. Question 1 looks at the node structure, whereas questions 2 & 3 look more at the tree structure.

32 remaining items

gregsdennis

gregsdennis commented on Oct 5, 2021

@gregsdennis
MemberAuthor

Working branch is draft-next-output.

jdesrosiers

jdesrosiers commented on Oct 5, 2021

@jdesrosiers
Member

Unless I'm missing some contradicting wording somewhere else, detailed is not a SHOULD by itself. Implementing flag or basic covers the requirement as well.

An implementation SHOULD provide at least one of the "flag", "basic", or "detailed" format and MAY provide the "verbose" format. If it provides one or more of the "detailed" or "verbose" formats, it MUST also provide the "flag" format.

gregsdennis

gregsdennis commented on Oct 5, 2021

@gregsdennis
MemberAuthor

🤔 yeah, I see what you're getting at. I'll do some digging to see where we decided that. Do you have an opinion on this requirement?

gregsdennis

gregsdennis commented on Oct 5, 2021

@gregsdennis
MemberAuthor
jdesrosiers

jdesrosiers commented on Oct 5, 2021

@jdesrosiers
Member

I don't have a strong opinion. I'd suggest that implementations MUST support flag and SHOULD support basic or detailed. Anything beyond that is just nice-to-have.

gregsdennis

gregsdennis commented on Oct 6, 2021

@gregsdennis
MemberAuthor

Given that I'm expanding detailed into detailed-schema (what we have currently) and detailed-instance (an instance-based format), would you say that implementations SHOULD provide just one of them? It seems odd to prefer one over another, whether by the spec or the implementation.

I do like requiring flag, though.

jdesrosiers

jdesrosiers commented on Oct 6, 2021

@jdesrosiers
Member

I agree that we probably shouldn't prefer one over the other. So, I'd suggest: MUST support flag and SHOULD support basic, detailed-schema, or detailed-instance. I would personally (not in the spec) encourage implementations to support all of the output formats because each is useful in different situations, but as far as minimum requirements, at least one format with error details should be sufficient.

karenetheridge

karenetheridge commented on Oct 6, 2021

@karenetheridge
Member

I haven't had a chance to go over all of this in detail and respond to all the points that I want to respond to -- but I wonder if it might be worthwhile moving to a community discussion so we can have threaded conversations about the various points? Many of them are not directly related, and some require a deep dive to understand motivations and to examine tradeoffs. (We might need an ADR for these, too!)

gregsdennis

gregsdennis commented on Oct 7, 2021

@gregsdennis
MemberAuthor

Sounds good. I'll copy this to a discussion, making edits for the conversation @jdesrosiers and I have had so far. We can create threads for each item.

gregsdennis

gregsdennis commented on Oct 7, 2021

@gregsdennis
MemberAuthor

We don't have GH discussions on in this repo so I opened one in the Community repo with the General topic: json-schema-org/community#63

gregsdennis

gregsdennis commented on Sep 26, 2022

@gregsdennis
MemberAuthor

I think this is basically resolved at this point. Closing for now.

moved this from Closed to Merged in Feature: Standardized Outputon Jul 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    Status

    Closed

    Relationships

    None yet

      Development

      No branches or pull requests

        Participants

        @karenetheridge@jdesrosiers@handrews@gregsdennis@ssilverman

        Issue actions

          Restructuring of output formatting · Issue #973 · json-schema-org/json-schema-spec