
Flexible re-use: deferred keywords vs schema transforms #515

Closed
@handrews

Description

NOTE: The goal of this is to find something resembling community consensus on a direction, or at least a notable lean in one direction or another from a large swath of the community.

We are not trying to discredit either idea, although we all tend to lurch in that direction from time to time, myself included. What we need is something that more people than the usual tiny number of participants would be willing to try out.

The discussion here can get very fast-paced. I am trying to periodically pause it to allow new folks, or people who don't have quite as much time, to catch up. Please feel free to comment requesting such a pause if you would like to contribute but are having trouble following it all.


This proposal attempts to create one or more general mechanisms, consistent with our overall approach, that will address the "additionalProperties": false use cases that do not work well with our existing modularity and re-use features.


TL;DR: We should look to the multi-level approach of URI Templates to solve complex problems that only a subset of users require. Implementations can choose what level of functionality to provide, and vocabularies can declare what level of support they require.

Existing implementations are generally Level 3 per the following list. Draft-07 introduces annotation collection rules, which are optional to implement; implementations that do support annotation collection will be Level 4. This issue proposes Level 5 and Level 6, and also examines how competing proposals (schema transforms) impact Level 1.

EDIT: Deferred keywords are intended to make use of subschema results, and not results from parent or sibling schemas as the original write-up accidentally stated.

  • Level 1: Basic media type functionality. Identify and link schemas, allow for basic modularity and re-use
  • Level 2: Full structural access. Apply subschemas to the current location and combine the results, and/or apply subschemas to child locations
  • Level 3: Assertions. Evaluate the assertions within a schema object without regard to the contents of any other schema object
  • Level 4: Annotations. Collect all annotations that apply to a given location and combine the values as defined by each keyword
  • Level 5: Deferred Assertions. Evaluate these assertions across all subschemas that apply to a given location
  • Level 6: Deferred Annotations. Collect annotations and combine them with existing level 4 results as specified by the keyword. Deferred annotations may specify override rules for level 4 annotations when it comes to level 4 annotations collected from subschemas

A general JSON Schema processing model

With the keyword classifications developed during draft-07 (and a bit further in #512), we can lay out a conceptual processing model for a generic JSON Schema implementation.

NOTE 1: This does not mean that implementations need to actually organize their code in this manner. In particular, an implementation focusing on a specific vocabulary, e.g. validation, may want to optimize performance by taking a different approach and/or skipping steps that are not relevant to that vocabulary. A validator does not necessarily need to collect annotations. However, Hyper-Schema relies on the annotation collection step to build hyperlinks.

NOTE 2: Even if this approach is used, the steps are not executed linearly. $ref must be evaluated lazily, and it makes sense to alternate evaluation of assertions and applicability keywords to avoid evaluating subschemas that are irrelevant because of failed assertions.

  1. Process schema linking and URI base keywords ($schema, $id, $ref, and definitions as discussed in #512)
  2. Process applicability keywords to determine the set of subschema objects relevant to the current instance location, and the logic rules for combining their assertion results
  3. Process each subschema object's assertions, and remove any subschema objects with failed assertion from the set
  4. Collect annotations from the remaining relevant subschemas
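As a rough illustration, the four steps above might be sketched like this (a toy Python model handling only allOf, properties, type, and title; all names and simplifications are mine, not from any spec or implementation):

```python
# Toy sketch of the generic processing model. Step 1 ($schema/$id/$ref
# resolution) is omitted entirely; only a tiny keyword subset is modeled.

TYPES = {"object": dict, "array": list, "string": str}

def evaluate(schema, instance):
    """Return (valid, annotations) for one schema object."""
    if isinstance(schema, bool):          # boolean schemas
        return schema, []

    valid, annotations = True, []

    # Step 2: applicability -- subschemas applied to the current location...
    for sub in schema.get("allOf", []):
        sub_valid, sub_ann = evaluate(sub, instance)
        valid = valid and sub_valid
        annotations += sub_ann            # input to step 4

    # ...and subschemas applied to child locations.
    if isinstance(instance, dict):
        for name, sub in schema.get("properties", {}).items():
            if name in instance:
                sub_valid, _ = evaluate(sub, instance[name])
                valid = valid and sub_valid

    # Step 3: assertions local to this schema object.
    if "type" in schema and not isinstance(instance, TYPES[schema["type"]]):
        valid = False

    # Step 4: collect annotations; a failed schema contributes none.
    if not valid:
        return False, []
    if "title" in schema:
        annotations.append(schema["title"])
    return True, annotations
```

A validator could stop after step 3; a Hyper-Schema-style implementation would extend step 4 to build hyperlinks from collected annotations.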

There is a basic example in one of the comments.

Note that (assuming #512 is accepted), step 1 is entirely determined by the Core spec, and (if #513 is accepted) step 2 is entirely determined by either the Core spec or its own separate spec.

Every JSON Schema implementation MUST handle step 1, and all known vocabularies also require step 2.

Steps 3 and 4 are where things get more interesting.

Step 3 is required to implement validation, and AFAIK most validators stop with step 3. Step 4 was formalized in draft-07, but previously there was no guidance on what to do with the annotation keywords (if anything).

Implementations that want to implement draft-07's guidance for the annotation keywords in the validation spec would need to add step 4 (however, this is optional in draft-07).

Strictly speaking, Hyper-Schema could implement steps 1, 2, and 4, as it does not define any schema assertions to evaluate in step 3. But as a practical matter, Hyper-Schema will almost always be implemented alongside validation, so a Hyper-Schema implementation will generally include all four steps.

So far, none of this involves changing anything. It's just laying out a way to think about the things that the spec already requires (or optionally recommends).

To solve the re-use problem, there are basically two approaches, both of which can be viewed as extensions to this processing model:

Deferred processing

To solve the re-use problems I propose defining a step 5:

  • Process additional assertions (a.k.a. deferred assertions) that may make use of all subschemas that are relevant at the end of step 4. Note that we must already process all existing subschema keywords before we can provide the overall result for a schema object.

EDIT: The proposal was originally called unknownProperties, which produced confusion over the definition of "known" as can be seen in many later comments. This write-up has been updated to call the intended proposed behavior unevaluatedProperties instead. But that name does not otherwise appear until much later in this issue.

This easily allows a keyword to implement "ban unknown properties", among other things. We can define unevaluatedProperties to be a deferred assertion analogous to additionalProperties. Its value is a schema that is applied to all properties that are not addressed by the union of properties and patternProperties across all relevant schemas.
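A minimal sketch of how such a deferred assertion could work, assuming the semantics described above (the helper names are invented, and only the false-valued form is handled):

```python
# Sketch of the deferred step for the proposed unevaluatedProperties.
# `relevant` is the set of schema objects still relevant to the current
# location after steps 2-3, i.e. after failed branches have been dropped.
import re

def evaluated_names(relevant, instance):
    """Union of property names addressed by properties/patternProperties
    across all relevant schemas -- the input to the deferred keyword."""
    names = set()
    for schema in relevant:
        names |= set(schema.get("properties", {})) & set(instance)
        for pattern in schema.get("patternProperties", {}):
            names |= {n for n in instance if re.search(pattern, n)}
    return names

def check_unevaluated(schema, relevant, instance):
    """Deferred assertion: applies only to leftover properties."""
    leftover = set(instance) - evaluated_names(relevant, instance)
    if schema.get("unevaluatedProperties") is False:
        return not leftover               # "ban unknown properties"
    return True                           # schema-valued form omitted
```

For example, with a base schema declaring "id" and an extension declaring "name", combined via allOf under a schema with unevaluatedProperties: false, an instance containing only "id" and "name" passes, while one with an extra property fails.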

There is an example of how unevaluatedProperties (called unknownProperties in the example) would work in the comments. You should read the basic processing example in the previous comment first if you have not already.

We could then easily define other similar keywords if we have use cases for them. One I can think of offhand would be unevaluatedItems, which would be analogous to additionalItems except that it would apply to elements beyond the longest items array across all relevant schemas. (I don't think anyone's ever asked for this, though.)

Deferred annotations would also be possible (which I suppose would be a step 6). Maybe something like deferredDefault, which would override any/all default values. And perhaps it would trigger an error if it appears in multiple relevant schemas for the same location. (I am totally making this behavior up as I write it, do not take this as a serious proposal).


Deferred keywords require collecting annotation information from subschemas, and are therefore somewhat more costly to implement in terms of memory and processing time. Therefore, it would make sense to allow implementations to opt-in to this as an additional level of functionality.

Implementations could also provide both a performance mode (that goes only to level 3) and a full-feature mode (that implements all levels).

Schema transforms

In the interest of thoroughly covering all major re-use proposals, I'll note that solutions such as $merge or $patch would be added as a step 1.5, as they are processed after $ref but before all other keywords.

These keywords introduce schema transformations, which are not present in the above processing model. All of the other remaining proposals ($spread, $use, single-level overrides) can be described as limited versions of $merge and/or $patch, so they would fit in the same place. They all still introduce schema transformations, just with a smaller set of possible transformations.


It's not clear to me how schema transform keywords work with the idea that $ref is delegation rather than inclusion (see #514 for a detailed discussion of these options and why it matters).

[EDIT: @epoberezkin has proposed a slightly different $merge syntax that avoids some of these problems, but I'm leaving this part as I originally wrote it to show the progress of the discussion]

If $ref is lazily replaced with its target (with $id and $schema adjusted accordingly), then transforms are straightforward. However, we currently forbid changing $schema while processing a schema document, and merging schema objects that use different $schema values seems impossible to do correctly in the general case.

Imposing a restriction of identical $schemas seems undesirable, given that a target schema maintainer could change their draft version independent of the source schema maintainer.

On the other hand, if $ref is delegation, it is handled by processing its target and "returning" the resulting assertion outcome (and optionally the collected annotations). This works fine with different $schema values, but it is not at all clear to me how schema transforms would apply.

@epoberezkin, I see that you have some notes on ajv-merge-patch about this but I'm having a bit of trouble following. Could you add how you think this should work here?

Conclusions

Based on my understanding so far, I prefer deferred keywords as a solution. It does not break any aspect of the existing model, it just extends it by applying the same concepts (assertions and annotations) at a different stage of processing (after collecting the relevant subschemas, instead of processing each relevant schema on its own). It also places a lot of flexibility in the hands of vocabulary designers, which is how JSON Schema is designed to work.

Schema transforms introduce an entirely new behavior to the processing model. They do not seem to work with how we are now conceptualizing $ref, although I may well be missing something there. However, if I'm right, that would be the most compelling argument against them.

I also still dislike having arbitrary editing/transform functionality as part of JSON Schema at all, but that's more of a philosophical thing, and I still haven't figured out how to articulate it in a convincing way.

I do think that this summarizes the two possible general approaches and defines them in a generic way. Once we choose which to include in our processing model, then picking the exact keywords and behaviors will be much less controversial. Hopefully :-)

Activity

added this to the draft-08 milestone on Nov 28, 2017
erayd commented on Nov 28, 2017

I like deferred keywords as a concept, but they do not obviate my need for schema transforms.

My primary use-case for transforms is re-use of a schema fragment, with the ability to override some of the keywords. To take a trivial example, using {"type": "integer", "maximum": 5} but with a higher maximum is currently impossible without a lot of copy / paste, which reduces maintainability.
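For illustration only, an ajv-merge-patch-style $merge applied to that example might look like the following sketch (the keyword shape and the shallow-merge rule here are assumptions for discussion, not part of any draft):

```python
# Illustrative shallow-merge transform resembling a "$merge" keyword.
# Neither the keyword nor this merge rule is part of any JSON Schema draft.

def apply_merge(schema, resolve):
    """Replace {"$merge": {"source": ..., "with": ...}} with the resolved
    source schema, shallowly overridden by the "with" keys."""
    if "$merge" not in schema:
        return schema
    source = resolve(schema["$merge"]["source"]["$ref"])
    return {**source, **schema["$merge"]["with"]}

# Hypothetical reference table standing in for real $ref resolution.
defs = {"#/definitions/small": {"type": "integer", "maximum": 5}}

merged = apply_merge(
    {"$merge": {"source": {"$ref": "#/definitions/small"},
                "with": {"maximum": 10}}},
    resolve=lambda ref: defs[ref],
)
# merged == {"type": "integer", "maximum": 10}
```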

erayd commented on Nov 28, 2017

Also for the record, I think that $ref should not be related in any way to schema transforms. It should be an immutable delegation (i.e. essentially a black-box function call).

handrews commented on Nov 28, 2017

@erayd I don't see that type of transform (arbitrarily slicing up and combining schema fragments) as within the scope of JSON Schema, although that view is certainly debatable.

Applying arbitrary transforms to JSON like that has nothing to do with JSON Schema. There is no awareness needed of the source or target being schemas or having particular keyword behavior. You're just manipulating JSON text at a raw level. That is why I see it as out of scope: there is simply nothing that requires it to be part of JSON Schema at all.

This is different from $ref where it's simply not possible to have a usable system without some mechanism for modularity and cyclic references. The media type would be useless for any non-trivial purpose without it. However, it's always possible to refactor to avoid schema transforms, and frankly if anyone submitted a PR on a schema doing "re-use" by what is essentially textual editing, I'd send it back.

The violation of the opacity of $ref (which it seems at least you, @epoberezkin, and I all prefer) means that it invites a huge class of unpredictable errors due to unexpected changes on the target side. Your result across a regular delegation-style $ref may change in ways that you can't see or predict, but you have established an interface contract: I am referring to whatever functionality is identified by the target URI.

With arbitrary editing, there is no contract. You're snipping a bit of JSON and doing something with it, which may or may not have anything to do with its original purpose in the target document. It still just makes no sense to me.

handrews commented on Nov 28, 2017

Hopefully others can talk about how their use cases line up with these proposals. The primary use cases that I remember (OO-style inheritance for strictly typed systems, and disambiguating multiple annotations) can both be solved by deferred keywords.

So I would be particularly interested in use cases that stop short of "I want to be able to do arbitrary transforms regardless of schema-ness" but are beyond what can be addressed with deferred keywords.

234 remaining items

handrews commented on Jan 5, 2018

@erayd

Am I correct that your proposal includes bubbling up property names as part of the result of evaluating those subschemas? I thought that's what you were proposing, but that's necessarily content-dependent, which would seem to contradict your point above.

> I have said this earlier, but it feels worth repeating - I think we really, really need to have some discussion around implementation concepts before trying to put anything in the spec. There still seems to be a fair bit of confusion around what is actually intended, and discussing implementation should hopefully get everyone on the same page pretty quickly - code (or pseudocode) is not woolly and open to interpretation the way English can be.

I'm getting there. Let me sort out with @epoberezkin what the principles that he's concerned about mean first so that I can either address those or change the proposal to reflect them if needed.

handrews commented on Jan 5, 2018

@epoberezkin

> 2.ii. The results of any direct subschemas of k1, direct includes array items and property values.

I'm not entirely sure that I follow this. The result of allOf, anyOf, oneOf, not, and if/then/else depends on their subschemas, which are independent of whether the instance is an object, array, or something else.

> 2.iv. The immediate values of k2, k3, ... kn, including the names of the properties in case their values are objects with multiple subschemas, but not the contents of any of their subschemas and excluding values of keywords that are schemas.

Let me see if I can state this a different way to ensure that I'm understanding: Immediate values in the sense of object property names and array indices are available for static analysis (this is how additionalProperties and additionalItems work). However, the contents of subschemas, whether they are immediate values of keywords or are within an object or an array, are off-limits from static examination.

I'm saying "static examination" because we do agree (I think?) that the dynamic results of a subschema are a factor in the results of the keyword (that's kind of the whole point of subschemas, right?).

I'm going to post later about the context-independence part, some good new information for me there that I need to think through- thanks!

epoberezkin commented on Jan 5, 2018

> I'm not entirely sure that I follow this. The result of allOf, anyOf, oneOf, not, and if/then/else depend on their subschemas.

By adding "direct", I mean that the keyword cannot depend on sub-sub-schemas (we don't have a precedent for that at the moment). EDIT: by "array items and property values" I meant that the subschemas of "allOf", for example, are "array items", and the subschemas of "properties" are "property values" (of the value of the "properties" keyword). Sorry for the confusion.

> Let me see if I can state this a different way to ensure that I'm understanding: Immediate values in the sense of object property names and array indices are available for static analysis (this is how additionalProperties and additionalItems work). However, the contents of subschemas, whether they are immediate values of keywords or are within an object or an array, are off-limits from static examination.

We're talking about the same thing (I think :); I just wanted to clarify.

> I'm saying "static examination" because we do agree (I think?) that the dynamic results of a subschema are a factor in the results of the keyword (that's kind of the whole point of subschemas, right?).

Correct, that is covered by 2.ii and 2.v.

> I'm going to post later about the context-independence part

Thank you

handrews commented on Jan 5, 2018

> By adding "direct", I mean that the keyword cannot depend on sub-sub-schemas

Awesome- I am on board with this.

Still working on writing up context-independence and addressing your concerns about depending on property/item values.

handrews commented on Jan 6, 2018

@epoberezkin regarding context-independence:

> By context-independence I meant independence of the schema applicability from the property/item values - currently the applicability only relies on property names and item indices (i.e. on data structure), but not on their values. With this proposal, the applicability starts depending on property and item values.

(I don't actually remember what was said about $data anymore so I'm skipping that bit)

I think the key thing here is that I'm making a distinction between:

  • A keyword's immediate non-subschema values (including the property names and array indices for objects or arrays of subschemas) [OK to use]
  • The contents of subschemas, as would be seen by static examination of the schema document(s) [Not OK to use]
  • The runtime result of evaluating subschemas [OK to use]

The runtime result of evaluating a subschema of course depends on both the subschema's contents and the instance data. But the subschema contents and instance data remain opaque for the purposes of evaluating the parent schema object.

It may be possible to infer things about the subschema contents based on those results, and on the immediate property names / array indices that are fair game to examine, but that's not the same thing as actually looking at the subschema contents and instance data as a separate process from evaluating the subschema.

Does this make sense? If we're just depending on results, then both of these objects as subschemas: {"patternProperties": {"^.*$": {"type": "string"}}} and {"additionalProperties": {"type": "string"}} have the same behavior (every object property is evaluated, and every object property's value must be a string).

In this view, we are not allowed to look into the subschema and see whether the result was achieved with additionalProperties or with a patternProperties that matches all possible names.

So I'm claiming that if we are only using results, then we are still context-independent. Does that make sense?
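As a toy demonstration of that claim (the helper name is invented; it only models which subschema gets applied to which child location, judged purely by results):

```python
# From a results-only perspective, the two phrasings are indistinguishable:
# they apply the same subschema to every property of the instance.
import re

def apply_to_children(schema, instance):
    """Map each property name to the subschema applied to it (toy model)."""
    applied = {}
    for name in instance:
        if name in schema.get("properties", {}):
            applied[name] = schema["properties"][name]
            continue
        for pattern, sub in schema.get("patternProperties", {}).items():
            if re.search(pattern, name):
                applied[name] = sub
        if name not in applied and "additionalProperties" in schema:
            applied[name] = schema["additionalProperties"]
    return applied

a = {"patternProperties": {"^.*$": {"type": "string"}}}
b = {"additionalProperties": {"type": "string"}}
```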

epoberezkin commented on Jan 6, 2018

> So I'm claiming that if we are only using results, then we are still context-independent. Does that make sense?

Yes, as long as by "results" we mean "boolean result of assertions", i.e. valid or invalid.

The reason for that limitation is that if you arbitrarily define validation results, then they can include something which is either "context" (i.e. data values) or something that depends on the "context", so we are no longer context-independent.

The way annotation collection is defined makes this exactly the case: collected annotations are context-dependent.

EDIT: actually, annotations add parts of the schema itself, so making a keyword dependent on annotations (or something similar) violates shallowness, not context-independence.

epoberezkin commented on Jan 6, 2018

@handrews Another way to explain the problem I see with this proposal is related to the "applicability" concept and how this proposal changes it. Regardless of which section of the spec we put some keywords in, we have keywords that apply subschemas to either child locations or the current location of the data instance. They, by definition (section 3.1), belong to the applicability group.

Currently the locations in the data instance to which subschemas should be applied can be determined by:
(1). the keyword logic, as defined in the spec
(2). the keyword value, excluding subschemas
(3). sibling keywords values, excluding subschemas
(4). data structure, i.e. property names and indices of the data instance (but not values of properties and array items).

So applicability keywords have stronger context-independence than validation keywords (that need data values).

To illustrate:

  • allOf, anyOf, oneOf, not, if/then/else - need only (1); they apply all their subschemas to the current data instance
  • properties - needs (1), (2), and (4); it applies subschemas to corresponding child instances
  • patternProperties - needs (1), (2), and (4); it applies subschemas to child instances whose property names match the patterns
  • additionalProperties - needs (1), (2), (3), and (4)
    etc.

The problem with the proposed keyword is that it makes applicability dependent on data values, as data structure is no longer sufficient to determine whether the subschema of unwhateverProperties will be applied to some child instance.

Do you follow this argument or something needs clarifying? Do you see the problem?

I believe that we can and should solve the problems at hand (extending schemas, avoiding typos in property names, etc.) without changing how applicability works.

handrews commented on Jan 7, 2018

As with other controversial issues right now, I'm locking this rather than responding further until people who are currently ill and/or traveling can get back and catch up.

locked as too heated and limited conversation to collaborators on Jan 7, 2018
handrews commented on Jan 10, 2018

I have filed #530 for nailing down how annotations are collected, since it doesn't really have anything to do with this issue. We may end up using that process, but it's not at all specific to or driven by this concept.

@erayd you'll get your pseudocode there (whether it ends up being relevant here or not- if not, we'll work out whatever we need for this issue here).

handrews commented on Mar 2, 2018

I've been talking with the OpenAPI Technical Steering Committee, and one thing that's going on with their project is that the schema for version 3.0 of their specification (the schema for the OAS file, not the schemas used in the file) has been stalled for months.

The main reason it stalled is concern over the massive duplication required to get "additionalProperties": false in all of the situations where the OAS 3.0 specification forbids additional properties. Rather than using allOf and oneOf to avoid duplication, every variation on a schema must be entirely listed out so that additionalProperties can have the desired effect.

I have refactored the schema to use allOf, oneOf, and unevaluatedProperties, which not only dramatically shrank the file (1500 lines down to 845) but allowed a different approach consisting of a number of "mix-in" schemas grouping commonly used fields, which are then referenced throughout a set of object schemas.
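Schematically, the mix-in pattern might look like this (hypothetical definitions for illustration; this is not the actual refactored OAS schema):

```json
{
  "definitions": {
    "Described": {
      "properties": {
        "description": { "type": "string" }
      }
    },
    "Extensible": {
      "patternProperties": {
        "^x-": {}
      }
    },
    "Tag": {
      "allOf": [
        { "$ref": "#/definitions/Described" },
        { "$ref": "#/definitions/Extensible" }
      ],
      "properties": {
        "name": { "type": "string" }
      },
      "required": ["name"],
      "unevaluatedProperties": false
    }
  }
}
```

With plain "additionalProperties": false, the description and "x-" properties contributed through the allOf mix-ins would be rejected, because additionalProperties cannot see into referenced schemas; unevaluatedProperties can.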

See the refactored schema

Note that there is a link to the original PR in the comment on the gist.

I think that this is pretty compelling evidence in favor of unevaluatedProperties. None of the other solutions proposed here could accomplish this due to the heavy use of oneOf. OpenAPI is a well-established, widely used project, and they have found the current situation to be enough of a problem to leave the schema unfinished for months.

philsturgeon commented on Mar 2, 2018

This implementation of the OpenAPI spec in JSON Schema provides a powerful example of the problem at hand. Multiple different people have been discussing multiple different problems, asking for examples of the other problems, and talking past each other; this thread got to an unreadable point due to that confusion.

Now that we have this very specific real-world example solving the problem we're trying to solve, other problems can be discussed in other issues and potentially solved in other threads.

I think we can move along now, closing this issue, happy and content that we have a great example. We have fundamentally solved a giant issue with JSON Schema, and that's fantastic news.

Relequestual commented on Mar 2, 2018

This is a clear solution to a real problem which has affected aspects of an important project. Let's fix this. Let's go with unevaluatedProperties!

Can you file a new issue specifically for that option? Then we can move directly to a pull request. I feel the general consensus is that we need this.

Unrelated: hello from the UK! ❄️ ❄️ ❄️ ❄️
