Description
NOTE: The goal of this is to find something resembling community consensus on a direction, or at least a notable lean in one direction or another from a large swath of the community.
We are not trying to discredit either idea, although we all tend to lurch in that direction from time to time, myself included. What we need is something that more people than the usual tiny number of participants would be willing to try out.
The discussion here can get very fast-paced. I am trying to periodically pause it to allow new folks, or people who don't have quite as much time, to catch up. Please feel free to comment requesting such a pause if you would like to contribute but are having trouble following it all.
This proposal attempts to create one or more general mechanisms, consistent with our overall approach, that will address the `"additionalProperties": false` use cases that do not work well with our existing modularity and re-use features.
TL;DR: We should look to the multi-level approach of URI Templates to solve complex problems that only a subset of users require. Implementations can choose what level of functionality to provide, and vocabularies can declare what level of support they require.
Existing implementations are generally Level 3 by the following list. Draft-07 introduces annotation collection rules, which are optional to implement. Implementations that do support annotation collection will be Level 4. This PR proposes Level 5 and Level 6, and also examines how competing proposals (schema transforms) impact Level 1.
EDIT: Deferred keywords are intended to make use of subschema results, and not results from parent or sibling schemas as the original write-up accidentally stated.
- Level 1: Basic media type functionality. Identify and link schemas, allow for basic modularity and re-use
- Level 2: Full structural access. Apply subschemas to the current location and combine the results, and/or apply subschemas to child locations
- Level 3: Assertions. Evaluate the assertions within a schema object without regard to the contents of any other schema object
- Level 4: Annotations. Collect all annotations that apply to a given location and combine the values as defined by each keyword
- Level 5: Deferred Assertions. Evaluate these assertions across all subschemas that apply to a given location
- Level 6: Deferred Annotations. Collect annotations and combine them with existing level 4 results as specified by the keyword. Deferred annotations may specify override rules for level 4 annotations collected from subschemas
A general JSON Schema processing model
With the keyword classifications developed during draft-07 (and a bit further in #512), we can lay out a conceptual processing model for a generic JSON Schema implementation.
NOTE 1: This does not mean that implementations need to actually organize their code in this manner. In particular, an implementation focusing on a specific vocabulary, e.g. validation, may want to optimize performance by taking a different approach and/or skipping steps that are not relevant to that vocabulary. A validator does not necessarily need to collect annotations. However, Hyper-Schema relies on the annotation collection step to build hyperlinks.
NOTE 2: Even if this approach is used, the steps are not executed linearly. `$ref` must be evaluated lazily, and it makes sense to alternate evaluation of assertions and applicability keywords to avoid evaluating subschemas that are irrelevant because of failed assertions.
1. Process schema linking and URI base keywords (`$schema`, `$id`, `$ref`, `definitions` as discussed in Move "definitions" to core (as "$defs"?) #512)
2. Process applicability keywords to determine the set of subschema objects relevant to the current instance location, and the logic rules for combining their assertion results
3. Process each subschema object's assertions, and remove any subschema objects with failed assertions from the set
4. Collect annotations from the remaining relevant subschemas
There is a basic example in one of the comments.
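As a rough sketch of those four steps (an invented schema, not the example from the comments):

```json
{
  "$id": "https://example.com/thing",
  "type": "object",
  "allOf": [
    { "$ref": "#/definitions/named" }
  ],
  "properties": {
    "count": { "type": "integer", "title": "How many" }
  },
  "definitions": {
    "named": {
      "properties": {
        "name": { "type": "string", "title": "Display name" }
      }
    }
  }
}
```

Step 1 resolves `$id` and the `$ref` into `definitions`; step 2 determines that the root schema object and `#/definitions/named` are both relevant to the instance root and applies their `properties` subschemas to the matching members; step 3 evaluates assertions such as `type`; step 4 collects the two `title` annotations at their respective instance locations.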
Note that (assuming #512 is accepted) step 1 is entirely determined by the Core spec, and (if #513 is accepted) step 2 is entirely determined by either the Core spec or its own separate spec.
Every JSON Schema implementation MUST handle step 1, and all known vocabularies also require step 2.
Steps 3 and 4 are where things get more interesting.
Step 3 is required to implement validation, and AFAIK most validators stop with step 3. Step 4 was formalized in draft-07, but previously there was no guidance on what to do with the annotation keywords (if anything).
Implementations that want to follow draft-07's guidance on the annotation keywords in the validation spec would need to add step 4 (however, this is optional in draft-07).
Strictly speaking, Hyper-Schema could implement steps 1, 2, and 4, as it does not define any schema assertions to evaluate in step 3. But as a practical matter, Hyper-Schema will almost always be implemented alongside validation, so a Hyper-Schema implementation will generally include all four steps.
So far, none of this involves changing anything. It's just laying out a way to think about the things that the spec already requires (or optionally recommends).
To solve the re-use problem, there are basically two approaches, both of which can be viewed as extensions to this processing model:
Deferred processing
To solve the re-use problems I propose defining a step 5:
- Process additional assertions (a.k.a. deferred assertions) that may make use of all subschemas that are relevant at the end of step 4. Note that we must already process all existing subschema keywords before we can provide the overall result for a schema object.
EDIT: The proposal was originally called `unknownProperties`, which produced confusion over the definition of "known", as can be seen in many later comments. This write-up has been updated to call the intended proposed behavior `unevaluatedProperties` instead, but that name does not otherwise appear until much later in this issue.
This easily allows a keyword to implement "ban unknown properties", among other things. We can define `unevaluatedProperties` to be a deferred assertion analogous to `additionalProperties`. Its value is a schema that is applied to all properties that are not addressed by the union, over all relevant schemas, of `properties` and `patternProperties`.
There is an example in the comments of how `unevaluatedProperties` (called `unknownProperties` in the example) would work. You should read the basic processing example in the previous comment first if you have not already.
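A minimal sketch of the intended behavior (the property names are invented for illustration):

```json
{
  "allOf": [
    { "properties": { "name": { "type": "string" } } },
    { "properties": { "email": { "type": "string" } } }
  ],
  "unevaluatedProperties": false
}
```

Because the keyword is deferred, it sees the union of `properties` across both `allOf` branches, so `name` and `email` pass while any other property fails. Placing `"additionalProperties": false` at the top level instead would reject every property, because `additionalProperties` only considers the `properties` and `patternProperties` within its own schema object.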
We could then easily define other similar keywords if we have use cases for them. One I can think of offhand would be `unevaluatedItems`, which would be analogous to `additionalItems` except that it would apply to elements beyond the longest `items` array across all relevant schemas. (I don't think anyone's ever asked for this, though.)
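A sketch of how that might look under the same deferred processing (the tuple contents are invented):

```json
{
  "allOf": [
    { "items": [ { "type": "string" } ] },
    { "items": [ { "type": "string" }, { "type": "integer" } ] }
  ],
  "unevaluatedItems": false
}
```

The longest `items` array among the relevant schemas covers indices 0 and 1, so `unevaluatedItems` would apply from index 2 onward, here forbidding any further elements.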
Deferred annotations would also be possible (which I suppose would be a step 6). Maybe something like `deferredDefault`, which would override any/all `default` values. And perhaps it would trigger an error if it appears in multiple relevant schemas for the same location. (I am totally making this behavior up as I write it; do not take this as a serious proposal.)
Deferred keywords require collecting annotation information from subschemas, and are therefore somewhat more costly to implement in terms of memory and processing time. Therefore, it would make sense to allow implementations to opt-in to this as an additional level of functionality.
Implementations could also provide both a performance mode (that goes only to level 3) and a full-feature mode (that implements all levels).
Schema transforms
In the interest of thoroughly covering all major re-use proposals, I'll note that solutions such as `$merge` or `$patch` would be added as a step 1.5, as they are processed after `$ref` but before all other keywords.

These keywords introduce schema transformations, which are not present in the above processing model. All of the other remaining proposals (`$spread`, `$use`, single-level overrides) can be described as limited versions of `$merge` and/or `$patch`, so they would fit in the same place. They all still introduce schema transformations, just with a smaller set of possible transformations.
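As a rough sketch of the transform style, following the general shape of the ajv-merge-patch `$merge` keyword (the target URI is hypothetical):

```json
{
  "$merge": {
    "source": { "$ref": "https://example.com/base.json" },
    "with": {
      "properties": {
        "extra": { "type": "boolean" }
      }
    }
  }
}
```

In the processing model above, an implementation would resolve the `$ref`, apply the merge to produce an ordinary schema object, and only then run steps 2 through 4 on the result.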
It's not clear to me how schema transform keywords work with the idea that `$ref` is delegation rather than inclusion (see #514 for a detailed discussion of these options and why it matters).

[EDIT: @epoberezkin has proposed a slightly different `$merge` syntax that avoids some of these problems, but I'm leaving this part as I originally wrote it to show the progress of the discussion]
If `$ref` is lazily replaced with its target (with `$id` and `$schema` adjusted accordingly), then transforms are straightforward. However, we currently forbid changing `$schema` while processing a schema document, and merging schema objects that use different `$schema` values seems impossible to do correctly in the general case.

Imposing a restriction of identical `$schema`s seems undesirable, given that a target schema maintainer could change their draft version independent of the source schema maintainer.
On the other hand, if `$ref` is delegation, it is handled by processing its target and "returning" the resulting assertion outcome (and optionally the collected annotations). This works fine with different `$schema` values, but it is not at all clear to me how schema transforms would apply.
@epoberezkin, I see that you have some notes on ajv-merge-patch about this but I'm having a bit of trouble following. Could you add how you think this should work here?
Conclusions
Based on my understanding so far, I prefer deferred keywords as a solution. It does not break any aspect of the existing model, it just extends it by applying the same concepts (assertions and annotations) at a different stage of processing (after collecting the relevant subschemas, instead of processing each relevant schema on its own). It also places a lot of flexibility in the hands of vocabulary designers, which is how JSON Schema is designed to work.
Schema transforms introduce an entirely new behavior to the processing model. They do not seem to work with how we are now conceptualizing `$ref`, although I may well be missing something there. However, if I'm right, that would be the most compelling argument against them.
I also still dislike having arbitrary editing/transform functionality as part of JSON Schema at all, but that's more of a philosophical objection, and I still haven't figured out how to articulate it in a convincing way.
I do think that this summarizes the two possible general approaches and defines them in a generic way. Once we choose which to include in our processing model, then picking the exact keywords and behaviors will be much less controversial. Hopefully :-)
Activity
erayd commented on Nov 28, 2017
I like deferred keywords as a concept, but they do not obviate my need for schema transforms.
My primary use-case for transforms is re-use of a schema fragment, with the ability to override some of the keywords. To take a trivial example, using `{"type": "integer", "maximum": 5}` but with a higher maximum is currently impossible, and requires a lot of copy / paste that reduces maintainability.
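For concreteness, a hedged sketch of that override in the ajv-merge-patch style (the `$ref` target is hypothetical and assumed to resolve to the fragment above):

```json
{
  "$merge": {
    "source": { "$ref": "defs.json#/definitions/smallInteger" },
    "with": { "maximum": 10 }
  }
}
```

The merged result would behave like `{"type": "integer", "maximum": 10}`. This cannot be expressed today without copying the base schema: combining via `allOf` can only tighten constraints, so `allOf` over `maximum: 5` and `maximum: 10` still enforces 5.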
erayd commented on Nov 28, 2017
Also for the record, I think that `$ref` should not be related in any way to schema transforms. It should be an immutable delegation (i.e. essentially a black-box function call).
handrews commented on Nov 28, 2017
@erayd I don't see that type of transform (arbitrarily slicing up and combining schema fragments) as within the scope of JSON Schema, although that view is certainly debatable.

Applying arbitrary transforms to JSON like that has nothing to do with JSON Schema. There is no awareness needed of the source or target being schemas or having particular keyword behavior. You're just manipulating JSON text at a raw level. That is why I see it as out of scope: there is simply nothing that requires it to be part of JSON Schema at all.
This is different from `$ref`, where it's simply not possible to have a usable system without some mechanism for modularity and cyclic references. The media type would be useless for any non-trivial purpose without it. However, it's always possible to refactor to avoid schema transforms, and frankly, if anyone submitted a PR on a schema doing "re-use" by what is essentially textual editing, I'd send it back.

Violating the opacity of `$ref` (an opacity which it seems at least you, @epoberezkin, and I all prefer) invites a huge class of unpredictable errors due to unexpected changes on the target side. Your result across a regular delegation-style `$ref` may change in ways that you can't see or predict, but you have established an interface contract: I am referring to whatever functionality is identified by the target URI.

With arbitrary editing, there is no contract. You're snipping a bit of JSON and doing something with it, which may or may not have anything to do with its original purpose in the target document. It still just makes no sense to me.
handrews commented on Nov 28, 2017
Hopefully others can talk about how their use cases line up with these proposals. The primary use cases that I remember (OO-style inheritance for strictly typed systems, and disambiguating multiple annotations) can both be solved by deferred keywords.
So I would be particularly interested in use cases that stop short of "I want to be able to do arbitrary transforms regardless of schema-ness" but are beyond what can be addressed with deferred keywords.
234 remaining items
handrews commented on Jan 5, 2018
@erayd
I'm getting there. Let me sort out with @epoberezkin what the principles that he's concerned about mean first so that I can either address those or change the proposal to reflect them if needed.
handrews commented on Jan 5, 2018
@epoberezkin
I'm not entirely sure that I follow this. The results of `allOf`, `anyOf`, `oneOf`, `not`, and `if`/`then`/`else` depend on their subschemas, which are independent of whether the instance is an object, array, or something else.

Let me see if I can state this a different way to ensure that I'm understanding: immediate values, in the sense of object property names and array indices, are available for static analysis (this is how `additionalProperties` and `additionalItems` work). However, the contents of subschemas, whether they are immediate values of keywords or are within an object or an array, are off-limits from static examination.

I'm saying "static examination" because we do agree (I think?) that the dynamic results of a subschema are a factor in the results of the keyword (that's kind of the whole point of subschemas, right?).
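A small sketch of that distinction, with invented names:

```json
{
  "properties": { "name": { "type": "string" } },
  "additionalProperties": false
}
```

Here `additionalProperties` may statically examine the immediate property names of the sibling `properties` value (just `name`), but never the contents of the `{"type": "string"}` subschema itself.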
I'm going to post later about the context-independence part; some good new information for me there that I need to think through. Thanks!
epoberezkin commented on Jan 5, 2018
By adding "direct", I mean that the keyword cannot depend on sub-sub-schemas (we don't have a precedent of it at the moment). EDIT: by “array items and property values” I meant that the subschemas of “allOf”, for example, are “array items” and the subschemas of “properties” are “property value” (of the value of “properties” keyword). Sorry for the confusion.
We talk about the same thing (I think :), I just wanted to clarify.
Correct, that is covered by 2.ii and 2.v.
Thank you
handrews commented on Jan 5, 2018
Awesome- I am on board with this.
Still working on writing up context-independence and addressing your concerns about depending on property/item values.
handrews commented on Jan 6, 2018
@epoberezkin regarding context-independence:
(I don't actually remember what was said about `$data` anymore, so I'm skipping that bit.)

I think the key thing here is that I'm making a distinction between using a subschema's results and examining a subschema's contents. The runtime result of evaluating a subschema of course depends on both the subschema's contents and the instance data. But the subschema contents and instance data remain opaque for the purposes of evaluating the parent schema object.

It may be possible to infer things about the subschema contents based on those results, and on the immediate property names / array indices that are fair game to examine, but that's not the same thing as actually looking at the subschema contents and instance data as a separate process from evaluating the subschema.
Does this make sense? If we're just depending on results, then both of these objects as subschemas, `{"patternProperties": {"^.*$": {"type": "string"}}}` and `{"additionalProperties": {"type": "string"}}`, have the same behavior: every object property is evaluated, and every object property's value must be a string.

In this view, we are not allowed to look into the subschema and see whether the result was achieved with `additionalProperties` or with a `patternProperties` that matches all possible names.

So I'm claiming that if we are only using results, then we are still context-independent. Does that make sense?
epoberezkin commented on Jan 6, 2018
Yes, as long as by "results" we mean "boolean result of assertions", i.e. valid or invalid.
The reason for that limitation is that if you arbitrarily define validation results, then they can include something which is either "context" (i.e. data values) or something that depends on the "context", so we are no longer context-independent.
The way annotation collection is defined makes this exactly the case: collected annotations are context-dependent.
EDIT: actually, annotations add parts of the schema itself, so making a keyword dependent on annotations (or something similar) violates shallowness, not context-independence.
epoberezkin commented on Jan 6, 2018
@handrews Another way to explain the problem I see with this proposal is related to the "applicability" concept and how this proposal changes it. Regardless of which section of the spec we put some keywords in, we have keywords that apply subschemas to either the child or current location of the data instance. They, by definition (section 3.1), belong to the applicability group.
Currently the locations in the data instance to which subschemas should be applied can be determined by:
(1). the keyword logic, as defined in the spec
(2). the keyword value, excluding subschemas
(3). sibling keywords values, excluding subschemas
(4). data structure, i.e. property names and indices of the data instance (but not values of properties and array items).
So applicability keywords have stronger context-independence than validation keywords (that need data values).
To illustrate:
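A minimal sketch of point (4), with an invented property name:

```json
{
  "properties": {
    "name": { "maxLength": 10 }
  }
}
```

Whether the `maxLength` subschema is applied to `/name` depends only on the presence of a `name` member in the instance, never on its value.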
The problem with the proposed keyword is that it makes applicability dependent on data values, as the data structure is no longer sufficient to determine whether the subschema of `unwhateverProperties` will be applied to some child instance.
Do you follow this argument or something needs clarifying? Do you see the problem?
I believe that we can and should solve the problems at hand (extending schemas, avoiding typos in property names, etc.) without changing how applicability works.
handrews commented on Jan 7, 2018
As with other controversial issues right now, I'm locking this rather than responding further until people who are currently ill and/or traveling can get back and catch up.
handrews commented on Jan 10, 2018
I have filed #530 for nailing down how annotations are collected, since it doesn't really have anything to do with this issue. We may end up using that process, but it's not at all specific to or driven by this concept.
@erayd you'll get your pseudocode there (whether it ends up being relevant here or not; if not, we'll work out whatever we need for this issue here).
handrews commented on Mar 2, 2018
I've been talking with the OpenAPI Technical Steering Committee, and one thing that's going on with their project is that the schema for version 3.0 of their specification (the schema for the OAS file, not the schemas used in the file) has been stalled for months.
The main reason it stalled is concern over the massive duplication required to get `"additionalProperties": false` in all of the situations where the OAS 3.0 specification forbids additional properties. Rather than using `allOf` and `oneOf` to avoid duplication, every variation on a schema must be entirely listed out so that `additionalProperties` can have the desired effect.

I have refactored the schema to use `allOf`, `oneOf`, and `unevaluatedProperties`, which not only dramatically shrank the file (1500 lines down to 845) but allowed a different approach consisting of a number of "mix-in" schemas grouping commonly used fields, which are then referenced throughout a set of object schemas.

See the refactored schema.
Note that there is a link to the original PR in the comment on the gist.
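A hedged sketch of that mix-in shape (schema names invented; this is not an excerpt from the gist):

```json
{
  "definitions": {
    "described": {
      "properties": { "description": { "type": "string" } }
    },
    "extensible": {
      "patternProperties": { "^x-": {} }
    },
    "tag": {
      "allOf": [
        { "$ref": "#/definitions/described" },
        { "$ref": "#/definitions/extensible" }
      ],
      "properties": { "name": { "type": "string" } },
      "required": ["name"],
      "unevaluatedProperties": false
    }
  }
}
```

Each object schema pulls in shared field groups by reference, and `unevaluatedProperties: false` still bans anything not claimed by any of them; `additionalProperties: false` cannot do that across `allOf` and `$ref`.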
I think that this is pretty compelling evidence in favor of `unevaluatedProperties`. None of the other solutions proposed here could accomplish this due to the heavy use of `oneOf`. OpenAPI is a well-established, widely used project, and they have found the current situation to be enough of a problem to leave the schema unfinished for months.

philsturgeon commented on Mar 2, 2018
This implementation of the OpenAPI spec in JSON Schema provides a powerful example of the problem at hand. Multiple different people have been discussing multiple different problems, asking for examples of each other's problems, and talking past each other; the thread became hard to follow because of that confusion.

Now that we have this very specific real-world example solving the problem we're trying to solve, other problems can be discussed, and potentially solved, in other issues.

I think we can move along now, closing this issue, happy and content that we have a great example. We have fundamentally solved a giant issue with JSON Schema, and that's fantastic news.
Relequestual commented on Mar 2, 2018
This is a clear solution to a real problem which has affected aspects of an important project. Let's fix this. Let's go with `unevaluatedProperties`!
Can you file a new issue specifically for that option? Then we can move directly to a pull request. I feel the general consensus is we need this.
Unrelated: hello from the UK! ❄️ ❄️ ❄️ ❄️