This repository was archived by the owner on Nov 3, 2023. It is now read-only.

Cleanup and fixes for structuring #162

Merged
merged 1 commit into from
Jun 22, 2021
735 changes: 445 additions & 290 deletions source/structuring.rst
@@ -14,402 +14,557 @@ functions is better than copying-and-pasting duplicate bits of code
everywhere they are used. Likewise in JSON Schema, for anything but
the most trivial schema, it's really useful to structure the schema
into parts that can be reused in a number of places. This chapter
will present some practical examples that use the tools available for
reusing and structuring schemas.
will present the tools available for reusing and structuring schemas
as well as some practical examples that use those tools.

Reuse
-----
.. index::
single: schema identification
single: structuring; schema identification

For this example, let's say we want to define a customer record, where
each customer may have both a shipping and a billing address.
Addresses are always the same---they have a street address, city and
state---so we don't want to duplicate that part of the schema
everywhere we want to store an address. Not only would that make the
schema more verbose, but it makes updating it in the future more
difficult. If our imaginary company were to start doing international
business in the future and we wanted to add a country field to all the
addresses, it would be better to do this in a single place rather than
everywhere that addresses are used.
.. _schema-identification:

So let's start with the schema that defines an address::
Schema Identification
---------------------

{
"type": "object",
"properties": {
"street_address": { "type": "string" },
"city": { "type": "string" },
"state": { "type": "string" }
},
"required": ["street_address", "city", "state"]
}
Like any other code, schemas are easier to maintain if they can be
broken down into logical units that reference each other as necessary.
In order to reference a schema, we need a way to identify a schema.
Schema documents are identified by non-relative URIs.

Since we are going to reuse this schema, it is customary (but not
required) to put it in the parent schema under a key called
``definitions``::
Schema documents are not required to have an identifier, but
you will need one if you want to reference one schema from
another. In this document, we will refer to schemas with no
identifier as "anonymous schemas".

{
"definitions": {
"address": {
"type": "object",
"properties": {
"street_address": { "type": "string" },
"city": { "type": "string" },
"state": { "type": "string" }
},
"required": ["street_address", "city", "state"]
}
}
}
In the following sections we will see how the "identifier" for a
schema is determined.

.. note::
URI terminology can sometimes be unintuitive. In this document, the
following definitions are used.

- **URI** `[1]
<https://datatracker.ietf.org/doc/html/rfc3986#section-3>`__ or
**non-relative URI**: A full URI containing a scheme (``https``).
It may contain a URI fragment (``#foo``). Sometimes this document
will use "non-relative URI" to make it extra clear that relative
URIs are not allowed.
- **relative reference** `[2]
<https://datatracker.ietf.org/doc/html/rfc3986#section-4.2>`__: A
partial URI that does not contain a scheme (``https``). It may
contain a fragment (``#foo``).
- **URI-reference** `[3]
<https://datatracker.ietf.org/doc/html/rfc3986#section-4.1>`__: A
relative reference or non-relative URI. It may contain a URI
fragment (``#foo``).
- **absolute URI** `[4]
<https://datatracker.ietf.org/doc/html/rfc3986#section-4.3>`__: A
full URI containing a scheme (``https``) but not a URI fragment
(``#foo``).
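The four terms above can be told apart mechanically. The following is a minimal sketch using Python's standard ``urllib.parse``; the function name ``classify`` and the labels it returns are illustrative assumptions, not part of JSON Schema or of any validator.

```python
from urllib.parse import urlsplit


def classify(uri_reference: str) -> str:
    """Label a URI-reference using the terminology defined in this note.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(uri_reference)
    if not parts.scheme:
        # No scheme: this is a relative reference (with or without a fragment).
        return "relative reference"
    if "#" in uri_reference:
        # Scheme plus fragment: a non-relative URI that is not an absolute URI.
        return "non-relative URI with a fragment"
    # Scheme and no fragment: an absolute URI (also a non-relative URI).
    return "absolute URI"


print(classify("https://example.com/schemas/address"))
print(classify("https://example.com/schemas/address#foo"))
print(classify("/schemas/address"))
```

Every value the function accepts is a URI-reference, so that term needs no branch of its own.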

.. note::
Even though schemas are identified by URIs, those identifiers are
not necessarily network-addressable. They are just identifiers.
Generally, implementations don't make HTTP requests (``https://``)
or read from the file system (``file://``) to fetch schemas.
Instead, they provide a way to load schemas into an internal schema
database. When a schema is referenced by its URI identifier, the
schema is retrieved from the internal schema database.

.. index::
single: $ref
single: JSON Pointer
single: structuring; subschema identification; JSON Pointer

We can then refer to this schema snippet from elsewhere using the
``$ref`` keyword. The easiest way to describe ``$ref`` is that it
gets logically replaced with the thing that it points to. So, to
refer to the above, we would include::
.. _json-pointer:

{ "$ref": "#/definitions/address" }
JSON Pointer
~~~~~~~~~~~~

This can be used anywhere a schema is expected. You will always use ``$ref`` as
the only key in an object: any other keys you put there will be ignored by the
validator.
In addition to identifying a schema document, you can also identify
subschemas. The most common way to do that is to use a `JSON Pointer
<https://tools.ietf.org/html/rfc6901>`__ in the URI fragment that
points to the subschema.

The value of ``$ref`` is a URI-reference, and the part after ``#`` sign (the
"fragment" or "named anchor") is in a format called `JSON Pointer
<https://tools.ietf.org/html/rfc6901>`__.
A JSON Pointer describes a slash-separated path to traverse the keys
in the objects in the document. Therefore,
``/properties/street_address`` means:

.. note::
JSON Pointer aims to serve the same purpose as `XPath
<http://www.w3.org/TR/xpath/>`_ from the XML world, but it is much
simpler.
1) find the value of the key ``properties``
2) within that object, find the value of the key ``street_address``

The URI
``https://example.com/schemas/address#/properties/street_address``
identifies the highlighted subschema in the following schema.

.. schema_example::

{
"$id": "https://example.com/schemas/address",

"type": "object",
"properties": {
"street_address":
* { "type": "string" },
"city": { "type": "string" },
"state": { "type": "string" }
},
"required": ["street_address", "city", "state"]
}
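The traversal described above can be sketched in a few lines of Python. This is an illustrative reading of RFC 6901, not the code of any particular validator; ``resolve_pointer`` is a hypothetical name, and percent-decoding of the URI fragment is left out for brevity.

```python
def resolve_pointer(document, pointer: str):
    """Follow an RFC 6901 JSON Pointer through nested dicts and lists."""
    if pointer == "":
        # The empty pointer identifies the whole document.
        return document
    if not pointer.startswith("/"):
        raise ValueError("a non-empty JSON Pointer must start with '/'")
    current = document
    for token in pointer[1:].split("/"):
        # RFC 6901 escapes: "~1" stands for "/", "~0" stands for "~".
        token = token.replace("~1", "/").replace("~0", "~")
        if isinstance(current, list):
            current = current[int(token)]
        else:
            current = current[token]
    return current


address_schema = {
    "$id": "https://example.com/schemas/address",
    "type": "object",
    "properties": {
        "street_address": {"type": "string"},
        "city": {"type": "string"},
        "state": {"type": "string"},
    },
    "required": ["street_address", "city", "state"],
}

street = resolve_pointer(address_schema, "/properties/street_address")
print(street)
```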

.. index::
single: $id
single: named anchors
single: structuring; subschema identification; $id

If you're using a definition from the same document, the ``$ref`` value begins
with the pound symbol (``#``). Following that, the slash-separated items traverse
the keys in the objects in the document. Therefore, in our example
``"#/definitions/address"`` means:
.. _anchor:

1) go to the root of the document
2) find the value of the key ``"definitions"``
3) within that object, find the value of the key ``"address"``
Named Anchors
~~~~~~~~~~~~~

``$ref`` can resolve to a URI that references another file, so if you prefer to
include your definitions in separate files, you can also do that. For
example::
A less common way to identify a subschema is to create a named anchor
in the schema using the ``$id`` keyword and using that name in the URI
fragment. When the ``$id`` keyword contains a URI fragment, the
fragment defines a named anchor using the value of the fragment. Named
anchors must start with a letter followed by any number of letters,
digits, ``-``, ``_``, ``:``, or ``.``.

{ "$ref": "definitions.json#/address" }
.. draft_specific::

--Draft 4
In Draft 4, ``$id`` is just ``id`` (without the dollar sign).

would load the address schema from another file residing alongside
this one.
.. note::
If a named anchor is defined that doesn't follow these naming
rules, then behavior is undefined. Your anchors might work in some
implementations, but not others.
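One way to catch problematic anchor names early is to check them against the naming rule stated above. The regular expression below is a direct transcription of that rule; the helper name ``is_valid_anchor`` is an assumption made for this sketch.

```python
import re

# A letter, followed by any number of letters, digits, "-", "_", ":" or ".".
ANCHOR_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_:.\-]*")


def is_valid_anchor(name: str) -> bool:
    """Return True if ``name`` follows the named-anchor naming rule."""
    return ANCHOR_NAME.fullmatch(name) is not None


for candidate in ["street_address", "1st_street", "a/b", ""]:
    print(repr(candidate), is_valid_anchor(candidate))
```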

Now let's put this together and use our address schema to create a
schema for a customer:
The URI ``https://example.com/schemas/address#street_address``
identifies the subschema on the highlighted part of the following
schema.

.. schema_example::

{
"$schema": "http://json-schema.org/draft-07/schema#",

"definitions": {
"address": {
"type": "object",
"properties": {
"street_address": { "type": "string" },
"city": { "type": "string" },
"state": { "type": "string" }
},
"required": ["street_address", "city", "state"]
}
},
"$id": "https://example.com/schemas/address",

"type": "object",

"properties": {
"billing_address": { "$ref": "#/definitions/address" },
"shipping_address": { "$ref": "#/definitions/address" }
}
}
--
{
"shipping_address": {
"street_address": "1600 Pennsylvania Avenue NW",
"city": "Washington",
"state": "DC"
"street_address":
* {
* "$id": "#street_address",
* "type": "string"
* },
"city": { "type": "string" },
"state": { "type": "string" }
},
"billing_address": {
"street_address": "1st Street SE",
"city": "Washington",
"state": "DC"
}
"required": ["street_address", "city", "state"]
}

.. note::
JSON Schema doesn't define how ``$id`` should be interpreted when
it contains both fragment and non-fragment URI parts. Therefore,
when setting a named anchor, you should not use non-fragment URI
parts in the URI-reference.

Even though the value of a ``$ref`` is a URI-reference, it is not a network
locator, only an identifier. This means that the schema doesn't need to be
accessible at the resolved URI, but it may be. It is basically up to the
validator implementation how external schema URIs will be handled, but one
should not assume the validator will fetch network resources indicated in
``$ref`` values.
.. index::
single: base URI
single: structuring; base URI

Recursion
`````````
.. _base-uri:

``$ref`` elements may be used to create recursive schemas that refer to themselves.
For example, you might have a ``person`` schema that has an array of ``children``, each of which are also ``person`` instances.
Base URI
--------

.. schema_example::
Using non-relative URIs can be cumbersome, so any URIs used in
JSON Schema can be URI-references that resolve against the schema's
base URI resulting in a non-relative URI. This section describes how a
schema's base URI is determined.

{
"$schema": "http://json-schema.org/draft-07/schema#",
.. note::
Base URI determination and relative reference resolution are defined
by `RFC 3986
<https://datatracker.ietf.org/doc/html/rfc3986#section-5>`__. If
you are familiar with how this works in HTML, this section should
feel very familiar.

"definitions": {
"person": {
"type": "object",
"properties": {
"name": { "type": "string" },
"children": {
"type": "array",
* "items": { "$ref": "#/definitions/person" },
"default": []
}
}
}
},
.. index::
single: retrieval URI
single: structuring; base URI; retrieval URI

"type": "object",
.. _retrieval-uri:

"properties": {
"person": { "$ref": "#/definitions/person" }
}
}
--
// A snippet of the British royal family tree
{
"person": {
"name": "Elizabeth",
"children": [
{
"name": "Charles",
"children": [
{
"name": "William",
"children": [
{ "name": "George" },
{ "name": "Charlotte" }
]
},
{
"name": "Harry"
}
]
}
]
}
}
Retrieval URI
~~~~~~~~~~~~~

The URI used to fetch a schema is known as the "retrieval URI". It's
often possible to pass an anonymous schema to an implementation, in
which case that schema would have no retrieval URI.

Above, we created a schema that refers to another part of itself, effectively
creating a "loop" in the validator, which is both allowed and useful. Note,
however, that a loop of ``$ref`` schemas referring to one another could cause an
infinite loop in the resolver, and is explicitly disallowed.
Let's assume a schema is referenced using the URI
``https://example.com/schemas/address`` and the following schema is
retrieved.

.. schema_example::

{
"definitions": {
"alice": {
"anyOf": [
{ "$ref": "#/definitions/bob" }
]
},
"bob": {
"anyOf": [
{ "$ref": "#/definitions/alice" }
]
}
}
"type": "object",
"properties": {
"street_address": { "type": "string" },
"city": { "type": "string" },
"state": { "type": "string" }
},
"required": ["street_address", "city", "state"]
}

The base URI for this schema is the same as the retrieval URI,
``https://example.com/schemas/address``.

.. index::
single: $id
single: $id
single: canonical URI
single: structuring; base URI; $id

.. _id:

The $id property
----------------
$id
~~~

The ``$id`` property is a URI-reference that serves two purposes:
You can set the base URI using the ``$id`` keyword. The value of
``$id`` is a URI-reference that resolves against the `retrieval-uri`.
The resulting URI is the base URI for the schema.

- It declares a unique identifier for the schema.
.. draft_specific::

--Draft 4
In Draft 4, ``$id`` is just ``id`` (without the dollar sign).

- It declares a base URI against which ``$ref`` URI-references are resolved.
.. note::
This is analogous to the ``<base>`` `tag in HTML
<https://html.spec.whatwg.org/multipage/semantics.html#the-base-element>`__.

It is best practice that every top-level schema should set ``$id`` to an
absolute-URI (not a relative reference), with a domain that you control. For
example, if you own the ``foo.bar`` domain, and you had a schema for addresses,
you may set its ``$id`` as follows:
Let's assume the URIs ``https://example.com/schema/address`` and
``https://example.com/schema/billing-address`` both identify the
following schema.

.. schema_example::

{ "$id": "http://foo.bar/schemas/address.json" }
{
"$id": "/schemas/address",

This provides a unique identifier for the schema, as well as, in most
cases, indicating where it may be downloaded.
"type": "object",
"properties": {
"street_address": { "type": "string" },
"city": { "type": "string" },
"state": { "type": "string" }
},
"required": ["street_address", "city", "state"]
}

But be aware of the second purpose of the ``$id`` property: that it
declares a base URI for ``$ref`` URI-references elsewhere in the file.
For example, if you had:
No matter which of the two URIs is used to retrieve this schema, the
base URI will be ``https://example.com/schemas/address``, which is the
result of the ``$id`` URI-reference resolving against the
`retrieval-uri`.

.. schema_example::
However, using a relative reference when setting a base URI can be
problematic. For example, we couldn't use this schema as an
anonymous schema because there would be no `retrieval-uri` and you
can't resolve a relative reference against nothing. For this and other
reasons, it's recommended that you always use an absolute URI when
declaring a base URI with ``$id``.
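The resolution described in this section follows RFC 3986, which Python's standard ``urllib.parse.urljoin`` also implements. The sketch below is only an illustration of the rule; ``base_uri`` is a hypothetical helper, and an empty string stands in for "no retrieval URI".

```python
from urllib.parse import urljoin, urlsplit


def base_uri(retrieval_uri: str, schema: dict) -> str:
    """Resolve ``$id`` (if present) against the retrieval URI."""
    return urljoin(retrieval_uri, schema.get("$id", ""))


schema = {"$id": "/schemas/address", "type": "object"}

# Both retrieval URIs lead to the same base URI.
from_address = base_uri("https://example.com/schema/address", schema)
from_billing = base_uri("https://example.com/schema/billing-address", schema)
print(from_address)
print(from_billing)

# Used as an anonymous schema, the relative ``$id`` has nothing to resolve
# against, so the result still lacks a scheme.
anonymous = base_uri("", schema)
print(anonymous, "has scheme:", bool(urlsplit(anonymous).scheme))
```

With an absolute URI in ``$id``, the result is the same no matter what the retrieval URI is, which is why that form is recommended.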

{ "$ref": "person.json" }
The base URI of the following schema will always be
``https://example.com/schemas/address`` no matter what the
`retrieval-uri` was or if it's used as an anonymous schema.

in the same file, a JSON schema validation library that supported network
fetching may fetch ``person.json`` from
``http://foo.bar/schemas/person.json``, even if ``address.json`` was loaded from
somewhere else, such as the local filesystem. The drafts do not define this
area of behaviour very clearly, and validator implementations may vary in
exactly how they try to locate the referenced schema.
.. schema_example::

{
"$id": "https://example.com/schemas/address",

|draft6|
"type": "object",
"properties": {
"street_address": { "type": "string" },
"city": { "type": "string" },
"state": { "type": "string" }
},
"required": ["street_address", "city", "state"]
}

.. draft_specific::
.. note::
The behavior when setting a base URI that contains a URI fragment
is undefined and should not be used because implementations may
treat them differently.

--Draft 4
In Draft 4, ``$id`` is just ``id`` (without the dollar sign).
.. index::
single: $ref
single: structuring; $ref

The ``$id`` property should never be the empty string or an empty fragment
(``#``), since that doesn't really make sense.
.. _ref:

Using $id with $ref
```````````````````
$ref
----

``$id`` also provides a way to refer to subschema without using JSON Pointer.
This means you can refer to them by a unique name, rather than by where they
appear in the JSON tree.
A schema can reference another schema using the ``$ref`` keyword. The
value of ``$ref`` is a URI-reference that is resolved against the
schema's `base-uri`. When evaluating a schema, an implementation uses
the resolved identifier to retrieve the referenced schema and
evaluation is continued from the retrieved schema.

Reusing the address example above, we can add an ``$id`` property to the
address schema, and refer to it by that instead.
``$ref`` can be used anywhere a schema is expected. When an object
contains a ``$ref`` property, the object is considered a reference,
not a schema. Therefore, any other properties you put there will not
be treated as JSON Schema keywords and will be ignored by the
validator.

For this example, let's say we want to define a customer record, where
each customer may have both a shipping and a billing address.
Addresses are always the same---they have a street address, city and
state---so we don't want to duplicate that part of the schema
everywhere we want to store an address. Not only would that make the
schema more verbose, but it makes updating it in the future more
difficult. If our imaginary company were to start doing international
business in the future and we wanted to add a country field to all the
addresses, it would be better to do this in a single place rather than
everywhere that addresses are used.

.. schema_example::

{
"$schema": "http://json-schema.org/draft-07/schema#",
"$id": "https://example.com/schemas/customer",

"definitions": {
"address": {
*"$id": "#address",
"type": "object",
"properties": {
"street_address": { "type": "string" },
"city": { "type": "string" },
"state": { "type": "string" }
},
"required": ["street_address", "city", "state"]
}
"type": "object",
"properties": {
"first_name": { "type": "string" },
"last_name": { "type": "string" },
"shipping_address": { "$ref": "/schemas/address" },
"billing_address": { "$ref": "/schemas/address" }
},
"required": ["first_name", "last_name", "shipping_address", "billing_address"]
}

"type": "object",
The URI-references in ``$ref`` resolve against the schema's `base-uri`
(``https://example.com/schemas/customer``) which results in
``https://example.com/schemas/address``. The implementation retrieves
that schema and uses it to evaluate the "shipping_address" and
"billing_address" properties.

.. note::
When using ``$ref`` in an anonymous schema, relative references may
not be resolvable. Let's assume this example is used as an
anonymous schema.

.. schema_example::

{
"type": "object",
"properties": {
"first_name": { "type": "string" },
"last_name": { "type": "string" },
"shipping_address": { "$ref": "https://example.com/schemas/address" },
"billing_address": { "$ref": "/schemas/address" }
},
"required": ["first_name", "last_name", "shipping_address", "billing_address"]
}

The ``$ref`` at ``/properties/shipping_address`` can resolve just
fine without a non-relative base URI to resolve against, but the
``$ref`` at ``/properties/billing_address`` can't resolve to a
non-relative URI and therefore can't be used to retrieve the
address schema.
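To make the lookup concrete, here is a minimal sketch of resolving a ``$ref`` and retrieving the target from an in-memory schema database. The names ``schema_database`` and ``retrieve`` are assumptions for illustration; real implementations expose their own APIs for loading schemas, and this sketch ignores URI fragments.

```python
from urllib.parse import urljoin, urlsplit

address_schema = {
    "type": "object",
    "properties": {
        "street_address": {"type": "string"},
        "city": {"type": "string"},
        "state": {"type": "string"},
    },
    "required": ["street_address", "city", "state"],
}

# Hypothetical internal schema database, keyed by schema identifier.
schema_database = {"https://example.com/schemas/address": address_schema}


def retrieve(ref: str, base_uri: str, database: dict) -> dict:
    """Resolve ``ref`` against ``base_uri`` and look the result up."""
    target = urljoin(base_uri, ref)
    if not urlsplit(target).scheme:
        # Still a relative reference, so it cannot identify a schema.
        raise LookupError(f"cannot resolve {ref!r} without a base URI")
    return database[target]


# With a base URI, the relative reference resolves and is found.
found = retrieve(
    "/schemas/address", "https://example.com/schemas/customer", schema_database
)
print(found is address_schema)
```

Passing an empty base URI models the anonymous schema case: a non-relative ``$ref`` is still found, while ``/schemas/address`` raises ``LookupError``.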

.. index::
single: definitions
single: structuring; definitions

.. _definitions:

definitions
-----------

Sometimes we have small subschemas that are only intended for use in
the current schema and it doesn't make sense to define them as
separate schemas. Although we can identify any subschema using JSON
Pointers or named anchors, the ``definitions`` keyword gives us a
standardized place to keep subschemas intended for reuse in the
current schema document.

Let's extend the previous customer schema example to use a common
schema for the name properties. It doesn't make sense to define a new
schema for this and it will only be used in this schema, so it's a
good candidate for using ``definitions``.

.. schema_example::

{
"$id": "https://example.com/schemas/customer",

"type": "object",
"properties": {
*"billing_address": { "$ref": "#address" },
*"shipping_address": { "$ref": "#address" }
"first_name": { "$ref": "#/definitions/name" },
"last_name": { "$ref": "#/definitions/name" },
"shipping_address": { "$ref": "/schemas/address" },
"billing_address": { "$ref": "/schemas/address" }
},
"required": ["first_name", "last_name", "shipping_address", "billing_address"],

"definitions": {
"name": { "type": "string" }
}
}

``$ref`` isn't just good for avoiding duplication. It can also be
useful for writing schemas that are easier to read and maintain.
Complex parts of the schema can be defined in ``definitions`` with
descriptive names and referenced where they're needed. This allows
readers of the schema to more quickly and easily understand the schema
at a high level before diving into the more complex parts.

.. note::
It's possible to reference an external subschema, but generally you
want to limit a ``$ref`` to referencing either an external schema
or an internal subschema defined in ``definitions``.

This functionality isn't currently supported by the Python ``jsonschema``
library.
.. index::
single: recursion
single: $ref
single: structuring; recursion; $ref

Extending
---------
.. _recursion:

The power of ``$ref`` really shines when it is used with the
combining keywords ``allOf``, ``anyOf`` and ``oneOf`` (see
:ref:`combining`).
Recursion
---------

Let's say that for a shipping address, we want to know whether the
address is a residential or business address, because the shipping
method used may depend on that. For a billing address, we don't
want to store that information, because it's not applicable.
The ``$ref`` keyword may be used to create recursive schemas that
refer to themselves. For example, you might have a ``person`` schema
that has an array of ``children``, each of which is also a ``person``
instance.

To handle this, we'll update our definition of shipping address::
.. schema_example::

"shipping_address": { "$ref": "#/definitions/address" }
{
"type": "object",
"properties": {
"name": { "type": "string" },
"children": {
"type": "array",
* "items": { "$ref": "#" }
}
}
}
--
// A snippet of the British royal family tree
{
"name": "Elizabeth",
"children": [
{
"name": "Charles",
"children": [
{
"name": "William",
"children": [
{ "name": "George" },
{ "name": "Charlotte" }
]
},
{
"name": "Harry"
}
]
}
]
}

to instead use an ``allOf`` keyword entry combining both the core
address schema definition and an extra schema snippet for the address
type::
Above, we created a schema that refers to itself, effectively creating
a "loop" in the validator, which is both allowed and useful. Note,
however, that a ``$ref`` referring to another ``$ref`` could cause
an infinite loop in the resolver, and is explicitly disallowed.

"shipping_address": {
"allOf": [
// Here, we include our "core" address schema...
{ "$ref": "#/definitions/address" },
.. schema_example::

// ...and then extend it with stuff specific to a shipping
// address
{ "properties": {
"type": { "enum": [ "residential", "business" ] }
},
"required": ["type"]
}
]
{
"definitions": {
"alice": { "$ref": "#/definitions/bob" },
"bob": { "$ref": "#/definitions/alice" }
}
}
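The difference between useful recursion and a disallowed loop can be shown with a small resolver. This is a sketch of one possible approach, not the way any specific validator works; ``follow_refs`` is a hypothetical name and it only handles local references such as ``#`` and ``#/definitions/bob``.

```python
def follow_refs(root: dict, ref: str) -> dict:
    """Follow a chain of local ``$ref`` values, failing on a loop."""
    seen = set()
    while True:
        if ref in seen:
            raise ValueError(f"reference loop detected at {ref}")
        seen.add(ref)
        target = root
        # "#" yields no keys (the root); "#/a/b" yields ["a", "b"].
        for key in ref.lstrip("#").split("/")[1:]:
            target = target[key]
        if not (isinstance(target, dict) and "$ref" in target):
            return target
        ref = target["$ref"]


person = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "children": {"type": "array", "items": {"$ref": "#"}},
    },
}
looping = {
    "definitions": {
        "alice": {"$ref": "#/definitions/bob"},
        "bob": {"$ref": "#/definitions/alice"},
    }
}

# The recursive schema resolves: "#" points at a real schema.
print(follow_refs(person, "#") is person)

# The alice/bob schema never reaches anything but another "$ref".
try:
    follow_refs(looping, "#/definitions/alice")
except ValueError as error:
    print(error)
```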

Tying this all together,
.. index::
single: bundling
single: $id
single: structuring; bundling; $id

.. _bundling:

Bundling
--------

Working with multiple schema documents is convenient for development,
but it is often more convenient for distribution to bundle all of your
schemas into a single schema document. This can be done using the
``$id`` keyword in a subschema. When ``$id`` is used in a subschema,
it creates a new `base-uri` that any references in that subschema and
any descendant subschemas will resolve against. The new `base-uri` is
the value of ``$id`` resolved against the `base-uri` of the schema it
appears in.

.. draft_specific::

--Draft 4
In Draft 4, ``$id`` is just ``id`` (without the dollar sign).

This example shows the customer schema example and the address schema
example bundled into a single schema document.

.. schema_example::

{
"$schema": "http://json-schema.org/draft-06/schema#",
"$id": "https://example.com/schemas/customer",

"type": "object",
"properties": {
"first_name": { "type": "string" },
"last_name": { "type": "string" },
"shipping_address": { "$ref": "/schemas/address" },
"billing_address": { "$ref": "/schemas/address" }
},
"required": ["first_name", "last_name", "shipping_address", "billing_address"],

"definitions": {
"address": {
"$id": "/schemas/address",

"type": "object",
"properties": {
"street_address": { "type": "string" },
"city": { "type": "string" },
"state": { "type": "string" }
"city": { "type": "string" },
"state": { "$ref": "#/definitions/state" }
},
"required": ["street_address", "city", "state"]
}
},

"type": "object",
"required": ["street_address", "city", "state"],

"properties": {
"billing_address": { "$ref": "#/definitions/address" },
"shipping_address": {
"allOf": [
{ "$ref": "#/definitions/address" },
{ "properties":
{ "type": { "enum": [ "residential", "business" ] } },
"required": ["type"]
}
]
"definitions": {
"state": { "enum": ["CA", "NY", "... etc ..."] }
}
}
}
}
--X
// This fails, because it's missing an address type:
{
"shipping_address": {
"street_address": "1600 Pennsylvania Avenue NW",
"city": "Washington",
"state": "DC"
}
}
--
{
"shipping_address": {
"street_address": "1600 Pennsylvania Avenue NW",
"city": "Washington",
"state": "DC",
"type": "business"
}
}

From these basic pieces, it's possible to build very powerful
constructions without a lot of duplication.
Notice that the ``$ref`` keywords from the customer schema resolve the
same way they did before except that the address schema is now defined
at ``/definitions/address`` instead of a separate schema document. You
should also see that ``"$ref": "#/definitions/state"`` resolves to the
``definitions`` keyword in the address schema rather than the one in
the top-level schema, as it would if the subschema ``$id`` weren't
used.
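The effect of a subschema ``$id`` on reference resolution can be traced with a short walk over the bundled document. The sketch below is illustrative only: ``resolve_all_refs`` is a hypothetical helper, the schema is a trimmed copy of the bundled example, and the walk naively treats every JSON object as a schema.

```python
from urllib.parse import urljoin


def resolve_all_refs(node, base_uri="", pointer="", out=None):
    """Map the JSON Pointer of each ``$ref`` to its resolved URI."""
    out = {} if out is None else out
    if isinstance(node, dict):
        if isinstance(node.get("$id"), str):
            # A subschema "$id" sets a new base URI for its descendants.
            base_uri = urljoin(base_uri, node["$id"])
        if isinstance(node.get("$ref"), str):
            out[pointer] = urljoin(base_uri, node["$ref"])
        for key, value in node.items():
            resolve_all_refs(value, base_uri, f"{pointer}/{key}", out)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            resolve_all_refs(item, base_uri, f"{pointer}/{index}", out)
    return out


bundled = {
    "$id": "https://example.com/schemas/customer",
    "type": "object",
    "properties": {
        "shipping_address": {"$ref": "/schemas/address"},
        "billing_address": {"$ref": "/schemas/address"},
    },
    "definitions": {
        "address": {
            "$id": "/schemas/address",
            "type": "object",
            "properties": {"state": {"$ref": "#/definitions/state"}},
            "definitions": {"state": {"enum": ["CA", "NY"]}},
        }
    },
}

resolved = resolve_all_refs(bundled)
for location, uri in resolved.items():
    print(location, "->", uri)
```

The reference to ``#/definitions/state`` resolves under the address schema's base URI, not the customer schema's.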

You might notice that this creates a situation where there are
multiple ways to identify a schema. Instead of referencing
``/schemas/address`` (``https://example.com/schemas/address``) you
could have used ``#/definitions/address``
(``https://example.com/schemas/customer#/definitions/address``). While
both of these will work, the one shown in the example is preferred.

.. note::
It is unusual to use ``$id`` in a subschema when developing
schemas. It's generally best not to use this feature explicitly and
use schema bundling tools to construct bundled schemas if such a
thing is needed.