From d7aa81876a5209f99d18155784529a8b3a385190 Mon Sep 17 00:00:00 2001 From: Davis Vann Bennett Date: Wed, 6 Sep 2023 20:59:26 -0400 Subject: [PATCH 1/9] initial draft of zom zep --- draft/ZEP0006.md | 248 +++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 248 insertions(+) create mode 100644 draft/ZEP0006.md diff --git a/draft/ZEP0006.md b/draft/ZEP0006.md new file mode 100644 index 0000000..950e7fb --- /dev/null +++ b/draft/ZEP0006.md @@ -0,0 +1,248 @@ +--- +layout: default +title: ZEP0006 +description: Defining a Zarr Object Model (ZOM) +parent: draft ZEPs +nav_order: 1 +--- + +# ZEP 6 - A Zarr Object Model + +Authors: + +* Davis Bennett([@d-v-b](https://github.com/d-v-b)) HHMI / Janelia Research Campus + +Status: Draft + +Type: Specification + +Created: 2023-07-20 + + +## Abstract + +This ZEP defines Zarr Object Models, or ZOMs. ZOMs are abstract representations of Zarr hierarchy. The core of a ZOM is a language-independent interface that describes an abstract hierarchy as a tree of nodes. + +The base ZOM defines two types of nodes: arrays and groups. Both types of nodes have an `attrs` property, which is an object with string keys and arbitrary values. The base ZOM does not define the exact properties of arrays, as these properties vary with Zarr versions. Groups have a property called `members`, which is an object with string keys and values that are either arrays or groups. A ZOM can be used by applications as the basis for a declarative, type-safe approach to managing Zarr hierarchies. + +## Definition of hierarchy structure + +This document distinguishes the *structure* of a Zarr hierarchy from the data stored in the hierarchy. The structure of a Zarr hierarchy is the layout of the tree of arrays and groups, and the metadata of those arrays and groups. This definition omits the data stored in the arrays, and the particular storage backend used to store data and metadata. 
By these definitions, two distinct Zarr hierarchies can have the same structure even if their arrays contain different values, and / or the hierarchies are stored using different storage backends. + +Because the structure of a zarr hierarchy is decoupled, by definition, from the data stored in the hierarchy, it should be possible to represent the structure of a Zarr hierarchy with a compact data structure or interface. Such a data structure or interface would facilitate operations like evaluating whether two Zarr hierarchies are identically structured, evaluating whether a given Zarr hierarchy has a specific structure, or creating a Zarr hierarchy with a desired structure. This document formalizes the Zarr Object Model, an abstract model of the structure of a Zarr hierarchy. The ZOM serves as a foundation for tools that create and manipulate Zarr hierarchies at a structural level. + +## Specification of the base Zarr Object Model + +A node is an object with a property called `attrs` (short for "attributes"), which is a key-value data structure that contains content described as "arbitrary user metadata" in zarr specifications. As of Zarr versions 2 and 3, `attrs` must be a JSON-serializable object. + +The base ZOM defines exactly two types of node: groups and arrays. This definition will use the unqualified terms "array" and "group" to refer to the two nodes defined in the ZOM. Where necessary to avoid ambiguity, the objects *represented* by ZOM arrays and ZOM groups, i.e. Zarr arrays and Zarr groups, will be referred to as "Zarr arrays" and "Zarr groups". + +ZOM arrays and ZOM groups represent Zarr arrays and Zarr groups in the simplest way possible that still conforms to the definition of "node" given above. 
Thus, a ZOM array is a node with properties identical to those defined in a particular specification of Zarr array metadata, unless one of those Zarr array properties contains user metadata, in which case a ZOM array does not include that property (since user metadata is already represented by the `attrs` property of the array). This definition is parametric with respect to a particular Zarr specification in order to accomodate future versions of Zarr that may add new properties to Zarr arrays. + +Similarly, a ZOM group is a node with properties identical to those defined in a specification of Zarr group metadata, unless one of those properties contains user metadata, in which case a ZOM group does not contain that property, for the same reason given above for arrays. Beyond the properties of Zarr groups defined in a particular Zarr specification, a ZOM group has an additional property: + +- `members`: a key-value data structure where the keys are strings and the values are arrays or groups. This property allows a ZOM group to represent the hierarchical relationship between Zarr groups and the Zarr arrays or Zarr groups contained within them. + +If future versions of Zarr use a property called `members` for some element of Zarr group metadata, then there would be a naming collision between the `members` property of a Zarr group and the `members` property of a ZOM group. In this case, the ZOM group would rename the Zarr group's `members` property to `_members`, and any additional name collisions would be resolved by prepending additional underscore ("_") characters. E.g., in the unlikely case that `members` and `_members` are *both* listed in Zarr group metadata, then the schema group representation would map the `members` property of the Zarr group to a property called `__members`. + +Thus, ZOM groups and ZOM arrays can represent the structure of a Zarr hierarchy, per the description given in [#definition-of-hierarchy-structure]. 
+ +### ZOM in JSON + +The ZOM representation of a Zarr hierarchy can be easily represented as a JSON object. Here is an example of a ZOM group representing a Zarr group that contains a single two-dimensional Zarr array using Zarr version 2. Both the Zarr group and the Zarr array contain user metadata. + +```json +{ + "zarr_format" : 2, + "attrs": { + "foo" : 10, + "bar" : "hello" + }, + "members": { + "foo": { + "zarr_format" : 2, + "shape" : [10,10], + "chunks": [1,1], + "dtype": "|u1", + "compressor": null, + "fill_value": 0, + "order": "C", + "filters": null, + "attrs" : { + "name": "my cool array" + } + } + } +} +``` + +The ZOM itself can also be represented as a JSON schema. Here is a the ZOM for Zarr V2 expressed as a JSON schema: +```json +{ + "$ref": "#/definitions/Group", + "definitions": { + "Array": { + "title": "Array", + "description": "Model of a Zarr Version 2 Array", + "type": "object", + "properties": { + "attrs": { + "title": "Attrs", + "type": "object" + }, + "shape": { + "title": "Shape", + "type": "array", + "items": { + "type": "integer" + } + }, + "chunks": { + "title": "Chunks", + "type": "array", + "items": { + "type": "integer" + } + }, + "dtype": { + "title": "Dtype", + "anyOf": [ + { + "type": "string" + }, + { + "type": "array", + "items": { + "type": "string" + } + } + ] + }, + "compressor": { + "title": "Compressor", + "type": "object" + }, + "fill_value": { + "title": "Fill Value" + }, + "order": { + "title": "Order", + "enum": [ + "C", + "F" + ], + "type": "string" + }, + "filters": { + "title": "Filters", + "type": "array", + "items": { + "type": "object" + } + }, + "dimension_separator": { + "title": "Dimension Separator", + "enum": [ + ".", + "/" + ], + "type": "string" + }, + "zarr_version": { + "title": "Zarr Version", + "default": 2, + "type": "integer" + } + }, + "required": [ + "attrs", + "shape", + "chunks", + "dtype", + "compressor", + "order", + "filters" + ], + "additionalProperties": false + }, + "Group": { + "title": 
"Group", + "description": "Model of a Zarr Version 2 Group", + "type": "object", + "properties": { + "attrs": { + "title": "Attrs", + "type": "object" + }, + "members": { + "title": "Members", + "type": "object", + "additionalProperties": { + "anyOf": [ + { + "$ref": "#/definitions/Array" + }, + { + "$ref": "#/definitions/Group" + } + ] + } + }, + "zarr_version": { + "title": "Zarr Version", + "default": 2, + "type": "integer" + } + }, + "required": [ + "attrs", + "members" + ], + "additionalProperties": false + } + } +} +``` + +And Zarr V3: + +```json +# insert schema for v3 here +``` + + +## Related Work + + + +## Implementation + +- pydantic zarr +- ? + +## Discussion + +- todo: show that consolidated metadata can be achieved by applying a flattening transformation to a ZOM representation of a hierarchy. +- - The origins of consolidated metadata: + * + * + + +## References and Footnotes + + +## License + +

+ + CC0 + +
+ To the extent possible under law, + + the authors + have waived all copyright and related or neighboring rights to + ZEP 1. +
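The `members` name-collision rule described in this first patch can be sketched in a few lines of Python. This is a minimal illustration and not part of the proposal: the function name `renamed_members_key` and its set-of-keys input are assumptions made for the example.

```python
from typing import Optional, Set


def renamed_members_key(zarr_group_keys: Set[str]) -> Optional[str]:
    """Return the ZOM property name used for a Zarr group's own `members` field.

    Hypothetical helper: `zarr_group_keys` is the set of property names found
    in the Zarr group metadata. Returns None when there is no collision.
    """
    if "members" not in zarr_group_keys:
        # No collision: the ZOM group's `members` property is unambiguous.
        return None
    candidate = "_members"
    # Prepend underscores until the name no longer collides with a Zarr field.
    while candidate in zarr_group_keys:
        candidate = "_" + candidate
    return candidate
```

With this sketch, a Zarr group whose metadata lists both `members` and `_members` has its `members` field mapped to `__members`, matching the example given in the draft text.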

From 7bd949dca8837a73ac97047751cd79f48acb92e3 Mon Sep 17 00:00:00 2001 From: Davis Vann Bennett Date: Thu, 21 Sep 2023 09:39:21 -0400 Subject: [PATCH 2/9] add motivating hierarchy equality example, give zarr v3 priority in examples, change attrs to attributes --- draft/ZEP0006.md | 68 +++++++++++++++++++++++++++++++++--------------- 1 file changed, 47 insertions(+), 21 deletions(-) diff --git a/draft/ZEP0006.md b/draft/ZEP0006.md index 950e7fb..ce7a090 100644 --- a/draft/ZEP0006.md +++ b/draft/ZEP0006.md @@ -1,7 +1,7 @@ --- layout: default title: ZEP0006 -description: Defining a Zarr Object Model (ZOM) +description: Zarr Object Models (ZOMs) parent: draft ZEPs nav_order: 1 --- @@ -23,7 +23,20 @@ Created: 2023-07-20 This ZEP defines Zarr Object Models, or ZOMs. ZOMs are abstract representations of Zarr hierarchy. The core of a ZOM is a language-independent interface that describes an abstract hierarchy as a tree of nodes. -The base ZOM defines two types of nodes: arrays and groups. Both types of nodes have an `attrs` property, which is an object with string keys and arbitrary values. The base ZOM does not define the exact properties of arrays, as these properties vary with Zarr versions. Groups have a property called `members`, which is an object with string keys and values that are either arrays or groups. A ZOM can be used by applications as the basis for a declarative, type-safe approach to managing Zarr hierarchies. +The base ZOM defines two types of nodes: arrays and groups. Both types of nodes have an `attributes` property, which is an object with string keys and arbitrary values. The base ZOM does not define the exact properties of arrays, as these properties vary with Zarr versions. Groups have a property called `members`, which is an object with string keys and values that are either arrays or groups. A ZOM can be used by applications as the basis for a declarative, type-safe approach to managing Zarr hierarchies. 
+ +## Motivation and Scope + +The reference python implementation of Zarr provides APIs for managing Zarr groups and Zarr arrays. The final product of these operations is a collection of Zarr groups and arrays, i.e. a Zarr hierarchy. But the reference python implementation does *not* provide APIs for managing Zarr hierarchies directly. + +To see why this matters, consider a programmer who wishes to check that two Zarr hierarchies (called "A" and "B") are identically structured, i.e. that the two hierarchies have the same tree structure, with structurally identical nodes. This requires resolving two checks: +- for each Zarr array in hierarchy A, there is a Zarr array in hierarchy B with the same position in the hierarchy, the same metadata, and the same array properties. +- for each Zarr group in hierarchy A, there is a Zarr group in hierarchy B with the same position in the hierarchy, the same metadata, and that the members of both groups have members that pass this check and the previously defined array equality check. + +Using an API that only references Zarr arrays and groups, the programmer will be forced to write a new hierarchy equality checking routine for each new hierarchy. But if the programmer has access to a data structure that can represent a Zarr hierarchy, then the aforementioned binary similarity operation can be defined just once for this data structure, and it will work for any two Zarr hierarchies. This is a much better outcome. + +There are many situations when programmers must read, validate, and write Zarr hierarchies. Because the Zarr specifications do not define a data structure that represents the Zarr hierarchy itself, i.e. a tree of arrays and groups, developers who attempt to create APIs for manipulating entire Zarr hierarchies must design such a data structure independently, which may lead to unnecessary fragmentation and redundant efforts. Thus, this ZEP introduces these data structures. 
+ ## Definition of hierarchy structure @@ -33,11 +46,11 @@ Because the structure of a zarr hierarchy is decoupled, by definition, from the ## Specification of the base Zarr Object Model -A node is an object with a property called `attrs` (short for "attributes"), which is a key-value data structure that contains content described as "arbitrary user metadata" in zarr specifications. As of Zarr versions 2 and 3, `attrs` must be a JSON-serializable object. +A node is an object with a property called `attributes` (short for "attributes"), which is a key-value data structure that contains content described as "arbitrary user metadata" in zarr specifications. As of Zarr versions 2 and 3, `attributes` must be a JSON-serializable object. The base ZOM defines exactly two types of node: groups and arrays. This definition will use the unqualified terms "array" and "group" to refer to the two nodes defined in the ZOM. Where necessary to avoid ambiguity, the objects *represented* by ZOM arrays and ZOM groups, i.e. Zarr arrays and Zarr groups, will be referred to as "Zarr arrays" and "Zarr groups". -ZOM arrays and ZOM groups represent Zarr arrays and Zarr groups in the simplest way possible that still conforms to the definition of "node" given above. Thus, a ZOM array is a node with properties identical to those defined in a particular specification of Zarr array metadata, unless one of those Zarr array properties contains user metadata, in which case a ZOM array does not include that property (since user metadata is already represented by the `attrs` property of the array). This definition is parametric with respect to a particular Zarr specification in order to accomodate future versions of Zarr that may add new properties to Zarr arrays. +ZOM arrays and ZOM groups represent Zarr arrays and Zarr groups in the simplest way possible that still conforms to the definition of "node" given above. 
Thus, a ZOM array is a node with properties identical to those defined in a particular specification of Zarr array metadata, unless one of those Zarr array properties contains user metadata, in which case a ZOM array does not include that property (since user metadata is already represented by the `attributes` property of the array). This definition is parametric with respect to a particular Zarr specification in order to accommodate future versions of Zarr that may add new properties to Zarr arrays. Similarly, a ZOM group is a node with properties identical to those defined in a specification of Zarr group metadata, unless one of those properties contains user metadata, in which case a ZOM group does not contain that property, for the same reason given above for arrays. Beyond the properties of Zarr groups defined in a particular Zarr specification, a ZOM group has an additional property: @@ -49,12 +62,20 @@ Thus, ZOM groups and ZOM arrays can represent the structure of a Zarr hierarchy, ### ZOM in JSON -The ZOM representation of a Zarr hierarchy can be easily represented as a JSON object. Here is an example of a ZOM group representing a Zarr group that contains a single two-dimensional Zarr array using Zarr version 2. Both the Zarr group and the Zarr array contain user metadata. +The ZOM representation of a Zarr hierarchy can be easily represented as a JSON object. + +Here is an example of a ZOM group representing a Zarr group that contains a single two-dimensional Zarr array using Zarr version 3. Both the Zarr group and the Zarr array contain user metadata. + +```json +Insert V3 hierarchy example here +``` + +And the same can be done for a similar hierarchy defined in Zarr V2. 
```json { "zarr_format" : 2, - "attrs": { + "attributes": { "foo" : 10, "bar" : "hello" }, @@ -68,7 +89,7 @@ The ZOM representation of a Zarr hierarchy can be easily represented as a JSON o "fill_value": 0, "order": "C", "filters": null, - "attrs" : { + "attributes" : { "name": "my cool array" } } @@ -76,7 +97,17 @@ The ZOM representation of a Zarr hierarchy can be easily represented as a JSON o } ``` -The ZOM itself can also be represented as a JSON schema. Here is a the ZOM for Zarr V2 expressed as a JSON schema: +To facilitate adoption of new Zarr versions, it may be desirable to define a mapping from ZOM to ZOM, e.g. ZOM[V2] -> ZOM[V3]. Programs could use this mapping to execute automatic conversions of hierarchies to newer Zarr versions. + + +A ZOM can also be represented as a JSON schema. Here is a the ZOM for Zarr V3 expressed as a JSON schema: + +```json +# insert schema for v3 here +``` + +And likewise for Zarr V2: + ```json { "$ref": "#/definitions/Group", @@ -86,8 +117,8 @@ The ZOM itself can also be represented as a JSON schema. Here is a the ZOM for Z "description": "Model of a Zarr Version 2 Array", "type": "object", "properties": { - "attrs": { - "title": "Attrs", + "attributes": { + "title": "Attributes", "type": "object" }, "shape": { @@ -155,7 +186,7 @@ The ZOM itself can also be represented as a JSON schema. Here is a the ZOM for Z } }, "required": [ - "attrs", + "attributes", "shape", "chunks", "dtype", @@ -170,8 +201,8 @@ The ZOM itself can also be represented as a JSON schema. Here is a the ZOM for Z "description": "Model of a Zarr Version 2 Group", "type": "object", "properties": { - "attrs": { - "title": "Attrs", + "attributes": { + "title": "Attributes", "type": "object" }, "members": { @@ -195,7 +226,7 @@ The ZOM itself can also be represented as a JSON schema. 
Here is a the ZOM for Z } }, "required": [ - "attrs", + "attributes", "members" ], "additionalProperties": false @@ -204,12 +235,6 @@ The ZOM itself can also be represented as a JSON schema. Here is a the ZOM for Z } ``` -And Zarr V3: - -```json -# insert schema for v3 here -``` - ## Related Work @@ -230,7 +255,8 @@ And Zarr V3: ## References and Footnotes - +[^1]: https://github.com/zarr-developers/geozarr-spec +[^2]: http://api.csswg.org/bikeshed/?url=https://raw.githubusercontent.com/ome/ngff/master/0.4/index.bs#multiscale-md ## License

From 60594f33a5cd606bf61e14a11209c99a0e9a75a3 Mon Sep 17 00:00:00 2001 From: Davis Vann Bennett Date: Thu, 21 Sep 2023 09:47:50 -0400 Subject: [PATCH 3/9] clarifiy hierarchy equality example --- draft/ZEP0006.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/draft/ZEP0006.md b/draft/ZEP0006.md index ce7a090..dd7ec25 100644 --- a/draft/ZEP0006.md +++ b/draft/ZEP0006.md @@ -31,7 +31,7 @@ The reference python implementation of Zarr provides APIs for managing Zarr grou To see why this matters, consider a programmer who wishes to check that two Zarr hierarchies (called "A" and "B") are identically structured, i.e. that the two hierarchies have the same tree structure, with structurally identical nodes. This requires resolving two checks: - for each Zarr array in hierarchy A, there is a Zarr array in hierarchy B with the same position in the hierarchy, the same metadata, and the same array properties. -- for each Zarr group in hierarchy A, there is a Zarr group in hierarchy B with the same position in the hierarchy, the same metadata, and that the members of both groups have members that pass this check and the previously defined array equality check. +- for each Zarr group in hierarchy A, there is a Zarr group in hierarchy B with the same position in the hierarchy, the same metadata; additionaly, the members of both groups pass this check (for group members) or the previously defined array equality check (for array members). Using an API that only references Zarr arrays and groups, the programmer will be forced to write a new hierarchy equality checking routine for each new hierarchy. But if the programmer has access to a data structure that can represent a Zarr hierarchy, then the aforementioned binary similarity operation can be defined just once for this data structure, and it will work for any two Zarr hierarchies. This is a much better outcome. 
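The two equality checks described in this patch can be written once against a hierarchy-level data structure. The following is a minimal sketch, assuming each node is a plain Python `dict` shaped like the JSON examples in the draft, and that a node is a group exactly when it has a `members` key. The function name `structural_diff` is hypothetical.

```python
from typing import Any, Dict, List


def structural_diff(a: Dict[str, Any], b: Dict[str, Any], path: str = "") -> List[str]:
    """List the locations at which two ZOM-style nested dicts differ.

    An empty result means the two hierarchies are identically structured.
    """
    here = path or "/"
    # A node is treated as a group if and only if it has a `members` key.
    if ("members" in a) != ("members" in b):
        return [f"{here}: one node is a group, the other is an array"]
    diffs: List[str] = []
    # Compare every property except `members` (attributes and node metadata).
    meta_a = {k: v for k, v in a.items() if k != "members"}
    meta_b = {k: v for k, v in b.items() if k != "members"}
    if meta_a != meta_b:
        diffs.append(f"{here}: metadata differs")
    if "members" in a:
        names_a, names_b = set(a["members"]), set(b["members"])
        for name in sorted(names_a ^ names_b):
            diffs.append(f"{path}/{name}: present in only one hierarchy")
        # Recurse into members that exist in both hierarchies.
        for name in sorted(names_a & names_b):
            diffs.extend(
                structural_diff(a["members"][name], b["members"][name], f"{path}/{name}")
            )
    return diffs
```

Because the routine only inspects the tree and its metadata, it works for any two hierarchies regardless of array contents or storage backend, which is the point the motivation section makes.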
From 1b9eb87e46e07d66596275fd8f19224882ea0de8 Mon Sep 17 00:00:00 2001 From: Davis Vann Bennett Date: Thu, 21 Sep 2023 11:50:43 -0400 Subject: [PATCH 4/9] attributes is not short for itself --- draft/ZEP0006.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/draft/ZEP0006.md b/draft/ZEP0006.md index dd7ec25..bd918ed 100644 --- a/draft/ZEP0006.md +++ b/draft/ZEP0006.md @@ -46,7 +46,7 @@ Because the structure of a zarr hierarchy is decoupled, by definition, from the ## Specification of the base Zarr Object Model -A node is an object with a property called `attributes` (short for "attributes"), which is a key-value data structure that contains content described as "arbitrary user metadata" in zarr specifications. As of Zarr versions 2 and 3, `attributes` must be a JSON-serializable object. +A node is an object with a property called `attributes`, which is a key-value data structure that contains content described as "arbitrary user metadata" in zarr specifications. As of Zarr versions 2 and 3, `attributes` must be a JSON-serializable object. The base ZOM defines exactly two types of node: groups and arrays. This definition will use the unqualified terms "array" and "group" to refer to the two nodes defined in the ZOM. Where necessary to avoid ambiguity, the objects *represented* by ZOM arrays and ZOM groups, i.e. Zarr arrays and Zarr groups, will be referred to as "Zarr arrays" and "Zarr groups". 
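The node definition touched by this patch (every node has `attributes`, and groups additionally have `members`) might be modeled as follows. This is an illustrative sketch only: the class names `ArrayNode` and `GroupNode`, the `metadata` field that collects version-specific properties, and the `from_json` helper are all assumptions, not names defined by the ZEP.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Union


@dataclass
class ArrayNode:
    # User metadata, an object with string keys.
    attributes: Dict[str, Any] = field(default_factory=dict)
    # Version-specific array properties (shape, dtype, ...), kept opaque here.
    metadata: Dict[str, Any] = field(default_factory=dict)


@dataclass
class GroupNode:
    attributes: Dict[str, Any] = field(default_factory=dict)
    metadata: Dict[str, Any] = field(default_factory=dict)
    # Child nodes keyed by name; values are arrays or groups.
    members: Dict[str, Union["ArrayNode", "GroupNode"]] = field(default_factory=dict)


def from_json(doc: Dict[str, Any]) -> Union[ArrayNode, GroupNode]:
    """Build a node tree from a ZOM-style JSON object (already parsed to a dict)."""
    doc = dict(doc)  # shallow copy so the caller's dict is not mutated
    attributes = doc.pop("attributes", {})
    if "members" in doc:
        members = {name: from_json(child) for name, child in doc.pop("members").items()}
        return GroupNode(attributes=attributes, metadata=doc, members=members)
    return ArrayNode(attributes=attributes, metadata=doc)
```

Grouping the version-specific properties under a single `metadata` field is a simplification for the sketch; the ZEP itself places those properties directly on the node.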
From 46e71c498b88d60f797b2e0a1726719474c028e6 Mon Sep 17 00:00:00 2001 From: Sanket Verma Date: Thu, 26 Oct 2023 12:27:15 +0000 Subject: [PATCH 5/9] Update draft/ZEP0006.md --- draft/ZEP0006.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/draft/ZEP0006.md b/draft/ZEP0006.md index bd918ed..4c4d1dd 100644 --- a/draft/ZEP0006.md +++ b/draft/ZEP0006.md @@ -3,7 +3,7 @@ layout: default title: ZEP0006 description: Zarr Object Models (ZOMs) parent: draft ZEPs -nav_order: 1 +nav_order: 6 --- # ZEP 6 - A Zarr Object Model From e1b8755caa91c789fba5a31c5b0fd89fa1a04438 Mon Sep 17 00:00:00 2001 From: Davis Vann Bennett Date: Sun, 29 Oct 2023 20:57:51 +0100 Subject: [PATCH 6/9] spec: rework motivation (it's simpler now) and add examples / JSON schemas for v3 --- draft/ZEP0006.md | 319 +++++++++++++++++++++++++++++++++++++++++++---- 1 file changed, 293 insertions(+), 26 deletions(-) diff --git a/draft/ZEP0006.md b/draft/ZEP0006.md index 4c4d1dd..c390b54 100644 --- a/draft/ZEP0006.md +++ b/draft/ZEP0006.md @@ -10,7 +10,7 @@ nav_order: 6 Authors: -* Davis Bennett([@d-v-b](https://github.com/d-v-b)) HHMI / Janelia Research Campus +* Davis Bennett([@d-v-b](https://github.com/d-v-b)) Status: Draft @@ -21,32 +21,33 @@ Created: 2023-07-20 ## Abstract -This ZEP defines Zarr Object Models, or ZOMs. ZOMs are abstract representations of Zarr hierarchy. The core of a ZOM is a language-independent interface that describes an abstract hierarchy as a tree of nodes. +This ZEP defines Zarr Object Models, or ZOMs. A ZOM is a language-independent interface that describes an abstract Zarr hierarchy as a tree of nodes. ZOMs are parametrized by a particular Zarr version, so there is a ZOM for Zarr Version 2, which differs from the ZOM for Zarr V3. -The base ZOM defines two types of nodes: arrays and groups. Both types of nodes have an `attributes` property, which is an object with string keys and arbitrary values. 
The base ZOM does not define the exact properties of arrays, as these properties vary with Zarr versions. Groups have a property called `members`, which is an object with string keys and values that are either arrays or groups. A ZOM can be used by applications as the basis for a declarative, type-safe approach to managing Zarr hierarchies. +The basic ZOM defines two types of nodes: arrays and groups. Both types of nodes have an `attributes` property, which is an object with string keys and arbitrary values. The base ZOM does not define the exact properties of arrays, as these properties vary with Zarr versions. In the base ZOM, groups have a property called `members`, which is an object with string keys and values that are either arrays or groups. This definition is designed to be abstract enough that it can be implemented by a range of programming languages, and expressed in a wide range of interchange formats. -## Motivation and Scope +The ZOM forms the basis for a declarative, type-safe approach to managing Zarr hierarchies. -The reference python implementation of Zarr provides APIs for managing Zarr groups and Zarr arrays. The final product of these operations is a collection of Zarr groups and arrays, i.e. a Zarr hierarchy. But the reference python implementation does *not* provide APIs for managing Zarr hierarchies directly. +## Motivation and Scope -To see why this matters, consider a programmer who wishes to check that two Zarr hierarchies (called "A" and "B") are identically structured, i.e. that the two hierarchies have the same tree structure, with structurally identical nodes. This requires resolving two checks: -- for each Zarr array in hierarchy A, there is a Zarr array in hierarchy B with the same position in the hierarchy, the same metadata, and the same array properties. 
-- for each Zarr group in hierarchy A, there is a Zarr group in hierarchy B with the same position in the hierarchy, the same metadata; additionaly, the members of both groups pass this check (for group members) or the previously defined array equality check (for array members). +The Zarr specifications define models of arrays and groups; the Zarr specifications do not define models for hierarchies of arrays and groups. This is unfortunate, because many users of Zarr work primarily with structured hierarchies. -Using an API that only references Zarr arrays and groups, the programmer will be forced to write a new hierarchy equality checking routine for each new hierarchy. But if the programmer has access to a data structure that can represent a Zarr hierarchy, then the aforementioned binary similarity operation can be defined just once for this data structure, and it will work for any two Zarr hierarchies. This is a much better outcome. +For example, the Python library `xarray` defines data structures that can be persisted to Zarr as a Zarr group containing one or more Zarr arrays with specific metadata. The full specification of this format can be found [here](https://docs.xarray.dev/en/stable/internals/zarr-encoding-spec.html#zarr-encoding). For `xarray` (or any other application that works with structured hierarchies) to save data to Zarr, it must first create a compliant Zarr hierarchy. To read data from Zarr, `xarray` must first check if the potential source of data is an `xarray`-compliant Zarr hierarchy. -There are many situations when programmers must read, validate, and write Zarr hierarchies. Because the Zarr specifications do not define a data structure that represents the Zarr hierarchy itself, i.e. a tree of arrays and groups, developers who attempt to create APIs for manipulating entire Zarr hierarchies must design such a data structure independently, which may lead to unnecessary fragmentation and redundant efforts. 
Thus, this ZEP introduces these data structures. +Creating and validating Zarr hierarchies can be done procedurally, i.e. as a sequence of Zarr array and group access routines, or declaratively, as a hierarchy definition followed by a procedure that implements the definition. The latter is preferable, but it first requires a machine-readable data model for a Zarr hierarchy. +Such models should be sufficient to express the structure of a Zarr hierarchy, and these models must be usable by Zarr implementations as a basis for declarative APIs for creating and validating Zarr hierarchies. That is the central goal of this proposal. ## Definition of hierarchy structure This document distinguishes the *structure* of a Zarr hierarchy from the data stored in the hierarchy. The structure of a Zarr hierarchy is the layout of the tree of arrays and groups, and the metadata of those arrays and groups. This definition omits the data stored in the arrays, and the particular storage backend used to store data and metadata. By these definitions, two distinct Zarr hierarchies can have the same structure even if their arrays contain different values, and / or the hierarchies are stored using different storage backends. -Because the structure of a zarr hierarchy is decoupled, by definition, from the data stored in the hierarchy, it should be possible to represent the structure of a Zarr hierarchy with a compact data structure or interface. Such a data structure or interface would facilitate operations like evaluating whether two Zarr hierarchies are identically structured, evaluating whether a given Zarr hierarchy has a specific structure, or creating a Zarr hierarchy with a desired structure. This document formalizes the Zarr Object Model, an abstract model of the structure of a Zarr hierarchy. The ZOM serves as a foundation for tools that create and manipulate Zarr hierarchies at a structural level. 
+Because the structure of a zarr hierarchy is decoupled, by definition, from the data stored in the hierarchy, it should be possible to represent the structure of a Zarr hierarchy with a compact data structure or interface. Such a data structure or interface would facilitate operations like evaluating whether two Zarr hierarchies are identically structured, evaluating whether a given Zarr hierarchy has a specific structure, or creating a Zarr hierarchy with a desired structure. This document formalizes the Zarr Object Model (ZOM), an abstract model of the structure of a Zarr hierarchy. The ZOM serves as a foundation for tools that create and manipulate Zarr hierarchies at a structural level. ## Specification of the base Zarr Object Model -A node is an object with a property called `attributes`, which is a key-value data structure that contains content described as "arbitrary user metadata" in zarr specifications. As of Zarr versions 2 and 3, `attributes` must be a JSON-serializable object. +We begin with a definition of a "base" Zarr Object Model. On its own, the base ZOM is not useful for working with actual Zarr hierarchies, because it contains a reference to an unspecified Zarr version. By supplying definitions from a particular Zarr version, we can specialize the base ZOM and produce an object that can be used for doing actual work. + +A node is an object with a property called `attributes`, which is a key-value data structure that contains content described as "arbitrary user metadata" in Zarr specifications. As of Zarr versions 2 and 3, `attributes` must be a JSON-serializable object with string keys. The base ZOM defines exactly two types of node: groups and arrays. This definition will use the unqualified terms "array" and "group" to refer to the two nodes defined in the ZOM. Where necessary to avoid ambiguity, the objects *represented* by ZOM arrays and ZOM groups, i.e. Zarr arrays and Zarr groups, will be referred to as "Zarr arrays" and "Zarr groups". 
@@ -54,56 +55,313 @@ ZOM arrays and ZOM groups represent Zarr arrays and Zarr groups in the simplest Similarly, a ZOM group is a node with properties identical to those defined in a specification of Zarr group metadata, unless one of those properties contains user metadata, in which case a ZOM group does not contain that property, for the same reason given above for arrays. Beyond the properties of Zarr groups defined in a particular Zarr specification, a ZOM group has an additional property: -- `members`: a key-value data structure where the keys are strings and the values are arrays or groups. This property allows a ZOM group to represent the hierarchical relationship between Zarr groups and the Zarr arrays or Zarr groups contained within them. +- `members`: a key-value data structure where the keys are the subset of strings that are permitted node names according to a particular Zarr specification, and the values are arrays or groups. This property allows a ZOM group to represent the hierarchical relationship between Zarr groups and the Zarr arrays or Zarr groups contained within them. If future versions of Zarr use a property called `members` for some element of Zarr group metadata, then there would be a naming collision between the `members` property of a Zarr group and the `members` property of a ZOM group. In this case, the ZOM group would rename the Zarr group's `members` property to `_members`, and any additional name collisions would be resolved by prepending additional underscore ("_") characters. E.g., in the unlikely case that `members` and `_members` are *both* listed in Zarr group metadata, then the schema group representation would map the `members` property of the Zarr group to a property called `__members`. Thus, ZOM groups and ZOM arrays can represent the structure of a Zarr hierarchy, per the description given in [#definition-of-hierarchy-structure]. 
-### ZOM in JSON +### Zarr Object Models in JSON The ZOM representation of a Zarr hierarchy can be easily represented as a JSON object. -Here is an example of a ZOM group representing a Zarr group that contains a single two-dimensional Zarr array using Zarr version 3. Both the Zarr group and the Zarr array contain user metadata. +See below an example of a Zarr version 3 ZOM group representing a Zarr group that contains a single Zarr array. Both the Zarr group and the Zarr array contain user metadata. ```json -Insert V3 hierarchy example here +{ + "zarr_format": 3, + "node_type": "group", + "attributes": { + "foo": 42, + "bar": false + }, + "members": { + "array": { + "zarr_format": 3, + "node_type": "array", + "attributes": { + "baz": [ + 1, + 2, + 3 + ] + }, + "shape": [ + 1000, + 1000 + ], + "data_type": "|u1", + "chunk_grid": { + "name": "regular", + "configuration": { + "chunk_shape": [ + 1000, + 100 + ] + } + }, + "chunk_key_encoding": { + "name": "default", + "configuration": { + "separator": "/" + } + }, + "fill_value": 0, + "codecs": [ + { + "name": "GZip", + "configuration": { + "level": 1 + } + } + ], + "storage_transformers": null, + "dimension_names": [ + "rows", + "columns" + ] + } + } +} ``` -And the same can be done for a similar hierarchy defined in Zarr V2. +A similar hierarchy defined in Zarr V2 can also be represented as a ZOM group: ```json { "zarr_format" : 2, "attributes": { - "foo" : 10, - "bar" : "hello" + "foo" : 42, + "bar" : false }, "members": { "foo": { "zarr_format" : 2, - "shape" : [10,10], - "chunks": [1,1], + "shape" : [1000, 1000], + "chunks": [100, 100], "dtype": "|u1", "compressor": null, "fill_value": 0, "order": "C", - "filters": null, + "dimension_separator": "/", + "filters": {}, "attributes" : { - "name": "my cool array" + "baz": true } } } } ``` -To facilitate adoption of new Zarr versions, it may be desirable to define a mapping from ZOM to ZOM, e.g. ZOM[V2] -> ZOM[V3]. 
Programs could use this mapping to execute automatic conversions of hierarchies to newer Zarr versions. - +### Zarr Object Models in JSON schema A ZOM can also be represented as a JSON schema. Here is a the ZOM for Zarr V3 expressed as a JSON schema: ```json -# insert schema for v3 here +{ + "$defs": { + "ArraySpec": { + "additionalProperties": false, + "description": "A model of a Zarr version 3 Array", + "properties": { + "zarr_format": { + "const": 3, + "default": 3, + "title": "Zarr Format" + }, + "node_type": { + "const": "array", + "default": "array", + "title": "Node Type" + }, + "attributes": { + "default": {}, + "title": "Attributes", + "type": "object" + }, + "shape": { + "items": { + "type": "integer" + }, + "title": "Shape", + "type": "array" + }, + "data_type": { + "title": "Data Type", + "type": "string" + }, + "chunk_grid": { + "$ref": "#/$defs/NamedConfig" + }, + "chunk_key_encoding": { + "$ref": "#/$defs/NamedConfig" + }, + "fill_value": { + "anyOf": [ + { + "type": "boolean" + }, + { + "type": "integer" + }, + { + "type": "number" + }, + { + "const": "Infinity" + }, + { + "const": "-Infinity" + }, + { + "const": "NaN" + }, + { + "type": "string" + }, + { + "maxItems": 2, + "minItems": 2, + "prefixItems": [ + { + "type": "number" + }, + { + "type": "number" + } + ], + "type": "array" + }, + { + "items": { + "type": "integer" + }, + "type": "array" + } + ], + "title": "Fill Value" + }, + "codecs": { + "items": { + "$ref": "#/$defs/NamedConfig" + }, + "title": "Codecs", + "type": "array" + }, + "storage_transformers": { + "anyOf": [ + { + "items": { + "$ref": "#/$defs/NamedConfig" + }, + "type": "array" + }, + { + "type": "null" + } + ], + "default": null, + "title": "Storage Transformers" + }, + "dimension_names": { + "anyOf": [ + { + "items": { + "type": "string" + }, + "type": "array" + }, + { + "type": "null" + } + ], + "title": "Dimension Names" + } + }, + "required": [ + "shape", + "data_type", + "chunk_grid", + "chunk_key_encoding", + 
"fill_value", + "codecs" + ], + "title": "ArraySpec", + "type": "object" + }, + "GroupSpec": { + "additionalProperties": false, + "properties": { + "zarr_format": { + "const": 3, + "default": 3, + "title": "Zarr Format" + }, + "node_type": { + "const": "group", + "default": "group", + "title": "Node Type" + }, + "attributes": { + "title": "Attributes", + "type": "object" + }, + "members": { + "additionalProperties": { + "anyOf": [ + { + "$ref": "#/$defs/GroupSpec" + }, + { + "$ref": "#/$defs/ArraySpec" + } + ] + }, + "default": {}, + "title": "Members", + "type": "object" + } + }, + "required": [ + "attributes", + "members" + ], + "title": "GroupSpec", + "type": "object" + }, + "NamedConfig": { + "additionalProperties": false, + "properties": { + "name": { + "title": "Name", + "type": "string" + }, + "configuration": { + "anyOf": [ + { + "type": "object" + }, + { + "type": "null" + } + ], + "title": "Configuration" + } + }, + "required": [ + "name", + "configuration" + ], + "title": "NamedConfig", + "type": "object" + } + }, + "allOf": [ + { + "$ref": "#/$defs/GroupSpec" + } + ] +} ``` And likewise for Zarr V2: @@ -235,6 +493,15 @@ And likewise for Zarr V2: } ``` +To facilitate adoption of new Zarr versions, it may be desirable to define a mapping from ZOM to ZOM, e.g. ZOM for Zarr v2 -> ZOM for Zarr v3. Programs could use such mappings to execute programmatic conversions of hierarchies to newer Zarr versions. + +### Implementing consolidated metadata via Zarr Object Models + +The time required to traverse large, deeply-nested Zarr hierarchies stored on high-latency backends (e.g., cloud storage) can be onerous for applications that consume Zarr containers. One solution to this problem is to consolidate the metadata for each node or group in the hierarchy into a document stored at the root of the hierarchy. 
+ + + + ## Related Work From a89bde6f420d34bc38215383edf9f9f8f7527f6b Mon Sep 17 00:00:00 2001 From: Davis Vann Bennett Date: Sun, 29 Oct 2023 22:26:05 +0100 Subject: [PATCH 7/9] spec: add validation examples --- draft/ZEP0006.md | 130 +++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 130 insertions(+) diff --git a/draft/ZEP0006.md b/draft/ZEP0006.md index c390b54..043786d 100644 --- a/draft/ZEP0006.md +++ b/draft/ZEP0006.md @@ -499,8 +499,138 @@ To facilitate adoption of new Zarr versions, it may be desirable to define a map The time required to traverse large, deeply-nested Zarr hierarchies stored on high-latency backends (e.g., cloud storage) can be onerous for applications that consume Zarr containers. One solution to this problem is to consolidate the metadata for each node or group in the hierarchy into a document stored at the root of the hierarchy. +### Validating Zarr Object Models + +A key motivator for developing the ZOM was the need to *validate* Zarr hierarchies. Communities that use Zarr often define conventions for storing their particular data models in Zarr hierarchies, and often these conventions could be statically checked if there was a data model for the hierarchy that was amenable to static type checking. The ZOM satisfies this constraint. + +For example, consider the `xarray` model introduced earlier, which requires a key ("_ARRAY_DIMENSIONS") to be present in array metadata. 
If we define the ZOM for Zarr V2 as dataclasses, we can use the Python type-checking tool `mypy` to statically check that a Zarr hierarchy complies with the `xarray` convention:
+
+```python
+from dataclasses import dataclass
+from typing import Generic, TypeVar, Mapping, List, Literal, Any, Union
+
+TAttrs = TypeVar('TAttrs')
+TMember = TypeVar('TMember')
+
+@dataclass
+class GroupSpec(Generic[TAttrs, TMember]):
+    zarr_version = 2
+    attributes: TAttrs
+    members: Mapping[str, TMember]
+
+@dataclass
+class ArraySpec(Generic[TAttrs]):
+    zarr_version = 2
+    attributes: TAttrs
+    shape: List[int]
+    dtype: str
+    chunks: List[int]
+    compressor: Any
+    dimension_separator: Union[Literal["."], Literal["/"]]
+
+@dataclass
+class XAttrs:
+    _ARRAY_DIMENSIONS: List[str]
+
+XArraySpec = ArraySpec[XAttrs]
+XGroupSpec = GroupSpec[Any, Union[XArraySpec, GroupSpec]]
+
+# This passes the type checker
+valid = XGroupSpec(
+    attributes={},
+    members={
+        'array_0': XArraySpec(
+            shape=[10, 10],
+            attributes=XAttrs(_ARRAY_DIMENSIONS=['a', 'b']),
+            dtype='uint8',
+            chunks=[10, 10],
+            compressor=None,
+            dimension_separator='.')},
+    )
+
+# This fails type checking because of the missing _ARRAY_DIMENSIONS array attribute
+invalid = XGroupSpec(
+    attributes={},
+    members={
+        'array_0': XArraySpec(
+            shape=[10, 10],
+            attributes={'foo': 10},
+            dtype='uint8',
+            chunks=[10, 10],
+            compressor=None,
+            dimension_separator='/')},
+    )
+"""
+Argument "attributes" to "ArraySpec" has incompatible type "dict[str, int]"; expected "XAttrs"
+"""
+```
+
+The same result is possible in TypeScript:
+
+```typescript
+
+type GroupSpec<TAttr, TMember> = {
+    zarr_version: 2
+    attributes: TAttr
+    members: {[key: string]: TMember}
+}
+
+type ArraySpec<TAttr> = {
+    zarr_version: 2
+    attributes: TAttr
+    shape: number[]
+    chunks: number[]
+    dtype: string
+    compressor: any
+    dimension_separator: "." | "/"
+}
+type XAttrs = {
+    _ARRAY_DIMENSIONS: string[]
+}
+
+type XArraySpec = ArraySpec<XAttrs>
+type XGroupSpec = GroupSpec<any, XArraySpec | GroupSpec<any, any>>
+
+const valid: XGroupSpec = {
+    zarr_version: 2,
+    attributes: {},
+    members: {
+        'array_0': {
+            zarr_version: 2,
+            dtype: 'uint8',
+            attributes: {'_ARRAY_DIMENSIONS': ['a', 'b']},
+            shape: [10, 10],
+            chunks: [10, 10],
+            compressor: undefined,
+            dimension_separator: "/"
+        }
+    }
+}
+
+// This fails type checking because of the missing _ARRAY_DIMENSIONS array attribute
+const invalid: XGroupSpec = {
+    zarr_version: 2,
+    attributes: {},
+    members: {
+        'array_0': {
+            zarr_version: 2,
+            dtype: 'uint8',
+            attributes: {'foo': ['a', 'b']},
+            shape: [10, 10],
+            chunks: [10, 10],
+            compressor: undefined,
+            dimension_separator: "/"
+        }
+    }
+}
+/*
+Type '{ foo: string[]; }' is not assignable to type 'XAttrs'.
+  Object literal may only specify known properties, and ''foo'' does not exist in type 'XAttrs'.(2322)
+*/
+```
+Static type checking on ZOM data structures offers an additional level of safety for applications that manipulate structured Zarr hierarchies, but not every invariant of a structured hierarchy can be expressed statically -- consider the constraint "the length of the `_ARRAY_DIMENSIONS` attribute must match the length of the `shape` attribute". This is not statically checkable, because the shape of an array may not be known before runtime. Such value-dependent validation can be added by runtime type checkers like `pydantic` for Python, or `zod` for TypeScript.
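
As a rough illustration of the value-dependent check mentioned above, the following sketch enforces the "one dimension name per axis" constraint at construction time. It deliberately uses only the standard-library `dataclasses` hook `__post_init__` instead of `pydantic`, and the names `CheckedArraySpec` and `array_dimensions` are hypothetical stand-ins, not part of the ZOM definition.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CheckedArraySpec:
    """Hypothetical ZOM-like array node that validates itself when constructed."""

    shape: List[int]
    array_dimensions: List[str]  # stands in for the `_ARRAY_DIMENSIONS` attribute

    def __post_init__(self) -> None:
        # value-dependent invariant: one dimension name per array axis
        if len(self.array_dimensions) != len(self.shape):
            raise ValueError(
                f"expected {len(self.shape)} dimension names, "
                f"got {len(self.array_dimensions)}"
            )


# a consistent node is constructed without error
ok = CheckedArraySpec(shape=[10, 10], array_dimensions=["a", "b"])

# an inconsistent node is rejected at runtime, not by the static type checker
try:
    CheckedArraySpec(shape=[10, 10], array_dimensions=["a"])
except ValueError as err:
    message = str(err)
```

A runtime validation library would typically express the same invariant as a validator attached to the model; the sketch only shows where such a check sits relative to the static types.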
## Related Work From 659a09048effc268c44d9b9535ef607fae6bfd3d Mon Sep 17 00:00:00 2001 From: Davis Vann Bennett Date: Sun, 29 Oct 2023 22:56:18 +0100 Subject: [PATCH 8/9] spec: style --- draft/ZEP0006.md | 16 ++++++---------- 1 file changed, 6 insertions(+), 10 deletions(-) diff --git a/draft/ZEP0006.md b/draft/ZEP0006.md index 043786d..90eab27 100644 --- a/draft/ZEP0006.md +++ b/draft/ZEP0006.md @@ -21,28 +21,24 @@ Created: 2023-07-20 ## Abstract -This ZEP defines Zarr Object Models, or ZOMs. A ZOM is a language-independent interface that describes an abstract Zarr hierarchy as a tree of nodes. ZOMs are parametrized by a particular Zarr version, so there is a ZOM for Zarr Version 2, which differs from the ZOM for Zarr V3. +This ZEP defines Zarr Object Models, or ZOMs. A ZOM is a language-independent interface that describes an abstract Zarr hierarchy as a tree of nodes. The ZOM forms the basis for a declarative, type-safe approach to managing Zarr hierarchies. -The basic ZOM defines two types of nodes: arrays and groups. Both types of nodes have an `attributes` property, which is an object with string keys and arbitrary values. The base ZOM does not define the exact properties of arrays, as these properties vary with Zarr versions. In the base ZOM, groups have a property called `members`, which is an object with string keys and values that are either arrays or groups. This definition is designed to be abstract enough that it can be implemented by a range of programming languages, and expressed in a wide range of interchange formats. +A ZOM defines two types of nodes: arrays and groups, which schematize Zarr arrays and Zarr groups, respectively. ZOM arrays and groups both have an `attributes` property, which is an object with string keys and arbitrary values. The base ZOM does not define the exact properties of arrays, as these properties vary with Zarr versions. 
ZOM groups have a property called `members`, which is an object with string keys and values that are either arrays or groups; thus, the ZOM can represent an arbitrarily structured tree of arrays and groups. -The ZOM forms the basis for a declarative, type-safe approach to managing Zarr hierarchies. +These definitions are designed to be abstract enough to be implemented by a range of programming languages, and expressed in a wide range of interchange formats. Applications can use the ZOM for a particular Zarr version to implement declarative APIs for accessing structured Zarr hierarchies. ## Motivation and Scope -The Zarr specifications define models of arrays and groups; the Zarr specifications do not define models for hierarchies of arrays and groups. This is unfortunate, because many users of Zarr work primarily with structured hierarchies. +The Zarr specifications define models for arrays and groups, but the Zarr specifications do not define models for *hierarchies* of arrays and groups. This is unfortunate, because many applications using Zarr operate on the level of structured hierarchies rather than individual groups or arrays. -For example, the python library `xarray` defines data structures that can be persisted to Zarr as a Zarr group containing one or more Zarr arrays with specific metadata. The full specification of this format can be found [here](https://docs.xarray.dev/en/stable/internals/zarr-encoding-spec.html#zarr-encoding). For `xarray` (or any other application that works with structured hieararchies) to save data to Zarr, it must first create a compliant Zarr hierarchy. To read data from Zarr, `xarray` must first check if the potential source of data is an `xarray`-compliant Zarr hierarchy. +For example, the python library `xarray` defines data structures that can be persisted to Zarr as a Zarr group containing one or more Zarr arrays with specific metadata (a key called "_ARRAY_DIMENSIONS", which must have a list of strings as its value). 
The full specification of this format can be found [here](https://docs.xarray.dev/en/stable/internals/zarr-encoding-spec.html#zarr-encoding). For `xarray` (or any other application that works with structured hierarchies) to save data to Zarr, it must first create a compliant Zarr hierarchy. To read data from Zarr, `xarray` must first check if the potential source of data is an `xarray`-compliant Zarr hierarchy. -Creating and validating Zarr hierarchies can be done procedurally, i.e. as a sequence of Zarr array and group access routines, or declaratively, as a hierarchy definition followed by a procedure that implements the definition. The latter is preferable, but it first requires a machine-readable data model for a Zarr hierarchy. - -Such models should be sufficient to express the structure of a Zarr hierarchy, and these models must be usable by Zarr implementations as a basis for declarative APIs for creating and validating Zarr hierarchies. That is the central goal of this proposal. +Creating and validating Zarr hierarchies can be done procedurally, i.e. as a sequence of Zarr array and group access routines, or declaratively, as a hierarchy definition followed by a procedure that creates a Zarr hiearchiy consistent with that definition. In many cases the declarative approach is preferable, but it requires a machine-readable data model for a Zarr hierarchy. Defining such a model is the central goal of this proposal. ## Definition of hierarchy structure This document distinguishes the *structure* of a Zarr hierarchy from the data stored in the hierarchy. The structure of a Zarr hierarchy is the layout of the tree of arrays and groups, and the metadata of those arrays and groups. This definition omits the data stored in the arrays, and the particular storage backend used to store data and metadata. 
By these definitions, two distinct Zarr hierarchies can have the same structure even if their arrays contain different values, and / or the hierarchies are stored using different storage backends. -Because the structure of a zarr hierarchy is decoupled, by definition, from the data stored in the hierarchy, it should be possible to represent the structure of a Zarr hierarchy with a compact data structure or interface. Such a data structure or interface would facilitate operations like evaluating whether two Zarr hierarchies are identically structured, evaluating whether a given Zarr hierarchy has a specific structure, or creating a Zarr hierarchy with a desired structure. This document formalizes the Zarr Object Model (ZOM), an abstract model of the structure of a Zarr hierarchy. The ZOM serves as a foundation for tools that create and manipulate Zarr hierarchies at a structural level. - ## Specification of the base Zarr Object Model We begin with a definition of a "base" Zarr Object Model. On its own, the base ZOM is not useful for working with actual Zarr hierarchies, because it contains a reference to an unspecified Zarr version. By supplying definitions from a particular Zarr version, we can specialize the base ZOM and produce an object that can be used for doing actual work. 
From 20c9cc03394a5a7bf495facce65ce030e98245ec Mon Sep 17 00:00:00 2001 From: Davis Vann Bennett Date: Sun, 29 Oct 2023 23:06:03 +0100 Subject: [PATCH 9/9] style --- draft/ZEP0006.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/draft/ZEP0006.md b/draft/ZEP0006.md index 90eab27..5d82cae 100644 --- a/draft/ZEP0006.md +++ b/draft/ZEP0006.md @@ -33,7 +33,7 @@ The Zarr specifications define models for arrays and groups, but the Zarr specif For example, the python library `xarray` defines data structures that can be persisted to Zarr as a Zarr group containing one or more Zarr arrays with specific metadata (a key called "_ARRAY_DIMENSIONS", which must have a list of strings as its value). The full specification of this format can be found [here](https://docs.xarray.dev/en/stable/internals/zarr-encoding-spec.html#zarr-encoding). For `xarray` (or any other application that works with structured hierarchies) to save data to Zarr, it must first create a compliant Zarr hierarchy. To read data from Zarr, `xarray` must first check if the potential source of data is an `xarray`-compliant Zarr hierarchy. -Creating and validating Zarr hierarchies can be done procedurally, i.e. as a sequence of Zarr array and group access routines, or declaratively, as a hierarchy definition followed by a procedure that creates a Zarr hiearchiy consistent with that definition. In many cases the declarative approach is preferable, but it requires a machine-readable data model for a Zarr hierarchy. Defining such a model is the central goal of this proposal. +These actions access Zarr hierarchies, not individual arrays or groups. Accessing Zarr hierarchies can be done procedurally, i.e. as a sequence of Zarr array and group access routines, or declaratively, as a hierarchy definition followed by a procedure that creates a Zarr hierarchy consistent with that definition. 
The data models defined in the Zarr specifications are only sufficient for procedural manipulation of Zarr hierarchies. This level of abstraction is too low -- tasks like checking if an arbitrary Zarr group is compatible with the `xarray` format would be easier with a declarative API. But a declarative hierarchy API requires defining data models for Zarr hierarchies. This is the central goal of this document. ## Definition of hierarchy structure