should we abstract over v2 and v3 codecs #2654

d-v-b · 2025-01-06T10:56:36Z

Over in #2647 @jni asked the following question:

To help us fix the napari side, can someone point to a compressor= dict that we can pass to zarr.open that will work on zarr-python 2.x and 3.x with zarr v2 and v3 arrays? 🙏

Unfortunately, at this time there is no dict or codec class instance that can satisfy this question. By design, v2 and v3 chunk encoding are completely distinct entities. I wonder if this is wise.

For example, can someone explain why we really need two versions of a gzip codec (one in numcodecs, and one defined here)? From what I can tell, the only differences between these two gzip codecs are the JSON serialization: the numcodecs version serializes to {"id": "gzip", "level": <int>}, while the zarr v3 version serializes to {"name": "gzip", "configuration": {"level": <int>}}. Should users creating arrays in zarr-python have to care about this minor difference?
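To make the difference concrete, here is a minimal sketch of converting between the two JSON shapes described above. The helper names `v2_to_v3` and `v3_to_v2` are hypothetical (they are not part of zarr-python or numcodecs), and the sketch assumes the codec's parameters are identical in both formats, as they appear to be for gzip.

```python
from typing import Any


def v2_to_v3(codec: dict[str, Any]) -> dict[str, Any]:
    """Turn a v2-style {"id": ..., **params} dict into the v3-style shape."""
    # Everything except the identifier becomes the v3 "configuration".
    configuration = {key: value for key, value in codec.items() if key != "id"}
    return {"name": codec["id"], "configuration": configuration}


def v3_to_v2(codec: dict[str, Any]) -> dict[str, Any]:
    """Turn a v3-style {"name": ..., "configuration": {...}} dict into the v2 shape."""
    return {"id": codec["name"], **codec.get("configuration", {})}


gzip_v2 = {"id": "gzip", "level": 5}
gzip_v3 = v2_to_v3(gzip_v2)
print(gzip_v3)
print(v3_to_v2(gzip_v3))
```

A purely mechanical mapping like this only holds when both formats agree on parameter names and values; codecs whose parameters differ between formats would need codec-specific handling.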

I don't think users need or want to care about the differences between zarr v2 and zarr v3 codec serialization. So I propose that we should allow code like create_array(...compressor=foo, zarr_format=2) and create_array(...compressor=foo, zarr_format=3) for the same value of foo.

Here's a simple short-term solution: for codecs like blosc and gzip that exist in both zarr v2 and v3, how about we allow functions like create_array to accept either the zarr v2 or the zarr v3 codec (or its dict form)?

Here's a more complex, longer term solution: all the codecs in numcodecs should be altered to produce either zarr v2 or zarr v3 JSON serializations. That is, the numcodecs Gzip should have a serialization method zarr v2 clients can use, and a separate serialization method that zarr v3 clients can use.


jni · 2025-01-06T12:13:36Z

I don't think users need or want to care about the differences between zarr v2 and zarr v3 codec serialization. So I propose that we should allow code like create_array(...compressor=foo, zarr_format=2) and create_array(...compressor=foo, zarr_format=3) for the same value of foo.

I'm very +1 to this as it feels like it would be relatively small effort (maybe even fitting in before 3.0) (and acknowledging I am saying this from a "haven't really looked at the code" perspective) and would be very useful to help projects transition.

normanrz · 2025-01-06T15:18:58Z

I definitely wouldn't want to rush this.

More generally, I see the 3.0 release as a v3-first library, with v2 in support mode. The library should support reading all v2 data, but the incentive for new data should be on v3. That is why we switched the default zarr_format. Therefore, I am more interested in designing good APIs that work for v3 instead of trying to paper over the differences of the 2 format versions.

Anyways, adding an implicit conversion for a transitional period would be fine with me. But more like a hotfix than a real solution.

TomNicholas · 2025-01-06T15:59:34Z

I think this would also be useful for VirtualiZarr.

d-v-b · 2025-06-02T11:08:28Z

I have a concrete plan to propose:

Zarr-python defines a Numcodec protocol, which is basically any class that has a codec_id attribute, an encode method, and a decode method. This will cover the codecs defined in numcodecs, as well as the codecs defined in imagecodecs, and any other codecs that use the numcodecs API.
Zarr-python defines a wrapper that gives an arbitrary object implementing the Numcodec protocol ArrayArrayCodec, ArrayBytesCodec, or BytesBytesCodec functionality. This means users can show up with arbitrary Numcodec instances, and zarr-python will make a best-effort attempt to wrap them. I.e., zarr.create_array(..., serializer=MyRandomV2CodecInstance) should just work, provided MyRandomV2CodecInstance can actually function as a serializer. We can't know until we run it.

For the Numcodec -> V3 codec wrappers, we will use what's currently in numcodecs.zarr3. We will not import it from numcodecs. That code can be safely deleted from numcodecs, resolving the circular dependency.
Zarr v3 codecs will have JSON serialization that is parameterized by the zarr format. That means the same codec class can serialize to the v2-style {id, ...} form, or the v3-style {name, configuration} form.
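As a rough illustration of the last point, a codec whose JSON serialization is parameterized by the zarr format might look like the sketch below. `GzipSketch` and its `to_json` method are hypothetical names used only for illustration; this is not the actual zarr-python codec class or its API.

```python
from dataclasses import dataclass
from typing import Any, Literal


@dataclass(frozen=True)
class GzipSketch:
    """Hypothetical codec holding one parameter, serializable for either format."""

    level: int = 5

    def to_json(self, zarr_format: Literal[2, 3]) -> dict[str, Any]:
        if zarr_format == 2:
            # v2 style: the identifier and parameters share one flat dict.
            return {"id": "gzip", "level": self.level}
        if zarr_format == 3:
            # v3 style: parameters are nested under "configuration".
            return {"name": "gzip", "configuration": {"level": self.level}}
        raise ValueError(f"unsupported zarr_format: {zarr_format!r}")


codec = GzipSketch(level=1)
print(codec.to_json(zarr_format=2))
print(codec.to_json(zarr_format=3))
```

With this shape, the same `codec` object could be handed to array creation regardless of the target format, and the format-specific JSON would be chosen only at serialization time.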

This will vastly simplify our codec logic, and support users providing numcodecs codecs without fuss.

d-v-b · 2025-06-02T11:10:10Z

Another benefit: by defining Numcodec as a protocol, we open up the possibility of making numcodecs itself an optional dependency.

And another important change we need to make for testing: every codec with a spec available on zarr-extensions should be tested against that spec. This means we do not blindly import stuff from numcodecs and assume it generates the right codec JSON.

normanrz · 2025-06-02T14:47:17Z

I think your proposal could work. I have a number of questions that might be answered by some concrete class/method signatures:

What does Numcodec.encode and Numcodec.decode take and return as arguments?
Is the protocol limited to numpy?
How do we make sure the metadata is compatible with both v2 and v3 in case there are slight differences (e.g. shuffle in blosc)?
How do we deal with codecs that are not registered extensions in v3? What will be their identifier in the metadata?

d-v-b · 2025-06-02T15:20:43Z

What does Numcodec.encode and Numcodec.decode take and return as arguments?

I was thinking Buffer | NDBuffer. Zarr python has a lot of control here, as the protocol has to be defined in a way that's compatible with our codec APIs, and numcodecs itself is not properly typed (thanks, pickle codec).

Is the protocol limited to numpy?

Again, up to us. If we use our own (numpy-compatible) data structures, then no. Practically everyone who has implemented the numcodecs API (numcodecs itself, and imagecodecs, and probably numcodecs-rs) is assuming numpy arrays, but I don't think we need to be limited to that.

How do we make sure the metadata is compatible with both v2 and v3 in case there are slight differences (e.g. shuffle in blosc)?

For codecs like blosc, where a v3 spec exists, we would replace a user-provided numcodecs.Blosc codec with the blosc codec defined inside Zarr Python. This would only occur in convenience routines like create_array. Users who really want to use a non-spec-compliant blosc codec are free to construct array metadata explicitly with that off-spec codec.
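A sketch of what that replacement could involve for the shuffle parameter is below. It assumes, based on my reading of the numcodecs docs and the v3 blosc codec spec, that numcodecs stores shuffle as an integer (0, 1, 2, or -1 for "auto") while the v3 spec uses the strings "noshuffle", "shuffle", and "bitshuffle" and additionally records a typesize. The function name `blosc_v2_to_v3` is hypothetical.

```python
from typing import Any

# Assumed mapping from numcodecs integer shuffle modes to v3 spec names.
_SHUFFLE_NAMES = {0: "noshuffle", 1: "shuffle", 2: "bitshuffle"}


def blosc_v2_to_v3(v2_config: dict[str, Any], typesize: int) -> dict[str, Any]:
    """Build a v3-style blosc dict from a v2-style config and the item size."""
    shuffle = v2_config["shuffle"]
    if shuffle == -1:
        # Assumption: "auto" shuffle means bitshuffle for 1-byte items,
        # byte shuffle otherwise, so it must be resolved using the dtype.
        shuffle = 2 if typesize == 1 else 1
    return {
        "name": "blosc",
        "configuration": {
            "cname": v2_config["cname"],
            "clevel": v2_config["clevel"],
            "shuffle": _SHUFFLE_NAMES[shuffle],
            "typesize": typesize,
            "blocksize": v2_config.get("blocksize", 0),
        },
    }


v2 = {"id": "blosc", "cname": "zstd", "clevel": 5, "shuffle": -1, "blocksize": 0}
print(blosc_v2_to_v3(v2, typesize=4))
```

The point of the sketch is that the conversion needs information from outside the codec config (the data type's item size), which is why it fits in a convenience routine like create_array rather than in a generic dict rewrite.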

How do we deal with codecs that are not registered extensions in v3? What will be their identifier in the metadata?

We use the codec_id attribute as the name, and the rest of the config parameters go in configuration. And socially, we spread the word about zarr-extensions.

normanrz · 2025-06-02T15:23:07Z

What does Numcodec.encode and Numcodec.decode take and return as arguments?

I was thinking Buffer | NDBuffer.

Wouldn't that require numcodecs to import Buffer and/or NDBuffer from zarr-python?

d-v-b · 2025-06-02T15:24:00Z

I am proposing that we do all of this in zarr python

normanrz · 2025-06-02T15:26:20Z

I thought numcodecs would implement the Numcodec protocol?
I guess a quick draft PR would be super useful here!

d-v-b · 2025-06-02T15:29:36Z

Since numcodecs defines an ABC already, there's not really a need for a protocol -- all the codecs in numcodecs can use regular inheritance. But for Zarr Python, we don't want to depend on nominal typing based on an external library (numcodecs).

So we would use the Numcodec protocol in order to avoid explicitly depending on numcodecs. This puts us in a position to make numcodecs an optional dependency, while still explicitly handling numcodecs or imagecodecs classes for people who need them. A PR will make this more clear.
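Until that PR exists, here is a minimal sketch of the structural-typing idea using `typing.Protocol`. The exact signatures are assumptions (the thread leaves the argument and return types open), and `ZlibLike` is a made-up stand-in for a third-party codec; note that nothing here imports numcodecs.

```python
import zlib
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class Numcodec(Protocol):
    """Anything with a codec_id plus encode/decode methods (signatures assumed)."""

    codec_id: str

    def encode(self, buf: Any) -> Any: ...

    def decode(self, buf: Any, out: Any = None) -> Any: ...


class ZlibLike:
    """Hypothetical third-party codec; it does not inherit from Numcodec."""

    codec_id = "zlib_like"

    def __init__(self, level: int = 1) -> None:
        self.level = level

    def encode(self, buf: Any) -> bytes:
        return zlib.compress(bytes(buf), self.level)

    def decode(self, buf: Any, out: Any = None) -> bytes:
        return zlib.decompress(bytes(buf))


codec = ZlibLike()
# The check is structural: it only looks for the required attribute names.
print(isinstance(codec, Numcodec))
print(codec.decode(codec.encode(b"hello zarr")))
```

A runtime `isinstance` check against a protocol only verifies that the members exist, not that they behave correctly, which matches the "we can't know until we run it" caveat above.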

d-v-b mentioned this issue Feb 5, 2025

Its possible to write zarr 2 format zarr's using zarr3 that can't be read by zarr v2 #2773

Open
