-
-
Notifications
You must be signed in to change notification settings - Fork 330
should we abstract over v2 and v3 codecs #2654
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
I'm very +1 to this as it feels like it would be relatively small effort (maybe even fitting in before 3.0) (and acknowledging I am saying this from a "haven't really looked at the code" perspective) and would be very useful to help projects transition. |
I definitely wouldn't want to rush this. More generally, I see the 3.0 release as a v3-first library, with v2 in support mode. The library should support reading all v2 data, but the incentive for new data should be on v3. That is why we switched the default Anyways, adding an implicit conversion for a transitional period would be fine with me. But more like a hotfix than a real solution. |
I think this would also be useful for VirtualiZarr. |
I have a concrete plan to propose:
This will vastly simplify our codec logic, and support users providing numcodecs codecs without fuss. |
Another benefit: by defining And another important change we need to make for testing: every codec with a spec available on |
I think your proposal could work. I have a number of questions that might be answered by some concrete class/method signatures:
|
I was thinking
Again, up to us. If we use our own (numpy-compatible) data structures, then no. Practically everyone who has implemented the numcodecs API (numcodecs itself, and imagecodecs, and probably numcodecs-rs) is assuming numpy arrays, but I don't think we need to be limited to that.
For codecs like blosc, where a v3 spec exists, would replace a user-provided
We use the |
Wouldn't that require numcodecs to import |
I am proposing that we do all of this in zarr python |
I thought numcodecs would implement the |
since numcodecs defines an abc already, there's not really a need for a protocol -- all the codecs in numcodecs can use regular inheritance. But for Zarr Python, we don't want to depend on nominal typing based on an external library (numcodecs). So we would use the |
Uh oh!
There was an error while loading. Please reload this page.
Over in #2647 @jni asked a the following question:
Unfortunately, at this time there is no dict or codec class instance that can satisfy this question. By design, v2 and v3 chunk encoding are completely distinct entities. I wonder if this is wise.
For example, can someone explain why we really need two versions of a gzip codec (one in numcodecs, and one defined here)? From what I can tell, the only differences between these two gzip codecs are the JSON serialization: the numcodecs version serializes to
{"id": "gzip", "level": <int>}
, while the zarr v3 version serializes to{"name": "gzip", "configuration": {"level": <int>}}
. Should users creating arrays inzarr-python
have to care about this minor difference?I don't think users need or want to care about the differences between zarr v2 and zarr v3 codec serialization. So I propose that we should allow code like
create_array(...compressor=foo, zarr_format=2)
andcreate_array(...compressor=foo, zarr_format=3)
for the same value offoo
.Here's a simple short-term solution: For codecs like blosc and gzip that can be found in zarr v2 and v3, how about we allow functions like
create_array
accept either the zarr v2 or zarr v3 codec (or its dict form)?Here's a more complex, longer term solution: all the codecs in
numcodecs
should be altered to produce either zarr v2 or zarr v3 JSON serializations. That is, the numcodecsGzip
should have a serialization method zarr v2 clients can use, and a separate serialization method that zarr v3 clients can use.The text was updated successfully, but these errors were encountered: