Skip to content

(feat): typesize declared with constructor for Blosc #713

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Mar 4, 2025

Conversation

ilan-gold
Copy link
Contributor

@ilan-gold ilan-gold commented Mar 3, 2025

See zarr-developers/zarr-python#2766 and zarr-developers/zarr-python#2171 - I checked locally the reproducer in the later for the compression ratio, and we now are getting the old performance with 1383 bytes

I think a good place to start is to declare the itemsize information, which is stored anyway on the BloscCodec

This PR does no validation of typesize, although it could in theory if the incoming buffer is a numpy array. I'm not sure that's desireable though as this is purely a performance thing (i.e., a "wrong" typesize doesn't actually break compression/decompression). If you're so low-level that you're changing this while using arrays, and you get it wrong, you'd probably also notice that your files have bad compression ratios

TODO:

  • Unit tests and/or doctests in docstrings
  • Tests pass locally
  • Docstrings and API docs for any new/modified user-facing classes and functions
  • Changes documented in docs/release.rst
  • Docs build locally
  • GitHub Actions CI passes
  • Test coverage to 100% (Codecov passes)

@d-v-b
Copy link
Contributor

d-v-b commented Mar 3, 2025

this a good idea, but we should be sure that we don't cause backwards compatibility problems with this, since it adds a key to zarr metadata. How do various zarr clients like zarr-python or tensorstore handle extra keys in the blosc JSON definition?

@ilan-gold
Copy link
Contributor Author

this a good idea, but we should be sure that we don't cause backwards compatibility problems with this, since it adds a key to zarr metadata.

Why does it add a key? One of the reason I feel so secure doing this is precisely that I thought he were adding the key anyway: https://github.com/zarr-developers/zarr-python/blob/680142f6f370d862f305a9b37f3c0ce2ce802e76/src/zarr/codecs/blosc.py#L122-L136 so I felt extremely confident making this change. Maybe I'm misunderstanding

@ilan-gold
Copy link
Contributor Author

ilan-gold commented Mar 3, 2025

This feels similar to the Zstd levels business - it's encode only, and can be safely (i.e., nothing breaks) ignored for reading back in. I understand why we serialize it, but it seems people sometimes ignore it anyway (thinking of the NVComp Zstd) with no effect (the utility/quality of such a decision can be debated probably)

@d-v-b
Copy link
Contributor

d-v-b commented Mar 3, 2025

this a good idea, but we should be sure that we don't cause backwards compatibility problems with this, since it adds a key to zarr metadata.

Why does it add a key? One of the reason I feel so secure doing this is precisely that I thought he were adding the key anyway: https://github.com/zarr-developers/zarr-python/blob/680142f6f370d862f305a9b37f3c0ce2ce802e76/src/zarr/codecs/blosc.py#L122-L136 so I felt extremely confident making this change. Maybe I'm misunderstanding

aha, I didn't see that we were serializing this already. So there's no backwards compatibility issues to worry about. which is great

Copy link

codecov bot commented Mar 3, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.96%. Comparing base (8168e15) to head (80ff632).
Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #713   +/-   ##
=======================================
  Coverage   99.96%   99.96%           
=======================================
  Files          63       63           
  Lines        2754     2771   +17     
=======================================
+ Hits         2753     2770   +17     
  Misses          1        1           
Files with missing lines Coverage Δ
numcodecs/tests/test_blosc.py 100.00% <100.00%> (ø)

@ilan-gold ilan-gold marked this pull request as ready for review March 3, 2025 15:39
@ilan-gold ilan-gold requested a review from dstansby March 3, 2025 15:40
@ilan-gold ilan-gold changed the title (feat): typesize declared with constructor (feat): typesize declared with constructor for Blosc Mar 4, 2025
@dstansby dstansby enabled auto-merge (squash) March 4, 2025 12:46
@dstansby dstansby merged commit 3c933cf into zarr-developers:main Mar 4, 2025
27 of 28 checks passed
@ilan-gold ilan-gold deleted the ig/itemsize_blosc branch March 4, 2025 12:54
LDeakin added a commit to zarrs/zarrs that referenced this pull request Apr 11, 2025
LDeakin added a commit to zarrs/zarrs that referenced this pull request Apr 11, 2025
@LDeakin
Copy link

LDeakin commented Apr 11, 2025

this a good idea, but we should be sure that we don't cause backwards compatibility problems with this, since it adds a key to zarr metadata. How do various zarr clients like zarr-python or tensorstore handle extra keys in the blosc JSON definition?

Indeed! This was a breaking change after all for implementations that reject unknown keys, as it seemingly was not serialised in the past. See this diff in https://github.com/LDeakin/zarrs/pull/174/files.

@ilan-gold
Copy link
Contributor Author

ilan-gold commented Apr 11, 2025

@LDeakin looking back, I misunderstood @d-v-b 's point - I thought he was referring to https://github.com/zarr-developers/zarr-python/blame/018f61d93207112f68eba06ea8c2560a489767f6/src/zarr/codecs/blosc.py#L127-L136 (which has no change) and not the fact that this codec is used directly by zarr v2 file-format

@d-v-b
Copy link
Contributor

d-v-b commented Apr 11, 2025

ok so this was actually a breaking change, that's my bad, I definitely should have reviewed this more carefully. But we also really need to be catching this with tests.

@jbms @bogovicj sorry for the inconvenience but you might have to change how your implementations handle the blosc codec in zarr v2 data.

@dstansby
Copy link
Contributor

Sorry, I don't have time to follow in detail, but this isn't released yet (right?), so we could in theory revert it if it's too much of an issue?

@d-v-b
Copy link
Contributor

d-v-b commented Apr 11, 2025

it was released in 0.16

@dstansby
Copy link
Contributor

Sorry, for some reason I thought this was zarr-python and not numcodecs 🤦

@jbms
Copy link

jbms commented Apr 11, 2025

Please revert/fix this to minimize breakage, e.g. avoid serializing the typesize parameter when it is None.

Zarr-python/numcodecs also fails on unknown keys so this also breaks compatibility with previous versions of numcodecs/zarr-python.

In general backwards compatibility is tricky so it would be great if you can take care when making any changes to the metadata.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants