
Conversation

@shoyer (Contributor) commented Aug 28, 2025

This feature (partial writes via Store.set_partial_values) was unused by the rest of Zarr-Python, and was only implemented for LocalStore and stores that wrap other stores.

The Zarr v3 spec still mentions partial writes, so it should probably also be updated.

Fixes #2859

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.rst
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

shoyer added 3 commits August 28, 2025 10:55

codecov bot commented Aug 28, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 94.89%. Comparing base (710f5df) to head (de0e0a8).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
```diff
@@            Coverage Diff             @@
##             main    #3413      +/-   ##
==========================================
+ Coverage   94.70%   94.89%   +0.19%
==========================================
  Files          79       79
  Lines        9549     9501      -48
==========================================
- Hits         9043     9016      -27
+ Misses        506      485      -21
```
| Files with missing lines | Coverage Δ |
| --- | --- |
| src/zarr/abc/store.py | 95.74% <100.00%> (-0.06%) ⬇️ |
| src/zarr/storage/_common.py | 92.89% <100.00%> (+0.46%) ⬆️ |
| src/zarr/storage/_fsspec.py | 91.01% <ø> (+0.40%) ⬆️ |
| src/zarr/storage/_local.py | 98.23% <100.00%> (+5.84%) ⬆️ |
| src/zarr/storage/_logging.py | 100.00% <ø> (+2.11%) ⬆️ |
| src/zarr/storage/_memory.py | 93.87% <ø> (+0.80%) ⬆️ |
| src/zarr/storage/_obstore.py | 93.71% <ø> (-0.15%) ⬇️ |
| src/zarr/storage/_wrapper.py | 100.00% <ø> (+1.20%) ⬆️ |
| src/zarr/storage/_zip.py | 97.60% <ø> (+0.54%) ⬆️ |
| src/zarr/testing/stateful.py | 98.84% <ø> (+0.55%) ⬆️ |
| ... and 1 more | |

@shoyer (Contributor, Author) commented Aug 28, 2025

@d-v-b Yes, in theory partial writes would be helpful for writing into sharded arrays. But cloud object stores (e.g., S3 and GCS) don't support partial writes, so Zarr-Python is never going to be able to make use of them for the most performance-critical stores.

As for local and memory stores, I don't think there are good use cases for sharding, because file system access does not have the latency issues that make using small chunks problematic.

It is possible that there are some distributed filesystems that support partial writes and would be able to realize some performance benefits. But in general filesystems don't support concurrent writes to the same file, so this could only happen with one writer at a time and would require locking for safety.

Overall, I don't think the niche benefits of partial writes outweigh the complexity cost of including them in the Zarr spec. A better model is to treat a shard as something that needs to be written all at once in all cases.
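
For context, a hedged sketch of the API being removed, approximating the old signature and the seek-and-write implementation that only a local filesystem can really offer (class name and details are illustrative, not the actual zarr-python code):

```python
from collections.abc import Iterable
from pathlib import Path

class LocalStoreSketch:
    """Illustrative stand-in for LocalStore; the real method lived on
    zarr.abc.store.Store, with roughly this signature."""

    def __init__(self, root: Path) -> None:
        self.root = root

    async def set_partial_values(
        self, key_start_values: Iterable[tuple[str, int, bytes]]
    ) -> None:
        for key, start, value in key_start_values:
            path = self.root / key        # keys map to file paths
            with path.open("r+b") as f:   # the file must already exist
                f.seek(start)             # jump to the requested byte offset
                f.write(value)            # overwrite bytes in place
```

Note the concurrency hazard: two writers calling this on the same file would need external locking.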

@d-v-b (Contributor) commented Aug 28, 2025

> @d-v-b Yes, in theory partial writes would be helpful for writing into sharded arrays. But cloud object stores (e.g., S3 and GCS) don't support partial writes, so Zarr-Python is never going to be able to make use of them for the most performance-critical stores.

For S3, a combination of multi-part upload and multi-part copy operations could be used to implement a byte-range write.

> As for local and memory stores, I don't think there are good use cases for sharding, because file system access does not have the latency issues that make using small chunks problematic.

The use of sharding on local storage is generally motivated by a desire to keep the number of files low, which reduces latency for any O(shards) operations, like listing the contents of a directory, or simply copying data with drag and drop. People absolutely use sharding for this, so it's the opposite of niche.

> It is possible that there are some distributed filesystems that support partial writes and would be able to realize some performance benefits. But in general filesystems don't support concurrent writes to the same file, so this could only happen with one writer at a time and would require locking for safety.

> Overall, I don't think the niche benefits of partial writes outweigh the complexity cost of including them in the Zarr spec. A better model is to treat a shard as something that needs to be written all at once in all cases.

What complexity are we talking about here? An extra parameter in set?

@d-v-b (Contributor) commented Aug 28, 2025

Also, zarr.abc.store.Store is public API, and there are definitely people using this interface downstream of Zarr-Python. We can't just remove keyword arguments. If there's a decision to deprecate the byte_range parameter in set, we need a deprecation cycle.
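
A minimal sketch of what that cycle could look like (hypothetical shim, not code from this repo): keep accepting the keyword for a release, but warn that it is going away:

```python
import warnings

class StoreWithDeprecatedByteRange:
    """Hypothetical deprecation shim for a soon-to-be-removed keyword."""

    async def set(
        self, key: str, value: bytes, byte_range: tuple[int, int] | None = None
    ) -> None:
        if byte_range is not None:
            warnings.warn(
                "the byte_range parameter to set() is deprecated "
                "and will be removed in a future release",
                DeprecationWarning,
                stacklevel=2,
            )
        ...  # perform the write
```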

@shoyer (Contributor, Author) commented Aug 28, 2025

> For S3, a combination of multi-part upload and multi-part copy operations could be used to implement a byte-range write.

I haven't used S3 in quite a while, but my understanding is that multi-part upload is quite different from partial writes. You have to supply all the parts to overwrite an object; you can't upload just one part at a time.

> Overall, I don't think the niche benefits of partial writes outweigh the complexity cost of including them in the Zarr spec. A better model is to treat a shard as something that needs to be written all at once in all cases.

> What complexity are we talking about here? An extra parameter in set?

The complexity is having a method in Zarr's core Store interface (one of only 7) that Zarr doesn't actually use, and one that even in theory could only be supported by a small fraction of stores, with major caveats around concurrency.

> Also, zarr.abc.store.Store is public API, and there are definitely people using this interface downstream of Zarr-Python. We can't just remove keyword arguments.

Sure, we can keep set_partial_values around on LocalStore, LoggingStore, and WrapperStore for now with a deprecation warning, if you like, even though it has zero test coverage.

In every other case, calling set_partial_values or supplying a byte_range parameter in set raises NotImplementedError. I don't think that's much more useful than AttributeError.

@d-v-b (Contributor) commented Aug 28, 2025

> For S3, a combination of multi-part upload and multi-part copy operations could be used to implement a byte-range write.

> I haven't used S3 in quite a while, but my understanding is that multi-part upload is quite different from partial writes. You have to supply all the parts to overwrite an object; you can't upload just one part at a time.

This is true, but I think you could initiate a multi-part upload, upload the part that needs to be updated, then use PUT part/copy (UploadPartCopy) for the rest. In this scheme, you supply some of the bytes locally, and the rest are sourced from an existing object. S3 experts should weigh in on whether this is actually possible.
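
A hedged boto3 sketch of that stitching trick (bucket, key, and offsets are illustrative; note that S3 requires every part except the last to be at least 5 MiB, which limits how fine-grained such emulated byte-range writes can be):

```python
import boto3

s3 = boto3.client("s3")
bucket, key = "my-bucket", "array/c/0/0"  # illustrative names
new_bytes = b"..."  # replacement bytes for the range being rewritten

mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
upload_id = mpu["UploadId"]
parts = []

# Part 1: server-side copy of the unchanged prefix (no download needed).
resp = s3.upload_part_copy(
    Bucket=bucket, Key=key, UploadId=upload_id, PartNumber=1,
    CopySource={"Bucket": bucket, "Key": key},
    CopySourceRange="bytes=0-8388607",  # first 8 MiB, unchanged
)
parts.append({"PartNumber": 1, "ETag": resp["CopyPartResult"]["ETag"]})

# Part 2: upload the new bytes for the rewritten range.
resp = s3.upload_part(
    Bucket=bucket, Key=key, UploadId=upload_id, PartNumber=2, Body=new_bytes,
)
parts.append({"PartNumber": 2, "ETag": resp["ETag"]})

# (A third upload_part_copy would cover the unchanged suffix, if any.)

s3.complete_multipart_upload(
    Bucket=bucket, Key=key, UploadId=upload_id,
    MultipartUpload={"Parts": parts},
)
```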

> The complexity is having a method in Zarr's core Store interface (one of only 7) that Zarr doesn't actually use, and one that even in theory could only be supported by a small fraction of stores, with major caveats around concurrency.

While local and memory storage are only two of many possible stores, I think they are actually extremely important to Zarr, and we should make every effort to ensure really good performance with both of these storage backends. For sharded, uncompressed data, separate chunks in a shard can be written concurrently, and I think this is potentially extremely valuable for a lot of applications, as it alleviates one of the big downsides to sharding (loss of write parallelism). And I think selectively writing byte ranges would also be useful for in-memory storage, when we get around to tuning the performance of that.
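
To make the parallelism point concrete, a hedged sketch (POSIX-only, via os.pwrite; the shard path, chunk shape, and layout are hypothetical): with fixed-size uncompressed inner chunks, each chunk owns a disjoint, precomputable byte range of the shard file, so independent writers never overlap:

```python
import os
import numpy as np

CHUNK_SHAPE = (64, 64)       # hypothetical inner chunk shape
CHUNK_NBYTES = 64 * 64 * 4   # float32, uncompressed => fixed size per chunk

def write_inner_chunk(fd: int, chunk_index: int, chunk: np.ndarray) -> None:
    """Overwrite one inner chunk in place; neighbouring chunks are untouched."""
    offset = chunk_index * CHUNK_NBYTES  # offset is known without reading anything
    os.pwrite(fd, np.ascontiguousarray(chunk).tobytes(), offset)

# Disjoint offsets mean different threads or processes can write different
# inner chunks of the same shard concurrently, without locking -- restoring
# the write parallelism that sharding otherwise loses.
fd = os.open("c/0/0", os.O_WRONLY)   # hypothetical shard file
write_inner_chunk(fd, 3, np.ones(CHUNK_SHAPE, dtype=np.float32))
os.close(fd)
```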

For this PR, I'd recommend putting a deprecation warning on set_partial_values, because we definitely don't need that if set takes a byte range. My feeling is that we keep the byte_range parameter on set until it causes concrete problems for someone.

@shoyer (Contributor, Author) commented Aug 28, 2025

I would be curious what other Zarr maintainers think about the value of keeping an expanded API for this hypothetical use case (sharded stores without compression in local or memory stores).

In my opinion, local and memory stores are very niche, and mostly relevant for testing. There are better file formats than Zarr for uncompressed data that fits on a single machine (e.g., HDF5, NPZ).

Note that it's only StorePath.set that has a byte_range parameter, not Store.set.

@d-v-b (Contributor) commented Aug 28, 2025

> In my opinion, local and memory stores are very niche, and mostly relevant for testing. There are better file formats than Zarr for uncompressed data that fits on a single machine (e.g., HDF5, NPZ).

There are lots of zarr users who never touch cloud data. Such users have even told me that, for them, cloud storage is niche! So local storage is definitely used for more than testing.

And as for memory storage, there are a lot of internal operations that could be refactored as store -> store transformations, where the source store is a memory store (for example, creating a set of metadata documents in memory storage, then copying that memory store to an external store, as sketched below). In this case, we want memory storage to be very performant.
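
A hedged sketch of that pattern against the async Store API (the helper name is hypothetical; signatures approximated from zarr-python's Store ABC):

```python
from zarr.abc.store import Store
from zarr.core.buffer import default_buffer_prototype

async def copy_store(src: Store, dst: Store) -> None:
    """Stream every key from src into dst; with a memory-backed src,
    the reads here are essentially free."""
    prototype = default_buffer_prototype()
    async for key in src.list():
        value = await src.get(key, prototype=prototype)
        if value is not None:  # a key could vanish between list() and get()
            await dst.set(key, value)
```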

> Note that it's only StorePath.set that has a byte_range parameter, not Store.set.

This is a key point I had overlooked -- only StorePath.set (and the ByteGetter protocol's set) take the byte range. That makes the change in this PR much lower impact than I initially thought. I think we can be confident that nobody is subclassing StorePath and implementing set with a byte range, and I also think we can ultimately remove the set_partial_values from the base store class. The only question is whether we do this immediately, or gradually (via deprecations).
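
For readers following along, a rough sketch of that distinction (hedged; types reduced to builtins, whereas the real code uses Buffer and byte-range request classes):

```python
class Store:
    async def set(self, key: str, value: bytes) -> None:
        """Abstract store method: whole-object writes only."""
        ...

class StorePath:
    async def set(
        self, value: bytes, byte_range: tuple[int, int] | None = None
    ) -> None:
        """Path-level convenience wrapper: this is where the byte_range
        parameter removed by this PR lived, not on Store.set."""
        ...
```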

@LDeakin (Member) commented Aug 29, 2025

> I would be curious what other Zarr maintainers think about the value of keeping an expanded API

> Overall, I don't think the niche benefits of partial writes outweigh the complexity cost of including them in the Zarr spec.

That part of the spec is not intended to be normative, so zarr-python does not have to implement it.

> A better model is to treat a shard as something that needs to be written all at once in all cases.

It certainly is a simpler model, but I thought it worth mentioning that I have found utility in partial writing with zarrs, at least on niche local stores 🥲. The sharding codec supports incrementally (but not concurrently¹) updating / appending inner chunks and updating the index of a shard (via partial encoding; see the sketch after the footnote). So, I can independently control:

  • the shard shape (to keep inode usage down),
  • the write granularity (to limit memory usage when processing), and
  • the inner chunk shape / read granularity (for efficient "cloud visualisation" via neuroglancer etc).

Footnotes

  1. There are major concerns around synchronisation with partial writing. In zarrs, the responsibility is placed on the consumer to ensure that partial write operations are not executed concurrently on the same chunk (shard).
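
For anyone unfamiliar with partial encoding of shards, a hedged sketch of the mechanism (not zarrs's actual implementation) against the Zarr v3 sharding layout, assuming a start-located little-endian (offset, nbytes) index with no checksum codec, and POSIX os.pwrite:

```python
import os
import struct

ENTRY = struct.Struct("<QQ")  # per-inner-chunk (offset, nbytes), little-endian

def update_inner_chunk(path: str, chunk_i: int, data: bytes) -> None:
    """Re-encode one inner chunk: append its new bytes at the end of the
    shard file, then patch its index entry in place -- two partial writes."""
    new_offset = os.path.getsize(path)  # append position
    fd = os.open(path, os.O_WRONLY)
    try:
        os.pwrite(fd, data, new_offset)             # 1) write the chunk bytes
        entry = ENTRY.pack(new_offset, len(data))
        os.pwrite(fd, entry, chunk_i * ENTRY.size)  # 2) patch the index entry
    finally:
        os.close(fd)
    # The superseded chunk bytes become dead space until the shard is rewritten.
```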

@d-v-b (Contributor) commented Aug 29, 2025

I'm now in favor of the changes here. I think it's very unlikely that this causes any problems for downstream libraries.

If/when we do get around to supporting byte-range writes on niche filesystems 😉, we can safely change the signature of set to set(key, value, *, byte_range=None), i.e. making byte_range keyword-only, or we can add a new method as needed. This can be driven by a concrete use case, like intra-shard chunk writes.

@d-v-b merged commit 6d4b5e7 into zarr-developers:main Aug 29, 2025
31 checks passed
Linked issue: StoreTests don't test Store.set_partial_values (#2859)