Remove Store.set_partial_writes #3413
Conversation
This feature was unused by the rest of Zarr-Python, and was only implemented for LocalStore and stores that wrap other stores. The Zarr v3 spec still mentions partial writes, so it should probably also be updated.

Fixes #2859

TODO:
- docs/user-guide/*.rst
- changes/
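For context, a "partial write" here means overwriting a byte range inside an existing stored object. On a local filesystem this is a one-liner, which is why LocalStore was the natural (and only) implementer. A minimal sketch; the helper name and signature below are illustrative, not the actual Zarr-Python Store API:

```python
def write_byte_range(path: str, offset: int, data: bytes) -> None:
    """Overwrite len(data) bytes of an existing file, starting at offset.

    Illustrative helper only -- not part of the actual Store API.
    """
    # "r+b" opens the file for in-place update without truncating it.
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(data)

# e.g. overwrite the second KiB of a chunk file in place
# (hypothetical path, for illustration only):
# write_byte_range("data.zarr/c/0/0", 1024, b"\x00" * 1024)
```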
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #3413      +/-   ##
==========================================
+ Coverage   94.70%   94.89%   +0.19%
==========================================
  Files          79       79
  Lines        9549     9501      -48
==========================================
- Hits         9043     9016      -27
+ Misses        506      485      -21
@d-v-b Yes, in theory partial writes would be helpful for writing into sharded arrays. But cloud object stores (e.g., S3 and GCS) don't support partial writes, so Zarr-Python is never going to be able to make use of partial writes for the most performance-critical stores.

As for local and memory stores, I don't think there are good use cases for sharding, because file system access does not have the latency issues that make using small chunks problematic. It is possible that there are some distributed filesystems that support partial writes and would be able to realize some performance benefits. But in general filesystems don't support concurrent writes to the same file, so this could only happen from one writer at a time and would require locking for safety.

Overall, I don't think the niche benefits of partial writes outweigh the complexity cost of including them in the Zarr spec. A better model is to treat shards as something that needs to be written all at once in all cases.
For S3, a combination of multi-part upload and multi-part copy operations could be used to implement a byte-range write.
The use of sharding on local storage is generally motivated by a desire to keep the number of files low, which reduces latency for any O(shards) operations, like listing the contents of a directory, or simply copying data with drag and drop. People absolutely use sharding for this, so it's the opposite of niche.
What complexity are we talking about here? An extra parameter in …?

Also the …
I haven't used S3 in quite a while, but my understanding is that multi-part upload is quite different from partial writes. You have to supply all the parts to overwrite a file. You can't upload just one part at a time.
The complexity is having a core method in Zarr's core Store interface that only a couple of stores actually implement.
Sure, we can keep … In every other case, calling …
This is true, but I think you could initiate a multi-part upload, upload the part that needs to be updated, then use PUT part/copy for the rest. In this scheme, you supply some of the bytes locally, and the rest are sourced from an existing object. S3 experts should weigh in on whether this is actually possible.
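A rough, untested sketch of that scheme with boto3 (bucket, key, and helper name are placeholders; it also glosses over S3's multipart constraints, e.g. every part except the last must be at least 5 MiB, which restricts where the byte-range boundaries can fall):

```python
import boto3

s3 = boto3.client("s3")
bucket, key = "my-bucket", "data.zarr/c/0/0"  # placeholders

def rewrite_middle_part(new_bytes: bytes, offset: int, total_size: int) -> None:
    """Emulate a byte-range write: copy the unchanged head and tail from
    the existing object server-side, upload only the changed bytes."""
    mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
    upload_id = mpu["UploadId"]
    parts = []

    # Part 1: server-side copy of the unchanged prefix.
    head = s3.upload_part_copy(
        Bucket=bucket, Key=key, PartNumber=1, UploadId=upload_id,
        CopySource={"Bucket": bucket, "Key": key},
        CopySourceRange=f"bytes=0-{offset - 1}",
    )
    parts.append({"PartNumber": 1, "ETag": head["CopyPartResult"]["ETag"]})

    # Part 2: upload the new bytes from local memory.
    mid = s3.upload_part(
        Bucket=bucket, Key=key, PartNumber=2, UploadId=upload_id,
        Body=new_bytes,
    )
    parts.append({"PartNumber": 2, "ETag": mid["ETag"]})

    # Part 3: server-side copy of the unchanged suffix.
    end = offset + len(new_bytes)
    tail = s3.upload_part_copy(
        Bucket=bucket, Key=key, PartNumber=3, UploadId=upload_id,
        CopySource={"Bucket": bucket, "Key": key},
        CopySourceRange=f"bytes={end}-{total_size - 1}",
    )
    parts.append({"PartNumber": 3, "ETag": tail["CopyPartResult"]["ETag"]})

    s3.complete_multipart_upload(
        Bucket=bucket, Key=key, UploadId=upload_id,
        MultipartUpload={"Parts": parts},
    )

# e.g. rewrite 5 MiB in the middle of a 20 MiB object:
# rewrite_middle_part(b"\x00" * 5 * 2**20, offset=5 * 2**20, total_size=20 * 2**20)
```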
While local and memory storage are only 2 of many possible stores, I think they are actually extremely important to Zarr, and we should make every effort to ensure that we have really good performance with both of these storage backends. For sharded, uncompressed data, separate chunks in a shard can be concurrently written, and I think this is potentially extremely valuable for a lot of applications, as it alleviates one of the big downsides to sharding (loss of write parallelism). And I think selectively writing byte ranges would also be useful for in-memory storage, when we get around to tuning the performance of that. For this PR, I'd recommend putting a deprecation warning on Store.set_partial_writes.
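To make the uncompressed-shard case concrete, here is a minimal sketch using plain files and POSIX `os.pwrite` rather than the actual Zarr-Python sharding codec (file name and chunk size are arbitrary, and a real shard would also need its chunk index written):

```python
import os
from concurrent.futures import ThreadPoolExecutor

CHUNK_NBYTES = 1024 * 1024  # uncompressed chunks all have a fixed size
N_CHUNKS = 16

def write_chunk(fd: int, chunk_index: int, data: bytes) -> None:
    # With a fixed chunk size, each chunk's offset is known up front, so
    # disjoint chunks can be written concurrently with positional writes.
    assert len(data) == CHUNK_NBYTES
    os.pwrite(fd, data, chunk_index * CHUNK_NBYTES)

fd = os.open("shard.bin", os.O_RDWR | os.O_CREAT, 0o644)
os.ftruncate(fd, N_CHUNKS * CHUNK_NBYTES)  # preallocate the full shard
with ThreadPoolExecutor(max_workers=4) as pool:
    for i in range(N_CHUNKS):
        pool.submit(write_chunk, fd, i, bytes([i % 256]) * CHUNK_NBYTES)
os.close(fd)
```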
I would be curious what other Zarr maintainers think about the value of keeping an expanded API for this hypothetical use case (sharded stores without compression in local or memory stores). In my opinion, local and memory stores are very niche, and mostly relevant for testing. There are better file formats than Zarr for uncompressed data that fits on a single machine (e.g., HDF5, NPZ). Note that it's only …
There are lots of zarr users who never touch cloud data. Such users have even told me that, for them, cloud storage is niche! So local storage is definitely used for more than testing. And as for memory storage, there are a lot of internal operations that could be refactored as store -> store transformations, where the source store is a memory store (for example, creating a set of metadata documents in memory storage, then copying that memory store to an external store). In this case, we want memory storage to be very performant.
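A sketch of that store-to-store staging pattern, with a plain dict standing in for a memory store (the real Store interface is async and Buffer-based, so this only shows the shape of the idea; all keys and document contents are made up):

```python
# Stage metadata documents in a cheap in-memory mapping, then copy the
# whole staging store to the destination in one pass.
staging: dict[str, bytes] = {}

# Placeholder documents -- real Zarr v3 metadata has more fields.
staging["zarr.json"] = b'{"zarr_format": 3, "node_type": "group"}'
for name in ("a", "b"):
    staging[f"{name}/zarr.json"] = b'{"zarr_format": 3, "node_type": "array"}'

def copy_store(src: dict[str, bytes], dest: dict[str, bytes]) -> None:
    # For a remote destination, this loop would issue the store's own
    # set() calls instead of dict assignment.
    for key, value in src.items():
        dest[key] = value

destination: dict[str, bytes] = {}
copy_store(staging, destination)
assert destination.keys() == staging.keys()
```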
This is a key point I had overlooked -- only …
That part of the spec is not intended to be normative, so …

It certainly is a …
I'm now in favor of the changes here. I think it's very unlikely that this causes any problems for downstream libraries. If/when we do get around to supporting byte-range writes on niche filesystems 😉, we can safely change the signature of …