Codec pipeline memory usage #2904
https://github.com/TomAugspurger/zarr-python-memory-benchmark/blob/4039ba687452d65eef081bce1d4714165546422a/sol.py#L41 has a POC for this approach, and https://github.com/TomAugspurger/zarr-python-memory-benchmark/blob/4039ba687452d65eef081bce1d4714165546422a/sol.py#L63 shows an example reading a Zstd-compressed dataset. https://rawcdn.githack.com/TomAugspurger/zarr-python-memory-benchmark/3567246b852d7adacbc10f32a58b0b3f6ac3d50b/reports/memray-flamegraph-sol-read-compressed.html shows that the peak memory usage is roughly the size of the compressed dataset plus the output ndarray (this does all the decompression first; we could do those sequentially to lower the peak memory usage). There are some complications around slices that don't align with zarr chunk boundaries that this ignores, but it's maybe enough to prove that we could do better.
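For concreteness, a minimal sketch of the sequential variant (the function name, the dict-like chunk store, and the 1-D aligned-chunk layout are all assumptions, not the POC's actual code): decompress one chunk at a time directly into the matching slice of the output, so the peak is roughly the output array plus a single compressed chunk.

```python
import numpy as np
from numcodecs import Zstd

def read_compressed_sequential(chunks: dict[int, bytes], shape: tuple[int],
                               dtype, chunk_len: int) -> np.ndarray:
    """Hypothetical 1-D reader: decompress each chunk straight into `out`.

    Assumes shape[0] is an exact multiple of chunk_len, so every chunk's
    decompressed size matches its destination slice exactly.
    """
    codec = Zstd()
    out = np.empty(shape, dtype=dtype)
    for start in range(0, shape[0], chunk_len):
        cdata = chunks[start // chunk_len]      # compressed bytes for one chunk
        dest = out[start:start + chunk_len]     # aligned, contiguous output slice
        codec.decode(cdata, out=dest)           # numcodecs decodes into `dest`
        del cdata                               # drop compressed bytes before the next chunk
    return out
```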
Thanks for doing this work @TomAugspurger! Coincidentally, I've been looking at memory overheads for Zarr storage operations across different filesystems (local/cloud), compression settings, and Zarr versions: https://github.com/tomwhite/memray-array
Just reducing the number of buffer copies for aligned slices would be a big win for everyone who uses Zarr, since it would improve performance and reduce memory pressure. Hopefully similar techniques could be used for cloud storage too.
Very cool!
I was wondering about this while looking into the performance of obstore and KvikIO. KvikIO lets the caller provide the buffer to read into.
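For what it's worth, a plain local file already supports the caller-provided-buffer pattern via `readinto`. A small sketch (the function name is made up, and this is not the obstore or KvikIO API):

```python
import numpy as np

def read_chunk_into(path: str, out: np.ndarray, offset: int = 0) -> None:
    """Hypothetical store-level readinto: fill `out` in place.

    The file's bytes land directly in the array's buffer via the buffer
    protocol, with no intermediate `bytes` object or extra copy.
    """
    view = memoryview(out).cast("B")        # writable byte view of a contiguous array
    with open(path, "rb", buffering=0) as f:
        f.seek(offset)
        n = f.readinto(view)
        if n != len(view):
            raise IOError(f"short read: {n} of {len(view)} bytes")
```

A `Store.readinto`-style method could follow the same shape: the caller hands over (an appropriate slice of) the destination buffer and the store fills it.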
I wonder if any of the memory management machinery that has been developed for Apache Arrow would be of use here?
I looked into implementing this today and it'll be a decent amount of effort. There are some issues in the interface provided by the codec pipeline ABC. Beyond the codec pipeline, I think we'll also need to update the Store API as well.
Not the first person! I did make it out alive, but only barely.
On the weekly call today, Davis asked whether we could do zero-copy read / decompression for variable-width / variable-length types. For fixed-size types, we can derive the output buffer size from the dtype's item size and the chunk shape. For zero-copy decompression we just need the size of the final buffer. Libraries like pyarrow always(?) know this for their variable-sized buffers. I think this would be possible to support, but we'd need to ensure that the metadata includes the chunk size. This would be an example of a chunk-level statistic (zarr-developers/zarr-specs#319).
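To spell out the fixed-size case (a sketch with example numbers; the variable-width case is exactly what would need a recorded per-chunk size):

```python
import math
import numpy as np

# Fixed-size dtype: the decompressed chunk size is fully determined by metadata,
# so a destination buffer can be allocated (or sliced) up front.
chunk_shape = (1000, 1000)                         # example chunk shape
dtype = np.dtype("float64")
nbytes = math.prod(chunk_shape) * dtype.itemsize   # 8_000_000 bytes per chunk

# Variable-width dtypes: nbytes cannot be derived from shape + dtype alone, so
# zero-copy decompression would need the decompressed size stored per chunk
# (e.g. as a chunk-level statistic).
```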
Zarr Python v2 actually does this already: https://github.com/zarr-developers/zarr-python/blob/support/v2/zarr/core.py#L2044-L2050
FYI here's my hacky attempt to do something similar in v3: tomwhite@9b5e7fc
Nice work. https://github.com/TomAugspurger/zarr-python/blob/tom/zero-copy-codec-pipeline/tests/test_memory_usage.py has the start of a test that uses tracemalloc to verify we aren't making unexpected NumPy array allocations, assuming the decompressor isn't allocating memory through NumPy internally.
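Roughly, such a test could look like the sketch below (hypothetical; the linked test may differ, and the 1.5× budget is an arbitrary placeholder). Note that tracemalloc only sees allocations routed through Python's tracked allocators, which NumPy's data buffers are, but a C-level decompressor's internal allocations are not.

```python
import tracemalloc
import zarr

def test_read_peak_memory(tmp_path):
    """Hypothetical check: peak traced allocations during a read stay close
    to the size of the output array."""
    arr = zarr.open(str(tmp_path / "a.zarr"), mode="w",
                    shape=(1000, 1000), chunks=(100, 1000), dtype="f8")
    arr[:] = 1.0

    tracemalloc.start()
    _ = arr[:]                                  # the read under test
    _, peak = tracemalloc.get_traced_memory()   # bytes at the high-water mark
    tracemalloc.stop()

    expected = 1000 * 1000 * 8                  # one float64 output array
    assert peak < 1.5 * expected, f"peak {peak} bytes exceeds budget"
```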
We discussed memory usage on Friday's community call. https://github.com/TomAugspurger/zarr-python-memory-benchmark has some initial benchmarks.
https://rawcdn.githack.com/TomAugspurger/zarr-python-memory-benchmark/refs/heads/main/reports/memray-flamegraph-read-uncompressed.html has the memray flamegraph for reading an uncompressed array (400 MB total, split into 10 chunks of 40 MB each). I think the optimal memory usage here is about 400 MB. Our peak memory is about 2x that.
https://rawcdn.githack.com/TomAugspurger/zarr-python-memory-benchmark/refs/heads/main/reports/memray-flamegraph-read-compressed.html has the zstd compressed version. Peak memory is about 1.1 GiB.
I haven't looked too closely at the code, but I wonder if we could be smarter about a few things in certain cases:
- `readinto` directly into (an appropriate slice of) the `out` array. We might need to expand the Store API to add some kind of `readinto`, where the user provides the buffer to read into rather than the store allocating new memory.
- `zstd.decode` takes an output buffer here that we could maybe use. And past that point, maybe all the codecs could reuse one or two buffers, rather than allocating a new buffer for each stage of the codec (one buffer if doing stuff in place, two buffers if something can't be done in place)? A rough sketch of that idea is below.

I'm not too familiar with the codec pipeline stuff, but will look into this as I have time. Others should feel free to take this if someone gets an itch though. There's some work to be done :)
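To illustrate the buffer-reuse idea only (this is not the current `CodecPipeline` interface; the `decode(src, out)` convention is an assumption): two scratch buffers can ping-pong between stages, so the pipeline allocates twice instead of once per codec.

```python
import numpy as np

def decode_with_two_buffers(codecs, cdata, out_nbytes: int):
    """Hypothetical decode path reusing two scratch buffers.

    Assumes each codec exposes ``decode(src, out)`` that writes its result
    into ``out`` and returns the portion it filled (a view). Both buffers
    are sized for the largest intermediate result.
    """
    buf_a = np.empty(out_nbytes, dtype="uint8")
    buf_b = np.empty(out_nbytes, dtype="uint8")
    src, dst, spare = cdata, buf_a, buf_b
    for codec in reversed(codecs):      # decoding applies the codecs in reverse order
        src = codec.decode(src, dst)    # write into dst, then read it as the next input
        dst, spare = spare, dst         # ping-pong the buffers for the next stage
    return src
```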