Codec pipeline memory usage #2904
https://github.com/TomAugspurger/zarr-python-memory-benchmark/blob/4039ba687452d65eef081bce1d4714165546422a/sol.py#L41 has a POC for this approach, and https://github.com/TomAugspurger/zarr-python-memory-benchmark/blob/4039ba687452d65eef081bce1d4714165546422a/sol.py#L63 shows an example reading a Zstd-compressed dataset. https://rawcdn.githack.com/TomAugspurger/zarr-python-memory-benchmark/3567246b852d7adacbc10f32a58b0b3f6ac3d50b/reports/memray-flamegraph-sol-read-compressed.html shows that the peak memory usage is roughly the size of the compressed dataset plus the output ndarray (this does all the decompression first; we could do those sequentially to lower the peak memory usage). There are some complications around slices that don't align with zarr chunk boundaries that this ignores, but it's maybe enough to prove that we could do better.
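For concreteness, a minimal sketch of the sequential variant (the function name, the dict-like chunk store, and the 1-D aligned-chunk layout are all assumptions, not the POC's actual code): decompress one chunk at a time directly into the matching slice of the output, so the peak is roughly the output array plus a single compressed chunk.

```python
import numpy as np
from numcodecs import Zstd

def read_compressed_sequential(chunks: dict[int, bytes], shape: tuple[int],
                               dtype, chunk_len: int) -> np.ndarray:
    """Hypothetical 1-D reader: decompress each chunk straight into `out`.

    Assumes shape[0] is an exact multiple of chunk_len, so every chunk's
    decompressed size matches its destination slice exactly.
    """
    codec = Zstd()
    out = np.empty(shape, dtype=dtype)
    for start in range(0, shape[0], chunk_len):
        cdata = chunks[start // chunk_len]      # compressed bytes for one chunk
        dest = out[start:start + chunk_len]     # aligned, contiguous output slice
        codec.decode(cdata, out=dest)           # numcodecs decodes into `dest`
        del cdata                               # drop compressed bytes before the next chunk
    return out
```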
Thanks for doing this work @TomAugspurger! Coincidentally, I've been looking at memory overheads for Zarr storage operations across different filesystems (local/cloud), compression settings, and Zarr versions: https://github.com/tomwhite/memray-array
Just reducing the number of buffer copies for aligned slices would be a big win for everyone who uses Zarr, since it would improve performance and reduce memory pressure. Hopefully similar techniques could be used for cloud storage too.
Very cool!
I was wondering about this while looking into the performance of obstore and KvikIO. KvikIO lets the caller provide the buffer to read into.
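For what it's worth, a plain local file already supports the caller-provided-buffer pattern via `readinto`. A small sketch (the function name is made up, and this is not the obstore or KvikIO API):

```python
import numpy as np

def read_chunk_into(path: str, out: np.ndarray, offset: int = 0) -> None:
    """Hypothetical store-level readinto: fill `out` in place.

    The file's bytes land directly in the array's buffer via the buffer
    protocol, with no intermediate `bytes` object or extra copy.
    """
    view = memoryview(out).cast("B")        # writable byte view of a contiguous array
    with open(path, "rb", buffering=0) as f:
        f.seek(offset)
        n = f.readinto(view)
        if n != len(view):
            raise IOError(f"short read: {n} of {len(view)} bytes")
```

A `Store.readinto`-style method could follow the same shape: the caller hands over (an appropriate slice of) the destination buffer and the store fills it.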
I wonder if any of the memory management machinery that has been developed for Apache Arrow would be of use here?
I looked into implementing this today and it'll be a decent amount of effort. There are some issues in the interface provided by the codec pipeline ABC. Beyond the codec pipeline, I think we'll also need to update the Store API as well.
Not the first person! I did make it out alive, but only barely.
On the weekly call today, Davis asked whether we could do zero-copy read / decompression for variable-width / variable-length types. For fixed-size types, we can derive the output buffer size from the dtype's item size and the chunk shape. For zero-copy decompression we just need the size of the final buffer. Libraries like pyarrow always(?) know this for their variable-sized buffers. I think this would be possible to support, but we'd need to ensure that the metadata includes the chunk size. This would be an example of a chunk-level statistic (zarr-developers/zarr-specs#319).
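To spell out the fixed-size case (a sketch with example numbers; the variable-width case is exactly what would need a recorded per-chunk size):

```python
import math
import numpy as np

# Fixed-size dtype: the decompressed chunk size is fully determined by metadata,
# so a destination buffer can be allocated (or sliced) up front.
chunk_shape = (1000, 1000)                         # example chunk shape
dtype = np.dtype("float64")
nbytes = math.prod(chunk_shape) * dtype.itemsize   # 8_000_000 bytes per chunk

# Variable-width dtypes: nbytes cannot be derived from shape + dtype alone, so
# zero-copy decompression would need the decompressed size stored per chunk
# (e.g. as a chunk-level statistic).
```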
Zarr Python v2 actually does this already: https://github.com/zarr-developers/zarr-python/blob/support/v2/zarr/core.py#L2044-L2050
FYI here's my hacky attempt to do something similar in v3: tomwhite@9b5e7fc
Nice work. https://github.com/TomAugspurger/zarr-python/blob/tom/zero-copy-codec-pipeline/tests/test_memory_usage.py has the start of a test that uses tracemalloc to verify we aren't making unexpected NumPy array allocations, assuming the decompressor isn't allocating memory through NumPy internally.
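Roughly, such a test could look like the sketch below (hypothetical; the linked test may differ, and the 1.5× budget is an arbitrary placeholder). Note that tracemalloc only sees allocations routed through Python's tracked allocators, which NumPy's data buffers are, but a C-level decompressor's internal allocations are not.

```python
import tracemalloc
import zarr

def test_read_peak_memory(tmp_path):
    """Hypothetical check: peak traced allocations during a read stay close
    to the size of the output array."""
    arr = zarr.open(str(tmp_path / "a.zarr"), mode="w",
                    shape=(1000, 1000), chunks=(100, 1000), dtype="f8")
    arr[:] = 1.0

    tracemalloc.start()
    _ = arr[:]                                  # the read under test
    _, peak = tracemalloc.get_traced_memory()   # bytes at the high-water mark
    tracemalloc.stop()

    expected = 1000 * 1000 * 8                  # one float64 output array
    assert peak < 1.5 * expected, f"peak {peak} bytes exceeds budget"
```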
We discussed memory usage on Friday's community call. https://github.com/TomAugspurger/zarr-python-memory-benchmark has some initial benchmarks.
https://rawcdn.githack.com/TomAugspurger/zarr-python-memory-benchmark/refs/heads/main/reports/memray-flamegraph-read-uncompressed.html has the memray flamegraph for reading an uncompressed array (400 MB total, split into 10 chunks of 40 MB each). I think the optimal memory usage here is about 400 MB. Our peak memory is about 2x that.
https://rawcdn.githack.com/TomAugspurger/zarr-python-memory-benchmark/refs/heads/main/reports/memray-flamegraph-read-compressed.html has the zstd compressed version. Peak memory is about 1.1 GiB.
I haven't looked too closely at the code, but I wonder if we could be smarter about a few things in certain cases:
- `readinto` directly into (an appropriate slice of) the `out` array. We might need to expand the Store API to add some kind of `readinto`, where the user provides the buffer to read into rather than the store allocating new memory.
- `zstd.decode` takes an output buffer here that we could maybe use. And past that point, maybe all the codecs could reuse one or two buffers, rather than allocating a new buffer for each stage of the codec (one buffer if doing stuff in place, two buffers if something can't be done in place)? A rough sketch of that idea is below.

I'm not too familiar with the codec pipeline stuff, but will look into this as I have time. Others should feel free to take this if someone gets an itch though. There's some work to be done :)
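To illustrate the buffer-reuse idea only (this is not the current `CodecPipeline` interface; the `decode(src, out)` convention is an assumption): two scratch buffers can ping-pong between stages, so the pipeline allocates twice instead of once per codec.

```python
import numpy as np

def decode_with_two_buffers(codecs, cdata, out_nbytes: int):
    """Hypothetical decode path reusing two scratch buffers.

    Assumes each codec exposes ``decode(src, out)`` that writes its result
    into ``out`` and returns the portion it filled (a view). Both buffers
    are sized for the largest intermediate result.
    """
    buf_a = np.empty(out_nbytes, dtype="uint8")
    buf_b = np.empty(out_nbytes, dtype="uint8")
    src, dst, spare = cdata, buf_a, buf_b
    for codec in reversed(codecs):      # decoding applies the codecs in reverse order
        src = codec.decode(src, dst)    # write into dst, then read it as the next input
        dst, spare = spare, dst         # ping-pong the buffers for the next stage
    return src
```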