[v3] Generalized NDArray support #1751
I would propose renaming this issue to “Proposal for adding generalized ndarray support to zarr”.
### How to adapt zarr v3 implementations using this interface

#### Changes to codec pipelines

No change to how the codec pipeline is structured:

```python
for bb_codec, chunk_spec_batch in bb_codecs_with_spec[::-1]:
    chunk_bytes_batch = await bb_codec.decode_batch(
        zip(chunk_bytes_batch, chunk_spec_batch), runtime_configuration
    )

ab_codec, chunk_spec_batch = ab_codec_with_spec
chunk_array_batch = await ab_codec.decode_batch(
    zip(chunk_bytes_batch, chunk_spec_batch), runtime_configuration
)

for aa_codec, chunk_spec_batch in aa_codecs_with_spec[::-1]:
    chunk_array_batch = await aa_codec.decode_batch(
        zip(chunk_array_batch, chunk_spec_batch), runtime_configuration
    )
```

#### Changes to codecs

The codec definitions would have to change a bit, though. These would all accept this new `zarr.ndarray` type:

```python
class ArrayBytesCodec(Codec):
    @abstractmethod
    async def decode(
        self,
        chunk_bytes: zarr.ndarray,
        chunk_spec: ArraySpec,
        runtime_configuration: RuntimeConfiguration,
    ) -> zarr.ndarray:
        pass

    @abstractmethod
    async def encode(
        self,
        chunk_array: zarr.ndarray,
        chunk_spec: ArraySpec,
        runtime_configuration: RuntimeConfiguration,
    ) -> Optional[zarr.ndarray]:
        pass
```

#### Changes to stores

Stores would also need to accept/return `zarr.ndarray`:

```python
class Store(ABC):
    @abstractmethod
    async def get(
        self, key: str, byte_range: Optional[Tuple[int, Optional[int]]] = None
    ) -> Optional[zarr.ndarray]:
        """Retrieve the value associated with a given key.

        Parameters
        ----------
        key : str
        byte_range : tuple[int, Optional[int]], optional

        Returns
        -------
        zarr.ndarray
        """
        ...

    @abstractmethod
    async def set(self, key: str, value: zarr.ndarray) -> None:
        """Store a (key, value) pair.

        Parameters
        ----------
        key : str
        value : zarr.ndarray
        """
        ...
```

#### User-facing APIs using this interface

Option 1: Return/accept `zarr.ndarray`
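As a rough illustration of what a generalized `zarr.ndarray` could look like in pure Python, one option is a structural protocol that both numpy and cupy arrays already satisfy, so either can flow through codecs and stores unchanged. The names `NDArrayLike` and `as_ndarray_like` below are assumptions for the sketch, not part of the proposal:

```python
# Hypothetical sketch: a duck-typed "ndarray" protocol that numpy and cupy
# arrays both satisfy; all names here are illustrative assumptions.
from typing import Protocol, Tuple, runtime_checkable

import numpy as np


@runtime_checkable
class NDArrayLike(Protocol):
    """Minimal structural interface shared by numpy/cupy ndarrays."""

    @property
    def shape(self) -> Tuple[int, ...]: ...

    @property
    def dtype(self): ...

    def reshape(self, *shape): ...

    def view(self, dtype): ...


def as_ndarray_like(obj) -> "NDArrayLike":
    """Validate that ``obj`` can flow through the codec pipeline unchanged."""
    if not isinstance(obj, NDArrayLike):
        raise TypeError(f"{type(obj)!r} does not look like an ndarray")
    return obj
```

With `@runtime_checkable`, `isinstance` only checks for the presence of the listed attributes, which is exactly the loose coupling a generalized array type needs.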
Thanks @akshaysubr! Your proposal looks good. If there is a way we could achieve the same results in pure Python, that would be preferred; I don't know whether that is possible with the currently available library APIs. With the v3 refactoring, we aim to keep the end-user API as close to v2 as possible. Option 1 would be a significant change in that users would need to write …
Thanks for the feedback @normanrz. I think we should be able to do this in pure Python, but we would need to add cupy as a dependency to zarr. That could perhaps be made optional for CPU-only installs, raising an appropriate exception when GPU functionality is accessed?
Thanks, that sounds great! Adding cupy as an optional dependency would be a good solution.
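One common pattern for the optional-dependency approach discussed above (an illustration, not zarr's actual implementation) is to import cupy lazily and raise a clear error only when GPU functionality is actually requested. The helper name `get_array_module` is an assumption:

```python
# Optional GPU dependency pattern: a CPU-only install works until GPU
# functionality is explicitly requested.
try:
    import cupy
except ImportError:  # CPU-only install
    cupy = None


def get_array_module(use_gpu: bool = False):
    """Return cupy when GPU support is requested and available, else numpy."""
    import numpy

    if not use_gpu:
        return numpy
    if cupy is None:
        raise ImportError(
            "GPU support requested but cupy is not installed; "
            "install the optional GPU extras"
        )
    return cupy
```

Callers then write `xp = get_array_module(use_gpu)` and use `xp.empty(...)`, `xp.asarray(...)`, etc., without branching on the backend themselves.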
I propose we take it a step further and implement an output-by-caller policy. That is, all components in zarr-python would take an output buffer/array argument and write their results into that buffer/array instead of returning a new one.

Cons

Pros

I suggest that we introduce an `NDBufferView` class.

#### Changes to codecs and stores

```python
class ArrayBytesCodec(Codec):
    @abstractmethod
    async def decode(
        self,
        chunk_input: NDBufferView,
        chunk_output: NDBufferView,
        chunk_spec: ArraySpec,
        runtime_configuration: RuntimeConfiguration,
    ) -> None:
        pass

    @abstractmethod
    async def encode(
        self,
        chunk_input: NDBufferView,
        chunk_output: NDBufferView,
        chunk_spec: ArraySpec,
        runtime_configuration: RuntimeConfiguration,
    ) -> None:
        pass


class Store(ABC):
    @abstractmethod
    async def get(
        self,
        key: str,
        output: NDBufferView,
        byte_range: Optional[Tuple[int, Optional[int]]] = None,
    ) -> None:
        ...

    @abstractmethod
    async def set(self, key: str, value: NDBufferView) -> None:
        ...
```

#### User-facing APIs using this interface

The user can set the output buffer explicitly when calling a …
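To make the output-by-caller convention concrete, here is a minimal sketch with a plain numpy array standing in for the proposed `NDBufferView` (the function name and shapes are assumptions for illustration):

```python
# Sketch of the output-by-caller calling convention: the codec writes into a
# caller-owned buffer and allocates nothing itself.
import numpy as np


def decode_into(chunk_input: np.ndarray, chunk_output: np.ndarray) -> None:
    """A trivial stand-in 'codec' that decodes into the caller's buffer."""
    if chunk_output.nbytes < chunk_input.nbytes:
        raise ValueError("caller-provided output buffer is too small")
    # Write in place; no new allocation happens inside the codec.
    np.copyto(chunk_output.reshape(chunk_input.shape), chunk_input)


# The caller owns the allocation and can reuse it across chunks:
out = np.empty((4,), dtype=np.int64)
decode_into(np.arange(4, dtype=np.int64), out)
```

Because the caller controls allocation, the same buffer can be reused across chunks or placed in pinned/GPU memory, which is the main appeal of this policy.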
@madsbk The output-by-caller policy is definitely very useful, especially from a copy-optimization and memory-allocator standpoint. The one con that you called out is where I am stuck right now. This challenge was also discussed in this numcodecs issue: zarr-developers/numcodecs#316

For each codec, we can implement what you're suggesting for the write/encode path by having each compressor provide an upper bound for the compressed size given the uncompressed size. Almost all compressors do this already, and we'd just have to expose that in the Codec API. But for the read/decode path, this is much harder, since we don't keep track of the compression ratio or compressed size at each codec stage in the codec pipeline. We could potentially solve that by adding this additional metadata through a zarr spec extension, but in the existing version of the spec, that doesn't seem feasible.

This decode path is a general issue for compressors, and finding a good longer-term solution would be very useful. For example, the numcodecs LZ4 codec currently adds the compressed size as a 4-byte header, but this makes the LZ4 codec streams incompatible with other LZ4 decompressor implementations. The Gzip codec speculatively allocates some memory, then pauses decompression to allocate a larger buffer and copy over data before continuing. Most of the GPU codec implementations decompress once without writing anything out just to find the output size, so the caller can allocate that much memory, and then re-run decompression with the output buffer.
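The encode-path idea above can be sketched as follows. The formula mirrors LZ4's `LZ4_COMPRESSBOUND` macro (worst-case output size for a given input size); exposing a per-codec bound like this in the Codec API is the suggestion, and the function name here is an assumption:

```python
# Illustration: each compressor exposes a worst-case compressed size so the
# caller can preallocate the output buffer once, up front.
def lz4_compress_bound(input_size: int) -> int:
    """Worst-case compressed size for an LZ4 block of ``input_size`` bytes,
    following LZ4's LZ4_COMPRESSBOUND macro."""
    return input_size + input_size // 255 + 16


# The caller allocates once and hands the buffer to the codec:
src_size = 1 << 20                       # 1 MiB uncompressed chunk
dst_capacity = lz4_compress_bound(src_size)
assert dst_capacity >= src_size          # incompressible data still fits
```

No equivalent bound exists for the decode path, which is exactly the asymmetry described above: the decompressed size must either be recorded out of band (metadata, headers) or discovered by speculative or two-pass decompression.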
I think we can close this issue now that #1910 has been merged? |
Recent discussions at the Refactor meeting have indicated that adding generalized array support to the v3 roadmap would be useful.
@akshaysubr - can I ask you to take this ticket and push it forward?