[v3] Batch array / group access #1805

@d-v-b

In v3, since the storage API is asynchronous, we can open multiple arrays or groups concurrently. This would be broadly useful, but zarr-python v2 gives us no good template to extrapolate from, so we have to invent something new here (new relative to zarr-python, that is).

Over in #1804 @martindurant brought this up, and I suggested something like this:

from typing import Any, Literal

# Each function takes several paths, so each returns one node per path.
def open_nodes(store: Store, paths: tuple[str, ...], options: dict[Literal["array", "group"], dict[str, Any]]) -> tuple[Array | Group, ...]:
  ...

def open_arrays(store: Store, paths: tuple[str, ...], options: dict[str, Any]) -> tuple[Array, ...]:
  ...

def open_groups(store: Store, paths: tuple[str, ...], options: dict[str, Any]) -> tuple[Group, ...]:
  ...

I was imagining that the arguments to these functions would be the paths of arrays / groups anywhere in a Zarr hierarchy; we could also have a group.open_groups() method which can only "see" sub-groups, and similarly for group.open_arrays().
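To make the batch idea concrete, here is a minimal sketch of how `open_nodes` could fan out metadata reads with `asyncio.gather`. Everything in it is an assumption for illustration: `FakeStore` is a hypothetical in-memory stand-in rather than zarr-python's `Store`, `open_node` is a made-up helper, and the "node" returned is just the parsed `zarr.json` document rather than a real `Array` or `Group`.

```python
import asyncio
import json
from typing import Any


class FakeStore:
    """Hypothetical in-memory async store; NOT zarr-python's Store class."""

    def __init__(self, docs: dict[str, dict[str, Any]]) -> None:
        # One metadata document per node, keyed like a v3 hierarchy.
        self._data = {
            f"{path}/zarr.json": json.dumps(doc).encode()
            for path, doc in docs.items()
        }
        self.in_flight = 0
        self.max_in_flight = 0  # records how many reads overlapped

    async def get(self, key: str) -> bytes:
        self.in_flight += 1
        self.max_in_flight = max(self.max_in_flight, self.in_flight)
        try:
            await asyncio.sleep(0.01)  # simulate storage latency
            return self._data[key]
        finally:
            self.in_flight -= 1


async def open_node(store: FakeStore, path: str) -> dict[str, Any]:
    """Hypothetical helper: fetch and parse one node's metadata document."""
    raw = await store.get(f"{path}/zarr.json")
    return json.loads(raw)


async def open_nodes(
    store: FakeStore, paths: tuple[str, ...]
) -> tuple[dict[str, Any], ...]:
    """Open all paths concurrently; results come back in the order of `paths`."""
    return tuple(await asyncio.gather(*(open_node(store, p) for p in paths)))


store = FakeStore(
    {
        "a": {"node_type": "array"},
        "g": {"node_type": "group"},
        "g/b": {"node_type": "array"},
    }
)
nodes = asyncio.run(open_nodes(store, ("a", "g", "g/b")))
```

`asyncio.gather` preserves the order of its inputs, so callers can zip `paths` with the result. A real version would still need to decide how a single missing path should be reported: fail the whole batch, or return per-path errors.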

An alternative would be to use a more general transactional context manager:

async with transaction(store) as tx:
    a1_maybe = tx.open_array(...)
    a2_maybe = tx.open_array(...)
    # IO gets run concurrently in `__aexit__`

a1 = a1_maybe.result()
a2 = a2_maybe.result()

I'm a lot less sure of this second design, since I have never implemented anything like it. For example, should we use futures for the results of tx.open_array()?
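One possible answer to the futures question, as a rough sketch only: `tx.open_array()` hands back an `asyncio.Future` immediately, and `__aexit__` runs all recorded requests together and resolves each future. The `Transaction` class, its `opener` argument, and `fake_open` are all hypothetical names invented for this example; none of them exist in zarr-python.

```python
import asyncio
from typing import Any, Awaitable, Callable


class Transaction:
    """Hypothetical: records open requests, runs them together on exit."""

    def __init__(self, opener: Callable[[str], Awaitable[Any]]) -> None:
        self._opener = opener
        self._pending: list[tuple[str, asyncio.Future]] = []

    async def __aenter__(self) -> "Transaction":
        return self

    def open_array(self, path: str) -> asyncio.Future:
        # No IO happens here; we only hand back a placeholder.
        fut = asyncio.get_running_loop().create_future()
        self._pending.append((path, fut))
        return fut

    async def __aexit__(self, exc_type, exc, tb) -> bool:
        if exc_type is not None:
            # The body raised: cancel placeholders instead of doing IO.
            for _, fut in self._pending:
                fut.cancel()
            return False
        # Run every request concurrently; keep per-request failures separate.
        results = await asyncio.gather(
            *(self._opener(path) for path, _ in self._pending),
            return_exceptions=True,
        )
        for (_, fut), res in zip(self._pending, results):
            if isinstance(res, BaseException):
                fut.set_exception(res)
            else:
                fut.set_result(res)
        return False


async def fake_open(path: str) -> str:
    """Stand-in for real array opening; fails for the path 'missing'."""
    await asyncio.sleep(0.01)
    if path == "missing":
        raise KeyError(path)
    return f"<Array {path}>"


async def main() -> tuple[asyncio.Future, asyncio.Future]:
    async with Transaction(fake_open) as tx:
        a1_maybe = tx.open_array("a1")
        a2_maybe = tx.open_array("missing")
    return a1_maybe, a2_maybe


a1_maybe, a2_maybe = asyncio.run(main())
a1 = a1_maybe.result()  # a2_maybe.result() would re-raise the KeyError
```

With futures, one failed open does not poison the others: each error surfaces only when that particular `.result()` is called. The cost is that callers can touch a future before the block exits, which raises `asyncio.InvalidStateError`, so the "maybe" naming in the issue is doing real work.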

Are there other ideas, or examples from other implementations / domains we could draw from?

Labels: enhancement (New features or improvements)