Add an asynchronous load method? #10326

Open
TomNicholas opened this issue May 16, 2025 · 1 comment · May be fixed by #10327
Comments

@TomNicholas
Member

TomNicholas commented May 16, 2025

Is your feature request related to a problem?

Currently all xarray .load() calls are blocking, so the only way to concurrently load data for a bunch of different xarray objects is to use dask. This comes up when loading data from high-latency backends such as Zarr on remote object storage.

Describe the solution you'd like

Now that zarr v3 has async get methods, it should be possible to add an async version of the .load() method that could be used like this:

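import asyncio

async def load_many_dataarrays_concurrently(dataarrays):
    # async_load() is the proposed method; it does not exist in xarray yet
    tasks = [da.async_load() for da in dataarrays]
    # run all loads concurrently and collect the results in input order
    results = await asyncio.gather(*tasks)
    return results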

For N zarr stores pointing to remote object storage, each with a latency of ~1s, this code could in theory take only ~1s in total, whereas the blocking equivalent (i.e. return [da.load() for da in dataarrays]) would take at least ~N seconds because each load waits for the previous one to finish.
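To make the ~1s versus ~N seconds estimate concrete, here is a minimal simulation. It does not use xarray or zarr at all: FakeRemoteArray is a hypothetical stand-in, async_load() is the proposed (not existing) method name, and the latency is shortened to 0.05s so the demo runs quickly.

```python
import asyncio
import time

LATENCY_S = 0.05  # stand-in for the ~1s per-store latency, shortened for the demo


class FakeRemoteArray:
    """Hypothetical stand-in for a DataArray backed by a high-latency store."""

    def __init__(self, name):
        self.name = name

    def load(self):
        # Blocking load: the caller waits out the full latency.
        time.sleep(LATENCY_S)
        return self.name

    async def async_load(self):
        # Proposed async counterpart: yields to the event loop while waiting.
        await asyncio.sleep(LATENCY_S)
        return self.name


def load_sequentially(arrays):
    return [a.load() for a in arrays]


async def load_concurrently(arrays):
    return await asyncio.gather(*(a.async_load() for a in arrays))


def timed(fn):
    # Return the function's result together with its wall-clock duration.
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


arrays = [FakeRemoteArray(f"store-{i}") for i in range(10)]
seq_result, seq_time = timed(lambda: load_sequentially(arrays))
conc_result, conc_time = timed(lambda: asyncio.run(load_concurrently(arrays)))
print(f"sequential: {seq_time:.2f}s, concurrent: {conc_time:.2f}s")
```

The sequential time should grow roughly linearly with the number of arrays, while the concurrent time should stay close to a single latency period. Real stores would add bandwidth and decoding costs that this sketch ignores.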

(Note this suggestion is not the same as #8965, which is about concurrently loading multiple variables behind the scenes, rather than exposing an async interface to the user.)

The new method could be da.async_load(), or even use an accessor namespace like da.async.load().

To make this work we would need to add an async version of BackendArray.get_duck_array

def get_duck_array(self, dtype: np.typing.DTypeLike = None):

and plumb that down through to zarr's AsyncArray methods somehow.
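As a rough illustration of what "an async version of get_duck_array" could look like, here is a sketch with entirely hypothetical classes. FakeAsyncStore stands in for an async array interface such as zarr's AsyncArray, SketchBackendArray is not xarray's actual BackendArray, async_get_duck_array is an assumed name, and plain lists stand in for duck arrays.

```python
import asyncio


class FakeAsyncStore:
    """Hypothetical async store; stands in for an async array interface."""

    def __init__(self, data):
        self._data = data

    async def getitem(self, key):
        await asyncio.sleep(0.01)  # simulated network latency
        return self._data[key]


class SketchBackendArray:
    """Hypothetical backend array, not xarray's actual BackendArray."""

    def __init__(self, store):
        self._store = store

    def get_duck_array(self):
        # Blocking path: drives the coroutine to completion itself.
        # Note asyncio.run cannot be called from inside a running event loop.
        return asyncio.run(self.async_get_duck_array())

    async def async_get_duck_array(self):
        # Async path (assumed name): awaits the store instead of blocking.
        return await self._store.getitem(slice(None))


async def main():
    arrays = [SketchBackendArray(FakeAsyncStore([i, i + 1])) for i in range(3)]
    # All three reads overlap their simulated latency.
    return await asyncio.gather(*(a.async_get_duck_array() for a in arrays))


loaded = asyncio.run(main())
print(loaded)
```

Having the blocking method delegate to the async one is just one possible design for keeping the two paths in sync; the actual plumbing in xarray and zarr may need to look quite different.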

Describe alternatives you've considered

Using dask adds significant overhead and additional complexity for what is essentially an I/O-concurrency problem. There may be some other way to do this that I'm not aware of.

Additional context

This is a desired-enough feature that other people have done it before in 3rd-party libraries, e.g. https://github.com/jeliashi/xarray-async. That particular implementation also targeted zarr, but predates the async get methods now available in zarr v3.

cc @dcherian @rabernat @jhamman @ianhi

TomNicholas linked a pull request on May 16, 2025 that will close this issue.
@TomNicholas
Member Author

Actually, the accessor syntax idea of having ds.async.load() is not possible because async is a reserved keyword in Python (since 3.7), so ds.async raises a SyntaxError. So it would have to be one of:

ds.async_.load()
ds.async_load()
ds.load_async()

or something like that.
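The keyword restriction can be checked without any xarray object by asking Python to parse each candidate spelling. This small sketch only compiles the expressions, so ds never needs to exist; is_valid_expression is a helper name introduced here for illustration.

```python
import keyword


def is_valid_expression(source):
    """Return True if `source` parses as a Python expression."""
    try:
        compile(source, "<candidate>", "eval")  # parse only; nothing is evaluated
    except SyntaxError:
        return False
    return True


candidates = [
    "ds.async.load()",
    "ds.async_.load()",
    "ds.async_load()",
    "ds.load_async()",
]
validity = {c: is_valid_expression(c) for c in candidates}

print("async is a keyword:", keyword.iskeyword("async"))
for candidate, ok in validity.items():
    print(f"{candidate!r}: {'ok' if ok else 'SyntaxError'}")
```

On Python 3.7 and later, only the first spelling should fail to parse.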
