Is your feature request related to a problem?
Currently all xarray .load() calls are blocking, so the only way to load data concurrently for a bunch of different xarray objects is to use dask. This comes up when loading data from high-latency backends such as Zarr on remote object storage.
Describe the solution you'd like
But now that zarr v3 has async get methods, it should be possible to add an async version of the .load() method that could be used like this:
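A minimal sketch of the kind of usage meant here, under stated assumptions: `async_load()` is a hypothetical method name (xarray's actual API may differ), and `FakeRemoteArray` is a stand-in class that simulates a high-latency store with `asyncio.sleep` so the example runs without xarray or a remote bucket.

```python
import asyncio
import time


class FakeRemoteArray:
    """Hypothetical stand-in for a DataArray backed by a high-latency store."""

    def __init__(self, name, latency=0.05):
        self.name = name
        self.latency = latency
        self.loaded = False

    async def async_load(self):
        # Hypothetical method name; the sleep simulates one remote round trip.
        await asyncio.sleep(self.latency)
        self.loaded = True
        return self


async def load_all(dataarrays):
    # Start every load at once and wait for all of them together.
    return await asyncio.gather(*(da.async_load() for da in dataarrays))


def timed_load(n=5, latency=0.05):
    # Helper for illustration: returns the loaded objects and the wall time.
    arrays = [FakeRemoteArray(f"store{i}", latency) for i in range(n)]
    start = time.perf_counter()
    results = asyncio.run(load_all(arrays))
    elapsed = time.perf_counter() - start
    return results, elapsed
```

Because the simulated waits overlap, the wall time of `timed_load(n, latency)` should stay close to one latency rather than growing with `n`.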
For N zarr stores pointing to remote object storage, each with a latency of ~1s, this code could in theory take only ~1s, whereas the blocking equivalent (i.e. return [da.load() for da in dataarrays]) would take at least ~N seconds.
(Note this suggestion is not the same as #8965, which is about concurrently loading multiple variables behind the scenes, rather than exposing an async interface to the user.)
The new method could be da.async_load(), or even use an accessor namespace like da.async.load().
To make this work we would need to add an async version of BackendArray.get_duck_array (xarray/xarray/backends/common.py, line 273 in c8affb3) and plumb that down through to zarr's AsyncArray methods somehow.
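One possible shape for that plumbing, as a rough sketch only. The classes below are hypothetical stand-ins, not xarray's actual BackendArray, and `async_get_duck_array` is an assumed name. The assumption illustrated is that a base class could fall back to running the blocking method in a worker thread, while a backend with native async reads overrides the coroutine.

```python
import asyncio


class SketchBackendArray:
    """Hypothetical stand-in, not xarray's actual BackendArray."""

    def __init__(self, data):
        self._data = data

    def get_duck_array(self):
        # Blocking path: return the values as an in-memory list.
        return list(self._data)

    async def async_get_duck_array(self):
        # Assumed default: run the blocking method in a worker thread so
        # backends without native async support still work.
        return await asyncio.to_thread(self.get_duck_array)


class SketchAsyncBackendArray(SketchBackendArray):
    """Hypothetical backend whose store can be awaited directly."""

    async def async_get_duck_array(self):
        # A real backend would await its store here (for zarr, something on
        # AsyncArray); asyncio.sleep(0) stands in for that await.
        await asyncio.sleep(0)
        return list(self._data)
```

Either variant can then be awaited the same way, for example `asyncio.run(SketchAsyncBackendArray([1, 2, 3]).async_get_duck_array())`.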
Describe alternatives you've considered
Using dask adds massive overhead and additional complexity. There may be some other way to do this that I'm not aware of.
Additional context
This is a desired-enough feature that other people have done it before in 3rd-party libraries, e.g. https://github.com/jeliashi/xarray-async. That particular implementation also targeted zarr, but predates the async get methods now available in zarr v3.
Actually the accessor syntax idea of having ds.async.load() is not possible because async is a reserved keyword in Python, so ds.async raises a SyntaxError. So it would have to be one of:
cc @dcherian @rabernat @jhamman @ianhi