-
-
Notifications
You must be signed in to change notification settings - Fork 361
Add async oindex and vindex methods to AsyncArray #3083
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3083 +/- ##
==========================================
+ Coverage 60.68% 60.73% +0.04%
==========================================
Files 78 78
Lines 9356 9407 +51
==========================================
+ Hits 5678 5713 +35
- Misses 3678 3694 +16
🚀 New features to boost your workflow:
|
@dcherian suggested making the sync oindex and vindex getitem methods call the new async versions. EDIT: I think this is already the case? |
@property | ||
def oindex(self) -> AsyncOIndex[T_ArrayMetadata]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I chose this API to try to follow this pattern:
Array.__getitem__
(exists)Array.oindex.__getitem__
(exists)Array.vindex.__getitem__
(exists)AsyncArray.getitem
(exists)AsyncArray.oindex.getitem
(new)AsyncArray.vindex.getitem
(new)
because python doesn't let you make an async version of the __getitem__
magic method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we update the sync versions to use sync(self._async_array.oindex.getitem)
instead of going to _get_selection
directly? That will get us some test coverage
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds very sensible, but when I try to refactor to do that I find that it causes a problem where sync()
gets called from within an async coroutine. I wouldn't have thought this should happen, but here's claude's explanation of why it does:
> Explain how exactly sync is being called from within an async coroutine
⏺ Looking at the error and the code, here's exactly how sync() is being called from within an async coroutine:
The Call Chain
1. Test calls zarr array indexing: za[zix] in test_indexing_with_zarr_array
2. Array.getitem (src/zarr/core/array.py:2566):
return self.vindex[cast("CoordinateSelection | MaskSelection", selection)]
3. VIndex.getitem (src/zarr/core/indexing.py:1248):
return sync(self.array._async_array.vindex.getitem(selection))
4. sync() tries to run the coroutine but detects it's already in a running loop
The Problem
The issue is in the __array__ method call chain:
1. VIndex.__getitem__ calls sync() with self.array._async_array.vindex.getitem(selection)
2. This eventually leads to AsyncArray._get_selection() being called
3. But somewhere in the process, Array.__array__() gets called (src/zarr/core/array.py:2413)
4. __array__() calls self[...] which goes back to Array.__getitem__
5. This creates a nested call where sync() is called while already inside an async context
The Root Cause
The function _zarr_array_to_int_or_bool_array() at line 85 calls np.asarray(arr), which triggers the __array__ protocol on the zarr array. This causes:
def _zarr_array_to_int_or_bool_array(arr: Array) -> npt.NDArray[np.intp] | npt.NDArray[np.bool_]:
if arr.dtype.kind in ("i", "b"):
return np.asarray(arr) # <-- This calls arr.__array__()
When np.asarray() is called on a zarr Array, it calls Array.__array__(), which calls self[...], which eventually calls sync() again - but we're already
inside a sync() call from the VIndex, creating the nested async context error.
The original code before the changes avoided this by handling the zarr array conversion within the sync methods directly, rather than delegating to async
methods that would create this nested sync situation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess for indexing with a Zarr array, we should convert to numpy array before the sync call
Shall we also have the sync array getitme methods use these async methods? zarr-python/src/zarr/core/array.py Lines 2785 to 2792 in 6fa9f37
|
Yea I would like to, but don't fully understand how to get that to work. So I thought I could leave that for a follow-up. |
OK but presumably that errors means async indexing with Zarr arrays also doesn't work (https://github.com/zarr-developers/zarr-python/pull/3083/files#r2231114456). Can you open an issue to track please? |
I didn't even know it was possible to index a zarr array with another zarr array!
Actually I just added a test that seems to show that indexing a zarr array with a (sync) zarr array does work. I also tried indexing a zarr array with the # try indexing with async zarr array
> result = await async_zarr.oindex.getitem(z2._async_array)
tests/test_indexing.py:2061:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
src/zarr/core/indexing.py:974: in getitem
return await self.array.get_orthogonal_selection(
src/zarr/core/array.py:1440: in get_orthogonal_selection
indexer = OrthogonalIndexer(selection, self.shape, self.metadata.chunk_grid)
src/zarr/core/indexing.py:878: in __init__
dim_indexer = BoolArrayDimIndexer(dim_sel, dim_len, dim_chunk_len)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <[AttributeError("'BoolArrayDimIndexer' object has no attribute 'dim_sel'") raised in repr()] BoolArrayDimIndexer object at 0x107e8e990>
dim_sel = <AsyncArray memory://4427184448/z2 shape=(2,) dtype=bool>, dim_len = 2, dim_chunk_len = 1
def __init__(self, dim_sel: npt.NDArray[np.bool_], dim_len: int, dim_chunk_len: int) -> None:
# check number of dimensions
if not is_bool_array(dim_sel, 1):
raise IndexError("Boolean arrays in an orthogonal selection must be 1-dimensional only")
# check shape
if dim_sel.shape[0] != dim_len:
raise IndexError(
f"Boolean array has the wrong length for dimension; expected {dim_len}, got {dim_sel.shape[0]}"
)
# precompute number of selected items for each chunk
nchunks = ceildiv(dim_len, dim_chunk_len)
chunk_nitems = np.zeros(nchunks, dtype="i8")
for dim_chunk_ix in range(nchunks):
dim_offset = dim_chunk_ix * dim_chunk_len
chunk_nitems[dim_chunk_ix] = np.count_nonzero(
> dim_sel[dim_offset : dim_offset + dim_chunk_len]
)
E TypeError: 'AsyncArray' object is not subscriptable
src/zarr/core/indexing.py:613: TypeError If that is supposed to work I can raise an issue for it, but it doesn't seem to be the same |
The Claude diagnosis points to a sync |
Not sure what to do about codecov, except add more tests |
Thanks for the extra tests! |
…3311) Co-authored-by: Tom Nicholas <[email protected]>
* new blank whatsnew * test async load using special zarr LatencyStore * don't use dask * async all the way down * remove assert False * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add pytest-asyncio to CI envs * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * assert results are identical * implement async load for dataarray and dataset * factor out common logic * consolidate tests via a parametrized fixture * async_load -> load_async * make BackendArray an ABC * explain how to add async support for any backend in the docs * add new methods to api docs * whatsnew * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix ci/minimum_versions.py * fix formatting * concurrently load different variables in ds.load_async using asyncio.gather * test concurrent loading of multiple variables in one dataset * fix non-awaited load_async * rearrange test order * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add test for orthogonal indexing * explicitly forbid orthogonal indexing * support async orthogonal indexing via zarr-developers/zarr-python#3083 * add test for vectorized indexing (even if it doesn't work) * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add test for basic indexing * correct test to actually use vectorized indexing * refactor to parametrize indexing test * implement async vectorized indexing * revert breaking change to BackendArray * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove indirection in _ensure_cached method * IndexingAdapters don't need async get * Add tests * Add decoding test * Add IndexingAdapter mixin * [cherry] Making decoding arrays lazy too * parametrized over isel and sel * mock zarr.AsyncArray.getitem in test * tidy up the mocking * ensure the correct zarr class's method is patched for each test * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add degenerate test case of no indexing * factor out the Latency part of LatencyStore * use mocks in multiple objects test * use mocks in multiple variables test * trim latencystore down to just what's needed to dodge zarr-developers/zarr-python#3105 (comment) * parametrizing indexing test over xarray classes * ensure we actually test vectorized indexing for Variable * use create_test_data * add @pytest.mark.asyncio * remove outdated readonly_store * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * enable tests to run when recent version of zarr-python is not available * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * separate tests to only run on correct versions of zarr * clear error message if async oindexing not available * parametrize over zarr_format * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add pytest-asyncio to other test CI env * fix some mypy errors * use method directly when possible * remove repeated API docs from bad merge * fix bad merge in release note * fix other bad merge in whatsnew * remove prints Co-authored-by: Deepak Cherian <[email protected]> * remove last print statement * test async basic indexing raises informative error before zarr-python v3.0.0 * test correct error message is raised for each indexing case * ensure each test runs on the earliest version of xaarr it can * remove pointless repeated getitem * set N_LAZY_VARS correctly in test * remove unused import * rename flag to make it more clear its only for orthogonal and vectorized indexing * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove IndexingAdapter special case * type fixes * return a deepcopy * try again * one more * Try again * try fixing _in_memory error by not returning the adapter class * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove scope=module from fixture for robustness * modify test to be happy with either error message * use Variable instead of Dataset to avoid race condition of indexing between different variables * fix bad merge in API docs * add test to test_backends.py * fix bug found by new test, causing pandas indexes to be converted to numpy arrays * add test to test_variable.py for lazy async indexing * move async load tests from test_async.py to test_backends.py * parametrize all tests over zarr_format * remove test_async.py file entirely * lint * Stricter assertions Co-authored-by: Deepak Cherian <[email protected]> * Update doc/whats-new.rst Co-authored-by: Deepak Cherian <[email protected]> * add docstring for Variable.load_async * make all load-related docstrings consistent * note about users being responsible for limiting concurrency * remove parametrization over zarr_format * account for Dataset having multiple lazy vars * refactor test parametrization to use pytest.param(..., id=...) syntax * refactor TestBackendIndexing to combine sync and async checks in one function * move test_load_async onto test base class * should fix mypy error * add back in the parametrize_zarr_format to avoid trying to write v3 data using zarr-python v2 * parametrize test over async --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Deepak Cherian <[email protected]> Co-authored-by: Deepak Cherian <[email protected]>
Array
has.oindex
and.vindex
methods, butAsyncArray
has no equivalent. This PR adds them. It only adds the get methods, not the set methods, which I thought could be deferred to a follow-up PR.I want it for pydata/xarray#10327 (comment)
TODO:
docs/user-guide/*.rst
changes/