Skip to content

Conversation

TomNicholas
Copy link
Member

@TomNicholas TomNicholas commented May 23, 2025

Array has .oindex and .vindex methods, but AsyncArray has no equivalent. This PR adds them. It only adds the get methods, not the set methods, which I thought could be deferred to a follow-up PR.

I want it for pydata/xarray#10327 (comment)

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.rst
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@github-actions github-actions bot added the needs release notes Automatically applied to PRs which haven't added release notes label May 23, 2025
TomNicholas added a commit to TomNicholas/xarray that referenced this pull request May 23, 2025
@TomNicholas TomNicholas changed the title Add async oindex method to AsyncArray Add async oindex and vindex methods to AsyncArray May 29, 2025
Copy link

codecov bot commented Jul 22, 2025

Codecov Report

❌ Patch coverage is 66.66667% with 19 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.73%. Comparing base (a0c56fb) to head (6fbb6b1).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
src/zarr/core/indexing.py 64.28% 10 Missing ⚠️
src/zarr/core/array.py 70.83% 7 Missing ⚠️
src/zarr/core/common.py 60.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3083      +/-   ##
==========================================
+ Coverage   60.68%   60.73%   +0.04%     
==========================================
  Files          78       78              
  Lines        9356     9407      +51     
==========================================
+ Hits         5678     5713      +35     
- Misses       3678     3694      +16     
Files with missing lines Coverage Δ
src/zarr/core/chunk_grids.py 60.37% <ø> (+0.56%) ⬆️
src/zarr/core/common.py 49.16% <60.00%> (+0.47%) ⬆️
src/zarr/core/array.py 69.66% <70.83%> (+0.03%) ⬆️
src/zarr/core/indexing.py 68.10% <64.28%> (-0.10%) ⬇️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@TomNicholas
Copy link
Member Author

TomNicholas commented Jul 22, 2025

@dcherian suggested making the sync oindex and vindex getitem methods call the new async versions.

EDIT: I think this is already the case?

@github-actions github-actions bot removed the needs release notes Automatically applied to PRs which haven't added release notes label Jul 23, 2025
Comment on lines +1611 to +1612
@property
def oindex(self) -> AsyncOIndex[T_ArrayMetadata]:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose this API to try to follow this pattern:

  • Array.__getitem__ (exists)
  • Array.oindex.__getitem__ (exists)
  • Array.vindex.__getitem__ (exists)
  • AsyncArray.getitem (exists)
  • AsyncArray.oindex.getitem (new)
  • AsyncArray.vindex.getitem (new)

because python doesn't let you make an async version of the __getitem__ magic method.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we update the sync versions to use sync(self._async_array.oindex.getitem) instead of going to _get_selection directly? That will get us some test coverage

Copy link
Member Author

@TomNicholas TomNicholas Jul 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds very sensible, but when I try to refactor to do that I find that it causes a problem where sync() gets called from within an async coroutine. I wouldn't have thought this should happen, but here's claude's explanation of why it does:

> Explain how exactly sync is being called from within an async coroutineLooking at the error and the code, here's exactly how sync() is being called from within an async coroutine:

  The Call Chain

  1. Test calls zarr array indexing: za[zix] in test_indexing_with_zarr_array
  2. Array.getitem (src/zarr/core/array.py:2566):
  return self.vindex[cast("CoordinateSelection | MaskSelection", selection)]
  3. VIndex.getitem (src/zarr/core/indexing.py:1248):
  return sync(self.array._async_array.vindex.getitem(selection))
  4. sync() tries to run the coroutine but detects it's already in a running loop

  The Problem

  The issue is in the __array__ method call chain:

  1. VIndex.__getitem__ calls sync() with self.array._async_array.vindex.getitem(selection)
  2. This eventually leads to AsyncArray._get_selection() being called
  3. But somewhere in the process, Array.__array__() gets called (src/zarr/core/array.py:2413)
  4. __array__() calls self[...] which goes back to Array.__getitem__
  5. This creates a nested call where sync() is called while already inside an async context

  The Root Cause

  The function _zarr_array_to_int_or_bool_array() at line 85 calls np.asarray(arr), which triggers the __array__ protocol on the zarr array. This causes:

  def _zarr_array_to_int_or_bool_array(arr: Array) -> npt.NDArray[np.intp] | npt.NDArray[np.bool_]:
      if arr.dtype.kind in ("i", "b"):
          return np.asarray(arr)  # <-- This calls arr.__array__()

  When np.asarray() is called on a zarr Array, it calls Array.__array__(), which calls self[...], which eventually calls sync() again - but we're already
  inside a sync() call from the VIndex, creating the nested async context error.

  The original code before the changes avoided this by handling the zarr array conversion within the sync methods directly, rather than delegating to async
  methods that would create this nested sync situation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess for indexing with a Zarr array, we should convert to numpy array before the sync call

@TomNicholas TomNicholas requested a review from d-v-b July 28, 2025 13:31
@dcherian
Copy link
Contributor

Shall we also have the sync array getitme methods use these async methods?
Example:

return sync(
self._async_array._get_selection(
BasicIndexer(selection, self.shape, self.metadata.chunk_grid),
out=out,
fields=fields,
prototype=prototype,
)
)

@TomNicholas
Copy link
Member Author

Yea I would like to, but don't fully understand how to get that to work. So I thought I could leave that for a follow-up.

#3083 (comment)

@dcherian
Copy link
Contributor

dcherian commented Jul 29, 2025

OK but presumably that errors means async indexing with Zarr arrays also doesn't work (https://github.com/zarr-developers/zarr-python/pull/3083/files#r2231114456). Can you open an issue to track please?

@TomNicholas
Copy link
Member Author

indexing with Zarr arrays

I didn't even know it was possible to index a zarr array with another zarr array!

presumably that errors means async indexing with Zarr arrays also doesn't work

Actually I just added a test that seems to show that indexing a zarr array with a (sync) zarr array does work. I also tried indexing a zarr array with the AsyncArray (same test but using ._async_array) which raises with

        # try indexing with async zarr array
>       result = await async_zarr.oindex.getitem(z2._async_array)

tests/test_indexing.py:2061: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
src/zarr/core/indexing.py:974: in getitem
    return await self.array.get_orthogonal_selection(
src/zarr/core/array.py:1440: in get_orthogonal_selection
    indexer = OrthogonalIndexer(selection, self.shape, self.metadata.chunk_grid)
src/zarr/core/indexing.py:878: in __init__
    dim_indexer = BoolArrayDimIndexer(dim_sel, dim_len, dim_chunk_len)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <[AttributeError("'BoolArrayDimIndexer' object has no attribute 'dim_sel'") raised in repr()] BoolArrayDimIndexer object at 0x107e8e990>
dim_sel = <AsyncArray memory://4427184448/z2 shape=(2,) dtype=bool>, dim_len = 2, dim_chunk_len = 1

    def __init__(self, dim_sel: npt.NDArray[np.bool_], dim_len: int, dim_chunk_len: int) -> None:
        # check number of dimensions
        if not is_bool_array(dim_sel, 1):
            raise IndexError("Boolean arrays in an orthogonal selection must be 1-dimensional only")
    
        # check shape
        if dim_sel.shape[0] != dim_len:
            raise IndexError(
                f"Boolean array has the wrong length for dimension; expected {dim_len}, got {dim_sel.shape[0]}"
            )
    
        # precompute number of selected items for each chunk
        nchunks = ceildiv(dim_len, dim_chunk_len)
        chunk_nitems = np.zeros(nchunks, dtype="i8")
        for dim_chunk_ix in range(nchunks):
            dim_offset = dim_chunk_ix * dim_chunk_len
            chunk_nitems[dim_chunk_ix] = np.count_nonzero(
>               dim_sel[dim_offset : dim_offset + dim_chunk_len]
            )
E           TypeError: 'AsyncArray' object is not subscriptable

src/zarr/core/indexing.py:613: TypeError

If that is supposed to work I can raise an issue for it, but it doesn't seem to be the same sync problem that we were discussing before.

@dcherian
Copy link
Contributor

The Claude diagnosis points to a sync np.array being called in _zarr_array_to_ints_or_bool. (#3083 (comment)

@dcherian
Copy link
Contributor

Not sure what to do about codecov, except add more tests

@dstansby dstansby added this to the 3.1.2 milestone Jul 30, 2025
@dcherian
Copy link
Contributor

Thanks for the extra tests!

@dcherian dcherian merged commit 108ec58 into zarr-developers:main Jul 30, 2025
31 checks passed
@TomNicholas TomNicholas deleted the async_oindex branch July 30, 2025 16:56
meeseeksmachine pushed a commit to meeseeksmachine/zarr-python that referenced this pull request Jul 30, 2025
dstansby pushed a commit that referenced this pull request Jul 31, 2025
TomNicholas added a commit to pydata/xarray that referenced this pull request Aug 14, 2025
* new blank whatsnew

* test async load using special zarr LatencyStore

* don't use dask

* async all the way down

* remove assert False

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add pytest-asyncio to CI envs

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* assert results are identical

* implement async load for dataarray and dataset

* factor out common logic

* consolidate tests via a parametrized fixture

* async_load -> load_async

* make BackendArray an ABC

* explain how to add async support for any backend in the docs

* add new methods to api docs

* whatsnew

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix ci/minimum_versions.py

* fix formatting

* concurrently load different variables in ds.load_async using asyncio.gather

* test concurrent loading of multiple variables in one dataset

* fix non-awaited load_async

* rearrange test order

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add test for orthogonal indexing

* explicitly forbid orthogonal indexing

* support async orthogonal indexing via zarr-developers/zarr-python#3083

* add test for vectorized indexing (even if it doesn't work)

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add test for basic indexing

* correct test to actually use vectorized indexing

* refactor to parametrize indexing test

* implement async vectorized indexing

* revert breaking change to BackendArray

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove indirection in _ensure_cached method

* IndexingAdapters don't need async get

* Add tests

* Add decoding test

* Add IndexingAdapter mixin

* [cherry] Making decoding arrays lazy too

* parametrized over isel and sel

* mock zarr.AsyncArray.getitem in test

* tidy up the mocking

* ensure the correct zarr class's method is patched for each test

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add degenerate test case of no indexing

* factor out the Latency part of LatencyStore

* use mocks in multiple objects test

* use mocks in multiple variables test

* trim latencystore down to just what's needed to dodge zarr-developers/zarr-python#3105 (comment)

* parametrizing indexing test over xarray classes

* ensure we actually test vectorized indexing for Variable

* use create_test_data

* add @pytest.mark.asyncio

* remove outdated readonly_store

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* enable tests to run when recent version of zarr-python is not available

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* separate tests to only run on correct versions of zarr

* clear error message if async oindexing not available

* parametrize over zarr_format

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add pytest-asyncio to other test CI env

* fix some mypy errors

* use method directly when possible

* remove repeated API docs from bad merge

* fix bad merge in release note

* fix other bad merge in whatsnew

* remove prints

Co-authored-by: Deepak Cherian <[email protected]>

* remove last print statement

* test async basic indexing raises informative error before zarr-python v3.0.0

* test correct error message is raised for each indexing case

* ensure each test runs on the earliest version of xaarr it can

* remove pointless repeated getitem

* set N_LAZY_VARS correctly in test

* remove unused import

* rename flag to make it more clear its only for orthogonal and vectorized indexing

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove IndexingAdapter special case

* type fixes

* return a deepcopy

* try again

* one more

* Try again

* try fixing _in_memory error by not returning the adapter class

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove scope=module from fixture for robustness

* modify test to be happy with either error message

* use Variable instead of Dataset to avoid race condition of indexing between different variables

* fix bad merge in API docs

* add test to test_backends.py

* fix bug found by new test, causing pandas indexes to be converted to numpy arrays

* add test to test_variable.py for lazy async indexing

* move async load tests from test_async.py to test_backends.py

* parametrize all tests over zarr_format

* remove test_async.py file entirely

* lint

* Stricter assertions

Co-authored-by: Deepak Cherian <[email protected]>

* Update doc/whats-new.rst

Co-authored-by: Deepak Cherian <[email protected]>

* add docstring for Variable.load_async

* make all load-related docstrings consistent

* note about users being responsible for limiting concurrency

* remove parametrization over zarr_format

* account for Dataset having multiple lazy vars

* refactor test parametrization to use pytest.param(..., id=...) syntax

* refactor TestBackendIndexing to combine sync and async checks in one function

* move test_load_async onto test base class

* should fix mypy error

* add back in the parametrize_zarr_format to avoid trying to write v3 data using zarr-python v2

* parametrize test over async

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Deepak Cherian <[email protected]>
Co-authored-by: Deepak Cherian <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants