Skip to content

Add support for fancy indexing on get/setitem #725

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 21 commits into from
Oct 19, 2021

Conversation

jni
Copy link
Contributor

@jni jni commented Apr 28, 2021

Addresses #657

This matches NumPy behaviour in that basic, boolean, and vectorized integer (fancy) indexing are all accessible from __{get,set}item__. Users still have access to all the indexing methods if they want to be sure to use only basic indexing (integer + slices).

I'm not 100% sure about the approach, but it seemed much easier to use a try/except than to try to detect all the cases when fancy indexing should be used. Happy to hear some guidance about how best to arrange that.

I still need to update docstrings + docs, will do that now — thanks for the checklist below. 😂

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@pep8speaks
Copy link

pep8speaks commented Apr 28, 2021

Hello @jni! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-10-19 19:51:41 UTC

@jni
Copy link
Contributor Author

jni commented Apr 28, 2021

Questions for maintainers:

  • how to update release.rst
  • do I need to update any of the notebooks? git grep "unsupported selection item for basic indexing" has some hits in there.
  • have I missed anything?

Thanks! 😊 Here's a picture of me painting into a zarr array with napari. 😃

Screen Shot 2021-04-28 at 3 40 07 pm

@joshmoore
Copy link
Member

  • how to update release.rst

Add a commit like 03dce69 to this PR?

@jni
Copy link
Contributor Author

jni commented Apr 28, 2021

But is the next version 2.8.2 or 2.9? =) Or 3.0! Surely this momentous change merits such an upgrade. 😂

@joshmoore
Copy link
Member

🙄 (:smile:)

I'd think 2.8.2 but could get behind 2.9 as well.

@codecov
Copy link

codecov bot commented Apr 28, 2021

Codecov Report

Merging #725 (cf9051c) into master (4f8cb35) will increase coverage by 0.00%.
The diff coverage is 100.00%.

❗ Current head cf9051c differs from pull request most recent head b9c8f59. Consider uploading reports for the commit b9c8f59 to get more accurate results

@@           Coverage Diff           @@
##           master     #725   +/-   ##
=======================================
  Coverage   99.94%   99.94%           
=======================================
  Files          31       31           
  Lines       10936    10986   +50     
=======================================
+ Hits        10930    10980   +50     
  Misses          6        6           
Impacted Files Coverage Δ
zarr/core.py 100.00% <100.00%> (ø)
zarr/indexing.py 100.00% <100.00%> (ø)
zarr/tests/test_indexing.py 100.00% <100.00%> (ø)

@jakirkham
Copy link
Member

Yeah 2.9 makes sense. It's a new feature to some extent

Am hoping that 3.0 will also be when we are using the v3 spec 😉

@jni
Copy link
Contributor Author

jni commented Apr 29, 2021

Ok, all done. I used 2.9.0 based on @jakirkham's logic that enhancements belong in minor version bumps, not patch. This should be ready for a full review now. 🙏

@jni jni force-pushed the fancy-indexing branch from ccb4b05 to 582f430 Compare May 4, 2021 00:32
@jni
Copy link
Contributor Author

jni commented May 4, 2021

kind ping! 🙏 Just rebased on latest master.

@joshmoore
Copy link
Member

Looking at #657, @shoyer, any thoughts?

@joshmoore
Copy link
Member

import numpy as np
import pytest as pt
import zarr as zr


@pt.fixture
def na():
    return np.arange(60).reshape((3, 4, 5))


@pt.fixture
def za(na):
    return zr.array(data=na)


def test_base(na, za):
    np.equal(na, za)


def test_base(na, za):
    np.equal(na[0], za[0])
    np.equal(na[1], za[1])
    np.equal(na[2, 3], za[2, 3])


@pt.mark.skip
def test_integer_array(na, za):
    print(na[[0, 0]], za[[0, 0]])


def test_int_arr_arr(na, za):
    x = [0, 0, 0]
    y = [1, 1, 1]
    z = [2, 2, 2]
    np.equal(na[x, y, z], za[x, y, z])


def test_simple(na, za):
    np.equal(na[0, 1, [2, 3]], za[0, 1, [2, 3]])
    np.equal(na[[0, 1], 1, [2, 3]], za[[0, 1], 1, [2, 3]])


@pt.mark.skip
def test_slicing(na, za):
    np.equal(na[1:, 2:, [3, 4]], za[1:, 2:, [3, 4]])

Assuming there are no objections by tomorrow, I'm inclined to move forward with this.

@shoyer
Copy link
Contributor

shoyer commented May 4, 2021

I don't think Zarr should fall back to using vindex. There are some cases where the dimension order from vindex differs from NumPy, so this could perpetuate exactly the sort of confusion we were trying to solve with vindex :).

We could definitely leverage vindex for normal array indexing, but we should be sure reorder dimensions in a way that is consistent with NumPy.

@joshmoore
Copy link
Member

There are some cases where the dimension order from vindex differs from NumPy,...

Thanks @shoyer. Do you have an example handy? And does this count as a reversal of #657 (comment) ?

@shoyer
Copy link
Contributor

shoyer commented May 4, 2021

There are some cases where the dimension order from vindex differs from NumPy,...

Thanks @shoyer. Do you have an example handy? And does this count as a reversal of #657 (comment) ?

This is probably the canonical edge case:
https://numpy.org/neps/nep-0021-advanced-indexing.html#mixed-indexing

I think I'm being consistent with my previous suggestion to "copy NumPy's behavior for fancy indexing" here :). It's just important to recognize that vindex is not consistent with NumPy's behavior for fancy indexing, in the case where slices and arrays are mixed.

So as long as that edge case is avoided, we could support "array only" fancy indexing. (This would probably be a good place to start.)

Or we could support mixed slice/array indexing, but only if we're careful to re-order array dimensions the same way that NumPy does.

@joshmoore
Copy link
Member

joshmoore commented May 5, 2021

That matches:

@pt.mark.skip
def test_slicing(na, za):
    np.equal(na[1:, 2:, [3, 4]], za[1:, 2:, [3, 4]])

right? which if I remove the skip fails with

IndexError: unsupported selection item for basic indexing; expected integer or slice, got <class 'list'>

full run
(base) /tmp $pytest -svk slicing t.py
========================================================= test session starts =========================================================
platform darwin -- Python 3.7.6, pytest-6.2.2, py-1.10.0, pluggy-0.13.1 -- /usr/local/anaconda3/bin/python
cachedir: .pytest_cache
rootdir: /private/tmp
plugins: testinfra-3.4.0, napari-plugin-engine-0.1.7
collected 5 items / 4 deselected / 1 selected

t.py::test_slicing FAILED

============================================================== FAILURES ===============================================================
____________________________________________________________ test_slicing _____________________________________________________________

na = array([[[ 0,  1,  2,  3,  4],
        [ 5,  6,  7,  8,  9],
        [10, 11, 12, 13, 14],
        [15, 16, 17, 18, 19]...     [[40, 41, 42, 43, 44],
        [45, 46, 47, 48, 49],
        [50, 51, 52, 53, 54],
        [55, 56, 57, 58, 59]]])
za = <zarr.core.Array (3, 4, 5) int64>

    def test_slicing(na, za):
>       np.equal(na[1:, 2:, [3, 4]], za[1:, 2:, [3, 4]])

t.py:44:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/usr/local/anaconda3/lib/python3.7/site-packages/zarr/core.py:571: in __getitem__
    return self.get_basic_selection(selection, fields=fields)
/usr/local/anaconda3/lib/python3.7/site-packages/zarr/core.py:697: in get_basic_selection
    fields=fields)
/usr/local/anaconda3/lib/python3.7/site-packages/zarr/core.py:737: in _get_basic_selection_nd
    indexer = BasicIndexer(selection, self)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <zarr.indexing.BasicIndexer object at 0x7fc3c8217950>, selection = (slice(1, None, None), slice(2, None, None), [3, 4])
array = <zarr.core.Array (3, 4, 5) int64>

    def __init__(self, selection, array):

        # handle ellipsis
        selection = replace_ellipsis(selection, array._shape)

        # setup per-dimension indexers
        dim_indexers = []
        for dim_sel, dim_len, dim_chunk_len in \
                zip(selection, array._shape, array._chunks):

            if is_integer(dim_sel):
                dim_indexer = IntDimIndexer(dim_sel, dim_len, dim_chunk_len)

            elif is_slice(dim_sel):
                dim_indexer = SliceDimIndexer(dim_sel, dim_len, dim_chunk_len)

            else:
                raise IndexError('unsupported selection item for basic indexing; '
                                 'expected integer or slice, got {!r}'
>                                .format(type(dim_sel)))
E               IndexError: unsupported selection item for basic indexing; expected integer or slice, got <class 'list'>

/usr/local/anaconda3/lib/python3.7/site-packages/zarr/indexing.py:285: IndexError
======================================================= short test summary info =======================================================
FAILED t.py::test_slicing - IndexError: unsupported selection item for basic indexing; expected integer or slice, got <class 'list'>
=================================================== 1 failed, 4 deselected in 4.60s ===================================================
(base) /tmp $

@jni
Copy link
Contributor Author

jni commented May 6, 2021

So as long as that edge case is avoided, we could support "array only" fancy indexing. (This would probably be a good place to start.)

@shoyer I didn't really know the best way to "check" that we were in "array only" fancy indexing mode, since arrays and scalars can be broadcast together. Also, are lists supported?

@jni
Copy link
Contributor Author

jni commented May 6, 2021

or should I just check that there are no instances of slice in the indexing tuple?

@shoyer
Copy link
Contributor

shoyer commented May 6, 2021

or should I just check that there are no instances of slice in the indexing tuple?

NumPy says "When there is at least one slice (:), ellipsis (...) or newaxis in the index (or the array has more dimensions than there are advanced indexes)"
https://numpy.org/doc/stable/reference/arrays.indexing.html#combining-advanced-and-basic-indexing

So this is the case to avoid.

@shoyer
Copy link
Contributor

shoyer commented May 6, 2021

@shoyer I didn't really know the best way to "check" that we were in "array only" fancy indexing mode, since arrays and scalars can be broadcast together. Also, are lists supported?

From NumPy's perspective:

  • lists are arrays
  • scalars are arrays if doing "mixed" or "advanced indexing" (if there is at least one other array)

@jni
Copy link
Contributor Author

jni commented May 11, 2021

Ok I think I've caught that case and raised an appropriate error then. Let me know if that meets requirements now! 🤞

Comment on lines +310 to +320
def test_fancy_indexing_doesnt_mix_with_slicing():
z = zarr.zeros((20, 20))
with pytest.raises(IndexError):
z[[1, 2, 3], :] = 2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another case would worth checking would be something like:

    z = zarr.zeros((20, 20, 20))
    with pytest.raises(IndexError):
        z[[1, 2, 3], 0] = 2

This doesn't look like mixed indexing but it actually is because of the implicit slice at the end.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great point! 😬 Will work on this.

zarr/core.py Outdated
Comment on lines 671 to 675
try:
result = self.get_basic_selection(pure_selection, fields=fields)
except IndexError:
result = self.vindex[selection]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should probably have the same check you put in __setitem__?

zarr/core.py Outdated
Comment on lines 1237 to 1239
if (isinstance(pure_selection, tuple)
and any(isinstance(elem, slice) for elem in pure_selection)
):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you probably have to "expand" the indexer to a tuple with length equal to the number of array dimensions (i.e., by replacing Ellipsis and padding by slice(None)) in order to determine if vindex will work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @shoyer, very good point. Is there a function to do this already in zarr?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, how about just checking that the length of the tuple matches the dimension of the array?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's a helper I wrote for Xarray:
https://github.com/pydata/xarray/blob/234b40a37e484a795e6b12916315c80d70570b27/xarray/core/indexing.py#L31

Conceivably you could copy it into Zarr as long as you are compliant with the Apache 2.0 license.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There may well be something like this in Zarr already, I'm not too familiar with the Zarr codebase here.

@jni
Copy link
Contributor Author

jni commented May 11, 2021

Ok, take 3. 😂 This time I went with an explicit check for fancy indexing rather than a fallback, now that I understand the exact requirements better.

For the curious, the if-statement adds 8µs to the indexing operation, which normally takes about 150µs at least (that's when everything is in-memory and we are grabbing a single value), for a ~5% worst-case slowdown. I hope that's acceptable.

In [6]: arr = zarr.zeros((5, 5))

In [7]: %timeit arr[0, 1] = 4
157 µs ± 5.87 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [8]: idx.is_pure_fancy_indexing((0, 1), 2)
Out[8]: False

In [9]: %timeit idx.is_pure_fancy_indexing((0, 1), 2)
8.42 µs ± 430 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Two other questions:

  1. It turns out that zarr.indexing.is_integer(x) returns True for bools. Is this desired/expected? Or should I add a check to exclude them explicitly?
  2. For lists, I have only added a check that the first element is an int, since I don't think we want a check to take O(n) time. Alternatives welcome.

@jni
Copy link
Contributor Author

jni commented May 12, 2021

not sure why the CI isn't running, too... 🤷

@jni
Copy link
Contributor Author

jni commented May 12, 2021

nvmd just read the notice on top of the checks. 🤦

@jni jni force-pushed the fancy-indexing branch from eb39593 to 1aa575d Compare May 18, 2021 10:20
@jni jni force-pushed the fancy-indexing branch from 1aa575d to ce04dd3 Compare May 31, 2021 04:49
@jni
Copy link
Contributor Author

jni commented May 31, 2021

Ok I've rebased on master. The failure in the previous CI run was:

assert a[42] == z[np.uint64(42)] in TestArrayWithFSStorePartialRead.test_array_1d

________________ TestArrayWithFSStorePartialRead.test_array_1d _________________

self = <zarr.tests.test_core.TestArrayWithFSStorePartialRead testMethod=test_array_1d>

    def test_array_1d(self):
        a = np.arange(1050)
        z = self.create_array(shape=a.shape, chunks=100, dtype=a.dtype)
    
        # check properties
        assert len(a) == len(z)
        assert a.ndim == z.ndim
        assert a.shape == z.shape
        assert a.dtype == z.dtype
        assert (100,) == z.chunks
        assert a.nbytes == z.nbytes
        assert 11 == z.nchunks
        assert 0 == z.nchunks_initialized
        assert (11,) == z.cdata_shape
    
        # check empty
        b = z[:]
        assert isinstance(b, np.ndarray)
        assert a.shape == b.shape
        assert a.dtype == b.dtype
    
        # check attributes
        z.attrs['foo'] = 'bar'
        assert 'bar' == z.attrs['foo']
    
        # set data
        z[:] = a
    
        # check properties
        assert a.nbytes == z.nbytes
        assert 11 == z.nchunks
        assert 11 == z.nchunks_initialized
    
        # check slicing
        assert_array_equal(a, np.array(z))
        assert_array_equal(a, z[:])
        assert_array_equal(a, z[...])
        # noinspection PyTypeChecker
        assert_array_equal(a, z[slice(None)])
        assert_array_equal(a[:10], z[:10])
        assert_array_equal(a[10:20], z[10:20])
        assert_array_equal(a[-10:], z[-10:])
        assert_array_equal(a[:10, ...], z[:10, ...])
        assert_array_equal(a[10:20, ...], z[10:20, ...])
        assert_array_equal(a[-10:, ...], z[-10:, ...])
        assert_array_equal(a[..., :10], z[..., :10])
        assert_array_equal(a[..., 10:20], z[..., 10:20])
        assert_array_equal(a[..., -10:], z[..., -10:])
        # ...across chunk boundaries...
        assert_array_equal(a[:110], z[:110])
        assert_array_equal(a[190:310], z[190:310])
        assert_array_equal(a[-110:], z[-110:])
        # single item
        assert a[0] == z[0]
        assert a[-1] == z[-1]
        # unusual integer items
        assert a[42] == z[np.int64(42)]
        assert a[42] == z[np.int32(42)]
>       assert a[42] == z[np.uint64(42)]

zarr/tests/test_core.py:212: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
zarr/core.py:675: in __getitem__
    result = self.vindex[selection]
zarr/indexing.py:818: in __getitem__
    return self.array.get_coordinate_selection(selection, fields=fields)
zarr/core.py:1032: in get_coordinate_selection
    out = self._get_selection(indexer=indexer, out=out, fields=fields)
zarr/core.py:1141: in _get_selection
    self._chunk_getitems(lchunk_coords, lchunk_selection, out, lout_selection,
zarr/core.py:1868: in _chunk_getitems
    self._process_chunk(
zarr/core.py:1752: in _process_chunk
    index_selection = PartialChunkIterator(chunk_selection, self.chunks)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <zarr.indexing.PartialChunkIterator object at 0x7f1ef83bec70>
selection = [slice(42, 43.0, 1)], arr_shape = (100,)

    def __init__(self, selection, arr_shape):
        selection = make_slice_selection(selection)
        self.arr_shape = arr_shape
    
        # number of selection dimensions can't be greater than the number of chunk dimensions
        if len(selection) > len(self.arr_shape):
            raise ValueError(
                "Selection has more dimensions then the array:\n"
                f"selection dimensions = {len(selection)}\n"
                f"array dimensions = {len(self.arr_shape)}"
            )
    
        # any selection can not be out of the range of the chunk
>       selection_shape = np.empty(self.arr_shape)[tuple(selection)].shape
E       TypeError: slice indices must be integers or None or have an __index__ method

zarr/indexing.py:958: TypeError

which I can't reproduce locally:

 $ pytest zarr/tests/test_core.py::TestArrayWithFSStorePartialRead
============================= test session starts ==============================
platform darwin -- Python 3.8.2, pytest-6.2.2, py-1.10.0, pluggy-0.13.1
PySide2 5.14.2.1 -- Qt runtime 5.14.2 -- Qt compiled 5.14.2
rootdir: /Users/jni/projects/zarr-python, configfile: pytest.ini
plugins: timeout-1.3.4, localserver-0.5.0, napari-plugin-engine-0.1.9, hypothesis-6.0.0, mypy-plugins-1.6.1, ordering-0.6, order-0.11.0, qt-3.3.0, cov-2.8.1, napari-0.4.8.dev133+g46f295b5
collected 48 items

zarr/tests/test_core.py ................................................ [100%]

I hope this run passes but if not, any suggestions about what is going on are most welcome!

@jni
Copy link
Contributor Author

jni commented May 31, 2021

... Nope. 😭 Anyone got any ideas?

@jni
Copy link
Contributor Author

jni commented Sep 29, 2021

@joshmoore 🎉 🎉 🎉 Probably should have done this a while back. 😂 I'll revert the pytest commits and then hopefully 🤞 this can go in?

@jni
Copy link
Contributor Author

jni commented Sep 29, 2021

... Or maybe we want to keep the timeouts? They're kinda handy.

@jni jni requested review from joshmoore and shoyer October 2, 2021 02:23
Copy link
Member

@joshmoore joshmoore left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Latest changes & test status are looking good. Also bubbling up #725 (comment) in case there were any opinions:

For the curious, the if-statement adds 8µs to the indexing operation, which normally takes about 150µs at least (that's when everything is in-memory and we are grabbing a single value), for a ~5% worst-case slowdown. I hope that's acceptable.

@@ -340,7 +341,7 @@ def attrs(self):
@property
def ndim(self):
"""Number of dimensions."""
return len(self.shape)
return len(self._shape)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note from zulip: this fixed the weird timeout issue, though we still don't know why the deadlock was platform specific.

@joshmoore
Copy link
Member

Merging now that 2.10.2 is released (fixing #840). I'll look into getting a 2.11rc1 released soon.

@joshmoore joshmoore merged commit 7c31f04 into zarr-developers:master Oct 19, 2021
@jni jni deleted the fancy-indexing branch October 20, 2021 00:35
@jni
Copy link
Contributor Author

jni commented Oct 20, 2021

🎉:rocket:!!!!

@jni jni mentioned this pull request Oct 20, 2021
@jakirkham jakirkham mentioned this pull request Dec 1, 2021
enthusiastdev121 added a commit to enthusiastdev121/zarr-python that referenced this pull request Aug 19, 2024
* Fall back on .vindex when basic indexing fails

Addresses #657

This matches NumPy behaviour in that basic, boolean, and vectorized
integer (fancy) indexing are all accessible from `__{get,set}item__`.
Users still have access to all the indexing methods if they want to be
sure to use only basic indexing (integer + slices).

* Fix basic selection test now with no IndexError

* Fix basic_selection_2d test with no vindex error

* Add specific test for fancy indexing fallback

* Update get/setitem docstrings

* Update tutorial.rst

* PEP8 fix

* Rename test array to z as in other tests

* Add release note

* Avoid mixing slicing and array indexing in setitem

* Actually test for fancy index rather than try/except

* Add check for 1D fancy index (no tuple)

* Add tests for implicit fancy indexing, and getitem

* Add expected blank line

* Add strict test for make_slice_selection

* Ensure make_slice_selection returns valid NumPy slices

* Make pytest verbose to see what is failing in windows

* Add 5 min per-test timeout

* Use private self._shape when determining ndim

self.shape is a property that hides a lot of computation, and, more
importantly, it can be waiting for an update and so .ndim *cannot* be
accessed during a reshape/append. See:

zarr-developers/zarr-python#725 (comment)

This should prevent that behavior.

Co-authored-by: Josh Moore <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants