Add support for fancy indexing on get/setitem #725
Conversation
Add a commit like 03dce69 to this PR? |
But is the next version 2.8.2 or 2.9? =) Or 3.0! Surely this momentous change merits such an upgrade. 😂 |
🙄 (:smile:) I'd think 2.8.2 but could get behind 2.9 as well. |
Codecov Report
@@           Coverage Diff           @@
##           master     #725   +/-   ##
=======================================
  Coverage   99.94%   99.94%
=======================================
  Files          31       31
  Lines       10936    10986   +50
=======================================
+ Hits        10930    10980   +50
  Misses          6        6
|
Yeah, 2.9 makes sense. It's a new feature to some extent. Am hoping that 3.0 will also be when we are using the v3 spec 😉 |
Ok, all done. I used 2.9.0 based on @jakirkham's logic that enhancements belong in minor version bumps, not patch. This should be ready for a full review now. 🙏 |
kind ping! 🙏 Just rebased on latest master. |
Assuming there are no objections by tomorrow, I'm inclined to move forward with this. |
I don't think Zarr should fall back to using We could definitely leverage |
Thanks @shoyer. Do you have an example handy? And does this count as a reversal of #657 (comment) ? |
This is probably the canonical edge case:

I think I'm being consistent with my previous suggestion to "copy NumPy's behavior for fancy indexing" here :). It's just important to recognize that

So as long as that edge case is avoided, we could support "array only" fancy indexing. (This would probably be a good place to start.) Or we could support mixed slice/array indexing, but only if we're careful to re-order array dimensions the same way that NumPy does. |
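For reference (an editorial illustration, not part of the original thread), the reordering behaviour being referred to can be seen on a plain NumPy array:

import numpy as np

a = np.zeros((10, 20, 30))

# Adjacent advanced indexes: the broadcast index dimension stays in place.
print(a[:, [0, 1], [0, 1]].shape)   # (10, 2)

# Advanced indexes separated by a slice: NumPy moves the broadcast dimension
# to the front, which is the surprising case mentioned above.
print(a[[0, 1], :, [0, 1]].shape)   # (2, 20): the index dimension comes first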
That matches:
right? which if I remove the
full run
|
@shoyer I didn't really know the best way to "check" that we were in "array only" fancy indexing mode, since arrays and scalars can be broadcast together. Also, are lists supported? |
or should I just check that there are no instances of |
NumPy says: "When there is at least one slice (:), ellipsis (...) or newaxis in the index (or the array has more dimensions than there are advanced indexes)". So this is the case to avoid. |
From NumPy's perspective:
|
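As an editorial sketch of the kind of "array only" check being discussed (an illustration, not Zarr's actual is_pure_fancy_indexing implementation), one could require one array-or-integer entry per dimension and at least one array, so that slices, Ellipsis, and implicit trailing slices are all rejected:

import numbers
import numpy as np

def is_pure_fancy_indexing_sketch(selection, ndim):
    # Treat anything shorter than ndim as mixed indexing: the missing
    # dimensions are implicitly sliced with slice(None).
    if not isinstance(selection, tuple):
        selection = (selection,)
    if len(selection) != ndim:
        return False
    def array_or_int(s):
        return isinstance(s, (list, np.ndarray, numbers.Integral))
    # Pure fancy indexing: every entry broadcasts as an integer array/scalar,
    # and at least one entry is an actual array (otherwise it is basic indexing).
    return (all(array_or_int(s) for s in selection)
            and any(isinstance(s, (list, np.ndarray)) for s in selection))

print(is_pure_fancy_indexing_sketch(([1, 2, 3], [0, 0, 0]), 2))    # True
print(is_pure_fancy_indexing_sketch(([1, 2, 3], slice(None)), 2))  # False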
Ok I think I've caught that case and raised an appropriate error then. Let me know if that meets requirements now! 🤞 |
def test_fancy_indexing_doesnt_mix_with_slicing():
    z = zarr.zeros((20, 20))
    with pytest.raises(IndexError):
        z[[1, 2, 3], :] = 2
Another case worth checking would be something like:
z = zarr.zeros((20, 20, 20))
with pytest.raises(IndexError):
    z[[1, 2, 3], 0] = 2
This doesn't look like mixed indexing but it actually is because of the implicit slice at the end.
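For reference (not from the thread itself), the same selection applied to a plain NumPy array shows the implicit trailing slice at work:

import numpy as np

z = np.zeros((20, 20, 20))
# The last axis is implicitly indexed with slice(None), so this mixes advanced
# and basic indexing even though no ':' appears in the expression.
print(z[[1, 2, 3], 0].shape)   # (3, 20)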
great point! 😬 Will work on this.
zarr/core.py
Outdated
try:
    result = self.get_basic_selection(pure_selection, fields=fields)
except IndexError:
    result = self.vindex[selection]
I think this should probably have the same check you put in __setitem__?
zarr/core.py
Outdated
if (isinstance(pure_selection, tuple)
        and any(isinstance(elem, slice) for elem in pure_selection)
        ):
I think you probably have to "expand" the indexer to a tuple with length equal to the number of array dimensions (i.e., by replacing Ellipsis and padding by slice(None)) in order to determine if vindex will work.
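A rough sketch of that expansion (an illustration only, not Zarr's or Xarray's actual helper, which is linked a few comments below): replace a single Ellipsis and pad with slice(None) until the indexer has one entry per array dimension.

def expanded_indexer_sketch(key, ndim):
    if not isinstance(key, tuple):
        key = (key,)
    new_key = []
    for k in key:
        if k is Ellipsis:
            # Expand a single Ellipsis into however many full slices are needed.
            new_key.extend([slice(None)] * (ndim + 1 - len(key)))
        else:
            new_key.append(k)
    # Pad implicit trailing dimensions so mixed indexing becomes visible.
    new_key.extend([slice(None)] * (ndim - len(new_key)))
    return tuple(new_key)

print(expanded_indexer_sketch(([1, 2, 3], Ellipsis), 3))
# ([1, 2, 3], slice(None, None, None), slice(None, None, None))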
Thanks @shoyer, very good point. Is there a function to do this already in zarr?
actually, how about just checking that the length of the tuple matches the dimension of the array?
Here's a helper I wrote for Xarray:
https://github.com/pydata/xarray/blob/234b40a37e484a795e6b12916315c80d70570b27/xarray/core/indexing.py#L31
Conceivably you could copy it into Zarr as long as you are compliant with the Apache 2.0 license.
There may well be something like this in Zarr already, I'm not too familiar with the Zarr codebase here.
Ok, take 3. 😂 This time I went with an explicit check for fancy indexing rather than a fallback, now that I understand the exact requirements better. For the curious, the if-statement adds 8µs to the indexing operation, which normally takes about 150µs at least (that's when everything is in-memory and we are grabbing a single value), for a ~5% worst-case slowdown. I hope that's acceptable.

In [6]: arr = zarr.zeros((5, 5))

In [7]: %timeit arr[0, 1] = 4
157 µs ± 5.87 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [8]: idx.is_pure_fancy_indexing((0, 1), 2)
Out[8]: False

In [9]: %timeit idx.is_pure_fancy_indexing((0, 1), 2)
8.42 µs ± 430 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Two other questions:
|
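For context, the dispatch described above might look roughly like this (a sketch that approximates, but is not copied from, the PR's actual __getitem__; pop_fields and is_pure_fancy_indexing are the zarr.indexing helpers referenced in this thread):

from zarr.indexing import is_pure_fancy_indexing, pop_fields

def getitem_sketch(array, selection):
    fields, pure_selection = pop_fields(selection)
    # Explicit check: route pure fancy selections to vindex, and everything
    # else to the basic-selection path, instead of catching IndexError.
    if is_pure_fancy_indexing(pure_selection, array.ndim):
        return array.vindex[selection]
    return array.get_basic_selection(pure_selection, fields=fields)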
not sure why the CI isn't running, too... 🤷 |
nvmd just read the notice on top of the checks. 🤦 |
Ok I've rebased on master. The failure in the previous CI run was:
assert a[42] == z[np.uint64(42)] in TestArrayWithFSStorePartialRead.test_array_1d

________________ TestArrayWithFSStorePartialRead.test_array_1d _________________
self = <zarr.tests.test_core.TestArrayWithFSStorePartialRead testMethod=test_array_1d>
def test_array_1d(self):
a = np.arange(1050)
z = self.create_array(shape=a.shape, chunks=100, dtype=a.dtype)
# check properties
assert len(a) == len(z)
assert a.ndim == z.ndim
assert a.shape == z.shape
assert a.dtype == z.dtype
assert (100,) == z.chunks
assert a.nbytes == z.nbytes
assert 11 == z.nchunks
assert 0 == z.nchunks_initialized
assert (11,) == z.cdata_shape
# check empty
b = z[:]
assert isinstance(b, np.ndarray)
assert a.shape == b.shape
assert a.dtype == b.dtype
# check attributes
z.attrs['foo'] = 'bar'
assert 'bar' == z.attrs['foo']
# set data
z[:] = a
# check properties
assert a.nbytes == z.nbytes
assert 11 == z.nchunks
assert 11 == z.nchunks_initialized
# check slicing
assert_array_equal(a, np.array(z))
assert_array_equal(a, z[:])
assert_array_equal(a, z[...])
# noinspection PyTypeChecker
assert_array_equal(a, z[slice(None)])
assert_array_equal(a[:10], z[:10])
assert_array_equal(a[10:20], z[10:20])
assert_array_equal(a[-10:], z[-10:])
assert_array_equal(a[:10, ...], z[:10, ...])
assert_array_equal(a[10:20, ...], z[10:20, ...])
assert_array_equal(a[-10:, ...], z[-10:, ...])
assert_array_equal(a[..., :10], z[..., :10])
assert_array_equal(a[..., 10:20], z[..., 10:20])
assert_array_equal(a[..., -10:], z[..., -10:])
# ...across chunk boundaries...
assert_array_equal(a[:110], z[:110])
assert_array_equal(a[190:310], z[190:310])
assert_array_equal(a[-110:], z[-110:])
# single item
assert a[0] == z[0]
assert a[-1] == z[-1]
# unusual integer items
assert a[42] == z[np.int64(42)]
assert a[42] == z[np.int32(42)]
> assert a[42] == z[np.uint64(42)]
zarr/tests/test_core.py:212:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
zarr/core.py:675: in __getitem__
result = self.vindex[selection]
zarr/indexing.py:818: in __getitem__
return self.array.get_coordinate_selection(selection, fields=fields)
zarr/core.py:1032: in get_coordinate_selection
out = self._get_selection(indexer=indexer, out=out, fields=fields)
zarr/core.py:1141: in _get_selection
self._chunk_getitems(lchunk_coords, lchunk_selection, out, lout_selection,
zarr/core.py:1868: in _chunk_getitems
self._process_chunk(
zarr/core.py:1752: in _process_chunk
index_selection = PartialChunkIterator(chunk_selection, self.chunks)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <zarr.indexing.PartialChunkIterator object at 0x7f1ef83bec70>
selection = [slice(42, 43.0, 1)], arr_shape = (100,)
    def __init__(self, selection, arr_shape):
        selection = make_slice_selection(selection)
        self.arr_shape = arr_shape
        # number of selection dimensions can't be greater than the number of chunk dimensions
        if len(selection) > len(self.arr_shape):
            raise ValueError(
                "Selection has more dimensions then the array:\n"
                f"selection dimensions = {len(selection)}\n"
                f"array dimensions = {len(self.arr_shape)}"
            )
        # any selection can not be out of the range of the chunk
>       selection_shape = np.empty(self.arr_shape)[tuple(selection)].shape
E       TypeError: slice indices must be integers or None or have an __index__ method
zarr/indexing.py:958: TypeError

which I can't reproduce locally:
I hope this run passes but if not, any suggestions about what is going on are most welcome! |
... Nope. 😭 Anyone got any ideas? |
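One plausible reading of the slice(42, 43.0, 1) in that traceback (an inference, not a diagnosis confirmed in the thread): under NumPy's legacy, pre-NEP 50 promotion rules, adding a Python int to a np.uint64 yields a float64, which makes the computed slice stop a float and later trips the "slice indices must be integers" TypeError.

import numpy as np

i = np.uint64(42)
# With pre-NEP 50 NumPy (as used at the time of this PR):
print(type(i + 1))          # <class 'numpy.float64'>
print(slice(i, i + 1, 1))   # slice(42, 43.0, 1)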
self.shape is a property that hides a lot of computation and, more importantly, it can be waiting for an update, so .ndim *cannot* be accessed during a reshape/append. See: zarr-developers#725 (comment). This should prevent that behavior.
@joshmoore 🎉 🎉 🎉 Probably should have done this a while back. 😂 I'll revert the pytest commits and then hopefully 🤞 this can go in? |
... Or maybe we want to keep the timeouts? They're kinda handy. |
Latest changes & test status are looking good. Also bubbling up #725 (comment) in case there were any opinions:
For the curious, the if-statement adds 8µs to the indexing operation, which normally takes about 150µs at least (that's when everything is in-memory and we are grabbing a single value), for a ~5% worst-case slowdown. I hope that's acceptable.
@@ -340,7 +341,7 @@ def attrs(self):

     @property
     def ndim(self):
         """Number of dimensions."""
-        return len(self.shape)
+        return len(self._shape)
Note from zulip: this fixed the weird timeout issue, though we still don't know why the deadlock was platform specific.
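A toy illustration of the distinction being relied on (names here are hypothetical, not Zarr's implementation): ndim reads the cached _shape attribute directly, while the shape property may refresh metadata and, per the commit message above, can block while a resize/append is pending.

class ToyArray:
    def __init__(self, shape):
        self._shape = shape

    def _refresh_metadata(self):
        pass  # hypothetical stand-in for the potentially blocking metadata update

    @property
    def shape(self):
        self._refresh_metadata()
        return self._shape

    @property
    def ndim(self):
        # Read the cached attribute instead of going through the shape property.
        return len(self._shape)

print(ToyArray((20, 20)).ndim)  # 2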
Merging now that 2.10.2 is released (fixing #840). I'll look into getting a 2.11rc1 released soon. |
🎉🚀!!!! |
* Fall back on .vindex when basic indexing fails

  Addresses #657

  This matches NumPy behaviour in that basic, boolean, and vectorized integer (fancy) indexing are all accessible from `__{get,set}item__`. Users still have access to all the indexing methods if they want to be sure to use only basic indexing (integer + slices).

* Fix basic selection test now with no IndexError
* Fix basic_selection_2d test with no vindex error
* Add specific test for fancy indexing fallback
* Update get/setitem docstrings
* Update tutorial.rst
* PEP8 fix
* Rename test array to z as in other tests
* Add release note
* Avoid mixing slicing and array indexing in setitem
* Actually test for fancy index rather than try/except
* Add check for 1D fancy index (no tuple)
* Add tests for implicit fancy indexing, and getitem
* Add expected blank line
* Add strict test for make_slice_selection
* Ensure make_slice_selection returns valid NumPy slices
* Make pytest verbose to see what is failing in windows
* Add 5 min per-test timeout
* Use private self._shape when determining ndim

  self.shape is a property that hides a lot of computation and, more importantly, it can be waiting for an update, so .ndim *cannot* be accessed during a reshape/append. See: zarr-developers/zarr-python#725 (comment). This should prevent that behavior.

Co-authored-by: Josh Moore <[email protected]>
Addresses #657
This matches NumPy behaviour in that basic, boolean, and vectorized integer (fancy) indexing are all accessible from __{get,set}item__. Users still have access to all the indexing methods if they want to be sure to use only basic indexing (integer + slices).

I'm not 100% sure about the approach, but it seemed much easier to use a try/except than to try to detect all the cases when fancy indexing should be used. Happy to hear some guidance about how best to arrange that.
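As a quick illustration of what the change enables (a sketch based on the description above, not taken from the PR's test suite):

import zarr

z = zarr.zeros((5, 5), chunks=(2, 2))
# Vectorized (fancy) integer indexing now works directly through
# __getitem__/__setitem__ instead of requiring the .vindex accessor.
z[[0, 2, 4], [0, 2, 4]] = 1          # previously: z.vindex[[0, 2, 4], [0, 2, 4]] = 1
print(z[[0, 2, 4], [0, 2, 4]])       # [1. 1. 1.]

# The explicit selection methods remain for users who want basic indexing only.
print(z.get_basic_selection((0, slice(None))))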
I still need to update docstrings + docs, will do that now — thanks for the checklist below. 😂
TODO: