Skip to content

BUG: IntervalIndex.get_loc/get_indexer wrong return value / error #25090

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 25 commits into from
Closed
Show file tree
Hide file tree
Changes from 16 commits
Commits
Show all changes
25 commits
Select commit Hold shift + click to select a range
2aca389
Revert earlier change and use to_numpy
samuelsinayoko Feb 2, 2019
2d12d2f
Remove warning by using to_numpy
samuelsinayoko Feb 2, 2019
f357101
Make warning go away
samuelsinayoko Feb 2, 2019
7cc4c37
revert initial bugfix
samuelsinayoko Feb 2, 2019
8ec653a
Add test for contains in interval index categorical
samuelsinayoko Feb 2, 2019
878802e
Check get_loc on interval index raises KeyError
samuelsinayoko Feb 2, 2019
4dabe0e
Add test for get_indexer
samuelsinayoko Feb 2, 2019
564d88d
Add a test for get_indexer with different type
samuelsinayoko Feb 2, 2019
f4c43e3
Make first two tests pass
samuelsinayoko Feb 2, 2019
a09a07e
Make third test pass
samuelsinayoko Feb 2, 2019
6c887e6
remove commented out code
samuelsinayoko Feb 3, 2019
0a143f2
Improve error message
samuelsinayoko Feb 3, 2019
0730cd6
Rename, move and parametrize indexer test
samuelsinayoko Feb 3, 2019
246eb57
Use numpy_array_equal in indexer test
samuelsinayoko Feb 3, 2019
93f75ea
Refactor interval index get_loc test
samuelsinayoko Feb 4, 2019
268db81
Fix bug introduced in earlier commit
samuelsinayoko Feb 4, 2019
a5aa1e8
Add reminder comment to use raise from for python 3
samuelsinayoko Feb 6, 2019
d480872
Include key in error message.
samuelsinayoko Feb 6, 2019
6ed1080
Add larger interval range to test
samuelsinayoko Feb 6, 2019
120e2bc
get_loc should raise KeyError if the supplied key has the wrong type
samuelsinayoko Feb 6, 2019
2c48272
Only return -1 in get_indexer for incorrect values
samuelsinayoko Feb 10, 2019
02127ff
Better tests for get_indexer_errors
samuelsinayoko Feb 10, 2019
ad13d9e
Fix broken test in test_interval
samuelsinayoko Feb 11, 2019
0ff356c
Fix broken tests in test_concat
samuelsinayoko Feb 16, 2019
9f6b5c0
Merge remote-tracking branch 'upstream/master' into 23264
samuelsinayoko Apr 10, 2019
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
27 changes: 22 additions & 5 deletions pandas/core/indexes/interval.py
Original file line number Diff line number Diff line change
Expand Up @@ -766,8 +766,11 @@ def get_loc(self, key, method=None):
key = Interval(left, right, key.closed)
else:
key = self._maybe_cast_slice_bound(key, 'left', None)

start, stop = self._find_non_overlapping_monotonic_bounds(key)
try:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll also need to do something similar in the else branch to cover the overlapping/non-monotonic case, e.g. I think something like pd.IntervalIndex.from_tuples([(1, 3), (2, 4), (0, 2)]).get_loc('foo') will still fail.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it should be the engine that should properly raise a KeyError? (eg the int64 engine does that)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still need to look into @jorisvandenbossche's comment on raising the error in the engine itself (especially if that's the behaviour for int64), but I think I've addressed everything else.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche : There is code in place within the engine that raises a KeyError, but strings queries fail before it gets there since the engine is expecting a scalar_t type (fused type consisting of numeric types) for key:

def get_loc(self, scalar_t key):

I'm not super well versed in Cython. Is there a graceful way to force this to raise a KeyError within the Cython code? Removing the scalar_t type gets a step further but still raises a TypeError as the code expects things to be comparable (probably some perf implications to removing it too).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the other engines have the key as object typed, and then afterwards do a check of that.
But for me fine as well to leave that for now, and do the check here in the level above that. But on the long term would still be good to make the behaviour consistent throughout the different engines.

start, stop = self._find_non_overlapping_monotonic_bounds(key)
except TypeError:
# get_loc should raise KeyError
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add here a comment as Tom proposed (TODO(py3): use raise from.) ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can now use raise from (as we are PY3 only)

raise KeyError('key is hashable but of incorrect type')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for key errors we typically just pass the key itself as message, so might be good to be consistent with that. Or at least I would include the key in the message, something like: "Key {0} is of the incorrect type".format(key)


if start is None or stop is None:
return slice(start, stop)
Expand All @@ -785,7 +788,10 @@ def get_loc(self, key, method=None):
left, right = _get_interval_closed_bounds(key)
return self._engine.get_loc_interval(left, right)
else:
return self._engine.get_loc(key)
try:
return self._engine.get_loc(key)
except TypeError:
raise KeyError('No engine for key {!r}'.format(key))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to mention "engine" here (that is something internal to pandas, while this error message will be visible for users)


def get_value(self, series, key):
if com.is_bool_indexer(key):
Expand Down Expand Up @@ -819,7 +825,12 @@ def get_indexer(self, target, method=None, limit=None, tolerance=None):
return np.arange(len(self), dtype='intp')

if self.is_non_overlapping_monotonic:
start, stop = self._find_non_overlapping_monotonic_bounds(target)
try:
start, stop = (
self._find_non_overlapping_monotonic_bounds(target)
)
except TypeError:
return np.repeat(np.int(-1), len(target))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jschendel I am not fully sure this will cover all cases (but what was there before also not).

target can in principle a mixture of valid and invalid keys. So maybe the easiest would be to fall back to the else branch that iterates through the elements separately in case this raises a TypeError.
That could be done by putting a pass here, and putting the return value of three lines below in the else part of a try/except/else.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry getting a bit confused here. You're saying that we should pass on line 833 here and left the code run to the else branch starting on line 851 (non IntervalIndex) where it loops over each element and appends -1 to the list if a KeyError is raised (I would probably add TypeError too).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is basically what I meant I think.

So that if you do idx.get_indexer(['a', 1]) (where 1 is a valid key), you get [-0, 1] as result instead of [-1, -1] (assuming that idx.get_loc(1) would return 1)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I've just pushed a commit implementing this. I had to make a few tweaks to the "non IntervalIndex" else branch (starting line 849 in core/indexes/interval.py) to make my two new tests pass, but hopefully haven't broken anything. See 2c48272


start_plus_one = start + 1
if not ((start_plus_one < stop).any()):
Expand All @@ -834,7 +845,13 @@ def get_indexer(self, target, method=None, limit=None, tolerance=None):

# non IntervalIndex
else:
indexer = np.concatenate([self.get_loc(i) for i in target])
vals = []
for i in target:
try:
vals.append(self.get_loc(i))
except KeyError:
vals.append(np.array([-1]))
indexer = np.concatenate(vals)

return ensure_platform_int(indexer)

Expand Down
16 changes: 16 additions & 0 deletions pandas/tests/indexes/interval/test_interval.py
Original file line number Diff line number Diff line change
Expand Up @@ -435,6 +435,14 @@ def test_get_loc_value(self):
idx = IntervalIndex.from_arrays([0, 2], [1, 3])
pytest.raises(KeyError, idx.get_loc, 1.5)

# GH25087, test get_loc returns key error for interval indexes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you put this in a new test? (you can leave it in this place, but just put a def test_get_loc_invalid_key(self) above this line)
Reason is that the other test is commented to be replaced, but this new test we want to keep.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you do this one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have got a few broken tests to fix. Hoping to make the build green in the coming days.

msg = 'key is hashable but of incorrect type'
with pytest.raises(KeyError, match=msg):
idx.get_loc('a')
idx = pd.interval_range(0, 1.0)
with pytest.raises(KeyError, match=msg):
idx.get_loc('a')

# To be removed, replaced by test_interval_new.py (see #16316, #16386)
def slice_locs_cases(self, breaks):
# TODO: same tests for more index types
Expand Down Expand Up @@ -616,6 +624,14 @@ def test_get_indexer_length_one(self, item, closed):
expected = np.array([0] * len(item), dtype='intp')
tm.assert_numpy_array_equal(result, expected)

@pytest.mark.parametrize('index', [
pd.interval_range(0, 1),
pd.IntervalIndex.from_tuples([(1, 3), (2, 4), (0, 2)])
])
def test_get_indexer_errors(self, index):
expected = np.array([-1], dtype='intp')
assert tm.assert_numpy_array_equal(index.get_indexer(['gg']), expected)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you here also test multiple values and a mixture of values? Like get_indexer(['a', 'b'] and get_indexer([1, 'a'])

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you do this one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure will look into this over the weekend.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see 02127ff


# Make consistent with test_interval_new.py (see #16316, #16386)
@pytest.mark.parametrize('arrays', [
(date_range('20180101', periods=4), date_range('20180103', periods=4)),
Expand Down
6 changes: 6 additions & 0 deletions pandas/tests/indexing/test_categorical.py
Original file line number Diff line number Diff line change
Expand Up @@ -78,6 +78,12 @@ def test_getitem_scalar(self):
result = s[cats[0]]
assert result == expected

def test_contains_interval_range(self):
"""Check we can use contains """
intervals = pd.interval_range(0.0, 1.0)
cats = pd.Categorical(intervals)
assert 'gg' not in cats

def test_slicing_directly(self):
cat = Categorical(["a", "b", "c", "d", "a", "b", "c"])
sliced = cat[3]
Expand Down