Skip to content
Merged
Show file tree
Hide file tree
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v0.23.0.txt
Original file line number Diff line number Diff line change
Expand Up @@ -321,7 +321,7 @@ Indexing
- Bug in indexing non-scalar value from ``Series`` having non-unique ``Index`` will return value flattened (:issue:`17610`)
- Bug in :func:`DatetimeIndex.insert` where inserting ``NaT`` into a timezone-aware index incorrectly raised (:issue:`16357`)
- Bug in ``__setitem__`` when indexing a :class:`DataFrame` with a 2-d boolean ndarray (:issue:`18582`)

- Bug in :func:`MultiIndex.__contains__` where non-tuple keys would return ``True`` even if they had been dropped (:issue:`19027`)

I/O
^^^
Expand Down
4 changes: 4 additions & 0 deletions pandas/core/indexes/multi.py
Original file line number Diff line number Diff line change
Expand Up @@ -2122,6 +2122,10 @@ def _maybe_to_slice(loc):

if not isinstance(key, tuple):
loc = self._get_level_indexer(key, level=0)
# _get_level_indexer returns an empty slice if the key has
# been dropped from the MultiIndex
if isinstance(loc, slice) and loc.start == loc.stop:
raise KeyError(key)
return _maybe_to_slice(loc)

keylen = len(key)
Expand Down
6 changes: 5 additions & 1 deletion pandas/core/reshape/reshape.py
Original file line number Diff line number Diff line change
Expand Up @@ -628,7 +628,11 @@ def _convert_level_number(level_num, columns):
levsize = len(level_labels)
drop_cols = []
for key in unique_groups:
loc = this.columns.get_loc(key)
try:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what test exercises this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

def test_stack_order_with_unsorted_levels(self):

def test_stack_mixed_level(self):

def test_stack_partial_multiIndex(self):

These were the 3. And it looks like Travis doesn't have permalinks to specific lines like GitHub?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also just took a look at the test that caused most of the builds to fail:

del df['A']
assert len(df.columns) == 2
# A still in the levels, BUT get a KeyError if trying
# to delete
assert ('A', ) not in df.columns
with pytest.raises(KeyError):
del df[('A',)]
# xref: https://github.com/pandas-dev/pandas/issues/2770
# the 'A' is STILL in the columns!
assert 'A' in df.columns

and we just changed this so I'll go ahead and negate the assert.

loc = this.columns.get_loc(key)
except KeyError:
drop_cols.append(key)
continue

# can make more efficient?
# we almost always return a slice
Expand Down
18 changes: 18 additions & 0 deletions pandas/tests/indexing/test_multiindex.py
Original file line number Diff line number Diff line change
Expand Up @@ -705,6 +705,24 @@ def test_multiindex_symmetric_difference(self):
result = idx ^ idx2
assert result.names == [None, None]

def test_multiindex_contains_dropped(self):
# GH 19027
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment

idx = MultiIndex.from_product([[1, 2], [3, 4]])
assert 2 in idx
idx = idx.drop(2)

# drop implementation keeps 2 in the levels
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blank line before the comment

assert 2 in idx.levels[0]
# but it should no longer be in the index itself
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you test with a non-integer MI as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment made me realize that I had focused onto integers specifically when this should apply to all hashable keys (sorry). But, that would mean tuples would then be treated differently from how they currently are, e.g.:

In [88]: idx = pd.MultiIndex.from_product([[(1, 2), (2, 3)], [(3, 4), (4, 5)]])

In [89]: ((1, 2), (3, 4)) in idx
Out[89]: True

In [90]: (1, 2) in idx
Out[90]: False

In [91]: ((1, 2),) in idx
Out[91]: True

So, should it assume that a nested tuple like in In [89] should be treated like tuples are currently treated but a non-nested tuple like in In [90] should be treated like strings and ints?

Copy link
Contributor Author

@databasedav databasedav Jan 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Off the same vein, idx.drop((1, 2)) raises a KeyError and might warrant a separate issue since one would expect similar behavior to MultiIndex.from_product([[1, 2], [3, 4]]).drop(2), right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment made me realize that I had focused onto integers specifically when this should apply to all hashable keys (sorry).

I think the idea should just be: any key k which is not already a tuple must become one ((k,)). The result will be, I think, consistent with current behavior (Out[89]:, Out[90]:, Out[91]:), and you don't even need to check if key can be hashed - if it can't, it will just fail later on.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Off the same vein, idx.drop((1, 2)) raises a KeyError and might warrant a separate issue

Sure, in principle we could try with ((1,2),) if (1,2) is not found. However, I personally don't think it's a good idea, as it introduces ambiguity (we should be clear to users that a tuple in a MultiIndex is first and foremost a key spanning across levels) and increases complexity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, thanks!
Doing it by simply turning non-tuples into tuples like so:

if not isinstance(key, tuple):
    try:
        return self.get_loc((key,))
    except (LookupError, TypeError):
        raise KeyError(key)

produces a performance warning for slightly larger multi-indeces:

In [31]: idx = pd.MultiIndex.from_product([[1, 2], [3, 4]])

In [32]: 2 in idx
Out[32]: True

In [33]: idx = pd.MultiIndex.from_product([[2, 1, 2], [3, 4]])

In [34]: 2 in idx
/home/avi/anaconda3/envs/pandas-dev/lib/python3.6/site-packages/IPython/terminal/ipapp.py:356: PerformanceWarning: indexing past lexsort depth may impact performance.
  self.shell.mainloop()
Out[34]: True

Or we could use the fact that a key that has been dropped from a multi-index will return an empty slice when fed into the index's _get_level_indexer like so:

if not isinstance(key, tuple):
    loc = self._get_level_indexer(key, level=0)
    if isinstance(loc, slice) and loc.start == loc.stop:
        raise KeyError(key)
    return _maybe_to_slice(loc)

which does not produce the performance warnings and is ~3 times faster and simply adds to what it is right now:

if not isinstance(key, tuple):
loc = self._get_level_indexer(key, level=0)
return _maybe_to_slice(loc)

but is potentially less clear (and more hacky?). Which one should I use?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up using the faster one and added a comment to explain what's going on.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think it makes sense

assert 2 not in idx

# also applies to strings
idx = MultiIndex.from_product([['a', 'b'], ['c', 'd']])
assert 'a' in idx
idx = idx.drop('a')
assert 'a' in idx.levels[0]
assert 'a' not in idx


class TestMultiIndexSlicers(object):

Expand Down