Skip to content

Bug in DataFrame.drop_duplicates for empty DataFrame throws error #22394

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Aug 23, 2018
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v0.24.0.txt
Original file line number Diff line number Diff line change
Expand Up @@ -711,7 +711,7 @@ Reshaping
- Bug in :func:`get_dummies` with Unicode attributes in Python 2 (:issue:`22084`)
- Bug in :meth:`DataFrame.replace` raises ``RecursionError`` when replacing empty lists (:issue:`22083`)
- Bug in :meth:`Series.replace` and meth:`DataFrame.replace` when dict is used as the `to_replace` value and one key in the dict is is another key's value, the results were inconsistent between using integer key and using string key (:issue:`20656`)
-
- Bug in :meth:`DataFrame.drop_duplicates` for empty ``DataFrame`` which incorrectly raises an error (:issue:`20516`)

Build Changes
^^^^^^^^^^^^^
Expand Down
6 changes: 6 additions & 0 deletions pandas/core/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -4335,6 +4335,9 @@ def drop_duplicates(self, subset=None, keep='first', inplace=False):
-------
deduplicated : DataFrame
"""
if self.empty:
return self.copy()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of this, can you try to replace the last line return self[-duplicated] by return self.iloc[-duplicated], for the reasons I mentioned earlier?

Copy link
Contributor Author

@HyunTruth HyunTruth Aug 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then wouldn't index be lost in the case of DataFrame(index=['A', 'B', 'C'])? As it will be self.iloc[[]]?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we use self.iloc[~duplicated], as you said, we will be able to maintain the columns if we have 4 columns and 0 rows. However, if we have 0 columns and 4 rows, we have the same problem - the outcome is self.iloc[[]], which means that the rows won't be selected, for the same reason with using self[[]] for columns, thus ending up with 0 columns and 0 rows, once again.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure whether it'd make sense for a dataframe with rows but no columns to drop the empty rows as duplicate.

But as @jreback if happy with this implementation, just leave it like this. I didn't see his comment earlier.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried a test, and here's the result:

import pandas as pd

a = pd.DataFrame(index=['A', 'B'])
b = a.iloc[[]]
a.shape # (2, 0)
b.shape # (0, 0)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that makes sense to me. If we consider that empty rows are equal among them, then pd.DataFrame(index=['A', 'B']).duplicated() would return False, True, and .iloc[-duplicated] would return the first row.

But as I said, I'm happy to keep the original DataFrame as is for this case, as @jreback is happy with it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, if you see the empty rows as equals, then it does make sense. I haven't thought of it that way. Thanks.


inplace = validate_bool_kwarg(inplace, 'inplace')
duplicated = self.duplicated(subset, keep=keep)

Expand Down Expand Up @@ -4369,6 +4372,9 @@ def duplicated(self, subset=None, keep='first'):
from pandas.core.sorting import get_group_index
from pandas._libs.hashtable import duplicated_int64, _SIZE_HINT_LIMIT

if self.empty:
return Series()

def f(vals):
labels, shape = algorithms.factorize(
vals, size_hint=min(len(self), _SIZE_HINT_LIMIT))
Expand Down
17 changes: 17 additions & 0 deletions pandas/tests/frame/test_duplicates.py
Original file line number Diff line number Diff line change
Expand Up @@ -263,6 +263,23 @@ def test_drop_duplicates_tuple():
tm.assert_frame_equal(result, expected)


@pytest.mark.parametrize('df', [
DataFrame(),
DataFrame(columns=[]),
DataFrame(columns=['A', 'B', 'C']),
DataFrame(index=[]),
DataFrame(index=['A', 'B', 'C'])
])
def test_drop_duplicates_empty(df):
# GH 20516
result = df.drop_duplicates()
tm.assert_frame_equal(result, df)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a test case with inplace=True please?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.


result = df.copy()
result.drop_duplicates(inplace=True)
tm.assert_frame_equal(result, df)


def test_drop_duplicates_NA():
# none
df = DataFrame({'A': [None, None, 'foo', 'bar',
Expand Down