BUG: groupby with dropna=False drops nulls from categorical groupers #49652

Merged
merged 22 commits into from
Dec 2, 2022

Conversation

rhshadrach
Member

The function nunique_ints here is faster than using pd.unique, but only works on integers:

arr = np.random.randint(0, 300, size)
%timeit len(unique(arr))
%timeit nunique_ints(arr)

size: 10
8.94 µs ± 133 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
2.94 µs ± 12.2 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

size: 100
9.82 µs ± 22.2 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
3.01 µs ± 5.16 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

size: 1000
11 µs ± 76.5 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
3.74 µs ± 9.7 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

size: 10000
27.9 µs ± 66.8 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
10.4 µs ± 9.43 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

size: 100000
187 µs ± 214 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
78.8 µs ± 408 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
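
The benchmark above does not show how an integer-only count can beat a general-purpose unique. As a rough sketch only (not necessarily what nunique_ints does in this PR), one way to count distinct non-negative integers is to tally them with np.bincount and count the nonzero bins. The helper name count_unique_ints_sketch is hypothetical, and the sketch assumes the values are non-negative and reasonably small, as group codes typically are:

```python
import numpy as np


def count_unique_ints_sketch(values: np.ndarray) -> int:
    """Count distinct values in an array of non-negative integers.

    Hypothetical helper for illustration; not the pandas implementation.
    """
    if values.size == 0:
        return 0
    # np.bincount tallies occurrences of each integer in 0..max(values);
    # every nonzero bin corresponds to exactly one distinct value.
    counts = np.bincount(values.ravel().astype(np.intp))
    return int(np.count_nonzero(counts))


rng = np.random.default_rng(0)
arr = rng.integers(0, 300, size=10_000)

# The sketch agrees with the general-purpose approach on this input.
assert count_unique_ints_sketch(arr) == len(np.unique(arr))
```

This trades memory proportional to the largest value for a single linear pass with no hashing or sorting, so it only makes sense when the integers are bounded and non-negative.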

@rhshadrach rhshadrach added the Groupby, Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate), and Categorical (Categorical Data Type) labels Nov 11, 2022
Member

@mroeschke mroeschke left a comment

Merge conflict, otherwise LGTM

@mroeschke mroeschke added this to the 2.0 milestone Dec 2, 2022
@mroeschke mroeschke merged commit 60ed993 into pandas-dev:main Dec 2, 2022
@mroeschke
Member

Thanks @rhshadrach

@rhshadrach rhshadrach deleted the groupby_dropna_filtering branch December 2, 2022 20:09
Development

Successfully merging this pull request may close these issues.

BUG: Get wrong result when groupby category column with dropna=False