Performance: .unique / .nunique of categorical series slow on large data set #12278
Comments
pls pd.show_versions() |
Sorry @jreback more details have been added above |
so,
so if you want |
I suppose it is possible to add an option (to all |
I didn't realise that, I thought they were equivalent. Thanks |
It would be good to clarify this in the docs I think:
|
ok, let's repurpose this then. @joshlk want to do a PR? |
Can I get this one? |
sure |
Thanks for adding the clarity to the docs |
I just came across this, and my example also showed 2-3 orders of magnitude performance increase, which:
I'd take a somewhat different approach so that no API changes are required, and the win is 'free' in the 'standard' use cases (i.e. something like Note:
|
@joshlk such a 'flag' would need to check that all categories are used, that there are no missing values, and that the data is sorted (so it is also quite costly to compute), and it would have to be invalidated in many operations (e.g. those that sort or select data). So I don't think something like this would be worth it (certainly from a code maintenance perspective). |
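To make the three conditions concrete, here is a minimal sketch of how the categories and the unique values can disagree. The data is made up for illustration, and it assumes the shortcut under discussion is reading `.cat.categories` instead of calling `.unique()` / `.nunique()`.

```python
import pandas as pd

# Hypothetical data: one missing value, values not in sorted order.
s = pd.Series(["b", "a", None, "b"], dtype="category")
# Declare an extra category that never occurs in the data.
s = s.cat.add_categories(["z"])

uniques = list(s.unique())            # values in order of appearance, NaN included
categories = list(s.cat.categories)   # declared categories, unused "z" included, no NaN

n_unique = s.nunique()                # 2: missing values are dropped by default
n_categories = len(s.cat.categories)  # 3: the unused "z" is counted too

print(uniques, categories, n_unique, n_categories)
```

Only when every category is used, nothing is missing, and the order of appearance happens to match the category order do the two give the same answer.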
I was thinking that the only way to change categories was with a few function calls, and hence it could be easy with e.g. a decorator - but you're right, subsetting data with a select would also cause problems. If there are too many more possibilities, then it wouldn't be worth it.
I don't see why you need to check. E.g. if categories are created from "astype('category')" then they're all used. Any action which could change the categories used would flip the flag, so you wouldn't use the 'quick' method anyway.
Do you mean
Are the results of
If it's going to be too painful, then no, it's probably not worth it. Then again, if it's not too painful, it offers some pretty worthwhile performance gains in a large proportion of uses. |
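A small sketch of the two points above, using made-up data: right after `astype('category')` every category is in use, but a plain boolean selection leaves a stale category behind, which is the kind of operation that would have to invalidate any such flag.

```python
import pandas as pd

# Hypothetical example data.
s = pd.Series(["x", "y", "z", "x"]).astype("category")

# Straight after astype("category"), categories and unique values coincide.
all_used = set(s.cat.categories) == set(s.unique())

# Selecting rows keeps the full category list, so "z" is now unused.
subset = s[s != "z"]
stale = list(subset.cat.categories)   # still lists "z"
present = subset.nunique()            # counts only values actually present

# Dropping unused categories explicitly brings the two back in line.
trimmed = subset.cat.remove_unused_categories()

print(all_used, stale, present, list(trimmed.cat.categories))
```

After `remove_unused_categories()` the category list matches the values present again, at the cost of a pass over the data.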
I've noticed that Series.unique and Series.nunique, when used with a categorical series, can be slow on large datasets. Presumably they are not utilising the shortcuts:
Here's an example in IPython:
It's significantly slower, indicating it's not using the above shortcut.
pd.show_versions()