Performance: .unique / .nunique of categorical series slow on large data set #12278

joshlk · 2016-02-10T10:36:56Z

I've noticed that Series.unique and Series.nunique can be slow when used on a large categorical series. Presumably they're not utilising the shortcuts:

Series.unique = Series.cat.categories
Series.nunique = len(Series.cat.categories)

Here's an example in IPython:

s = pd.Series(np.random.choice(['a','b','c'], 100000000)).astype('category')
%time s.nunique()
896 ms
%time len(s.cat.categories)
55.1 µs

It's significantly slower, indicating it's not using the above shortcut.

pd.show_versions()

pandas: 0.17.1
Cython: 0.23.4
numpy: 1.10.1
IPython: 4.0.1

jreback · 2016-02-10T12:12:49Z

pls pd.show_versions()
and an example

joshlk · 2016-02-10T15:21:24Z

Sorry @jreback more details have been added above

jreback · 2016-02-10T15:29:17Z

so, .unique on a categorical has a couple of guarantees, namely that it is in the order of appearance, and it only includes values that are actually present, e.g.

In [3]: s = Series(list('babc')).astype('category',categories=list('abcd'))

In [4]: s.unique()
Out[4]: 
[b, a, c]
Categories (3, object): [b, a, c]

In [6]: s.cat.categories
Out[6]: Index([u'a', u'b', u'c', u'd'], dtype='object')

so if you want .categories, then just use that.
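For readers on later pandas releases, where the categories are passed through a CategoricalDtype rather than as a keyword to astype, the same contrast can be sketched as follows. This is an illustration added alongside the comment, not part of the original session, and the variable names are arbitrary.

```python
import pandas as pd

# Declare four categories, but only use three of them in the data.
dtype = pd.CategoricalDtype(categories=list("abcd"))
s = pd.Series(list("babc"), dtype=dtype)

# Values actually present, in order of first appearance.
present = list(s.unique())

# Every declared category, whether or not it occurs in the data.
declared = list(s.cat.categories)

print(present)
print(declared)
print(s.nunique(), len(declared))
```

Here .nunique() counts only the values that occur, so it differs from len(s.cat.categories) whenever a declared category is unused.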

jreback · 2016-02-10T15:30:46Z

I suppose it is possible to add an option (to all .unique), not just this one, to do some sort of quick uniqueness like this, but not sure of the utility beyond categoricals.

joshlk · 2016-02-10T15:31:48Z

I didn't realise that, I thought they were equivalent. Thanks

jorisvandenbossche · 2016-02-10T16:00:07Z

It would be good to clarify this in the docs I think:

add a note to the docstring of categories that this includes all categories and not only those present in the data
add the above example (#12278 (comment)) to the tutorial docs categorical.rst

jreback · 2016-02-10T16:02:13Z

ok, let's repurpose this then. @joshlk want to do a PR?

Dorozhko-Anton · 2016-02-22T05:03:34Z

Can I get this one?

jreback · 2016-02-22T05:06:27Z

sure

joshlk · 2016-02-23T15:34:58Z

Thanks for adding the clarity to the docs

kodonnell · 2017-11-30T19:08:22Z

I just came across this, and my example also showed 2-3 orders of magnitude performance increase, which:

is pretty sweet for the user, and
probably could be extended beyond unique (e.g. in grouping etc.)

I suppose it is possible to add an option (to all .unique), not just this one, to do some sort of quick uniqueness like this, but not sure of the utility beyond categoricals.

I'd take a somewhat different approach so that no API changes are required, and the win is 'free' in the 'standard' use cases (i.e. something like df['col1'] = df.col1.astype('category'), without manually specifying unused categories). That is, my naive gut feeling is that it shouldn't be too hard to add a flag to a categorical column for whether or not it's amenable to quick uniquing. Then all that's required is changing unique to detect 'if column is categorical and is amenable to quick uniquing, then use the quick method'. No change that I can see for the user (in API or the result) except much better performance. Thoughts @jreback ?

Note:

would need to check for np.nan - could use .cat.codes for that as opposed to pd.isnull(...).any() or similar.
(reiterating) the 'amenable to quick uniquing' flag may be useful in other functions for performance gains, e.g. grouping.

jorisvandenbossche · 2017-11-30T21:06:46Z

@joshlk such a 'flag' would need to check that all categories are used, that there are no missing values, and that it is sorted (so also quite costly to compute), and it would have to be invalidated in many operations (e.g. those that do sorting, selecting data, etc). So I don't think something like this would be worth it (certainly from a code maintenance perspective).
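The three conditions listed above (all categories used, no missing values, order of appearance matching the category order) can be sketched as a check over the integer codes. Note that categories_match_unique is a hypothetical helper written for illustration, not a pandas function, and that it needs a full pass over the data, which is the cost being referred to.

```python
import numpy as np
import pandas as pd


def categories_match_unique(s):
    """Hypothetical check: would s.unique() list the same values, in the
    same order, as s.cat.categories? Not part of pandas."""
    codes = s.cat.codes.to_numpy()

    # Missing values are stored as code -1 and never appear in categories.
    if (codes == -1).any():
        return False

    # Distinct codes present, plus the position where each first occurs.
    seen, first_pos = np.unique(codes, return_index=True)

    # Every declared category must actually be used.
    if len(seen) != len(s.cat.categories):
        return False

    # unique() is in order of appearance, so the first occurrences must
    # already follow the category order.
    return bool(np.all(np.diff(first_pos) > 0))


ordered = pd.Series(list("abca"), dtype="category")
shuffled = pd.Series(list("babc"), dtype="category")
print(categories_match_unique(ordered))
print(categories_match_unique(shuffled))
```

When the helper returns True, list(s.unique()) and list(s.cat.categories) agree; in every other case the shortcut would change the result.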

kodonnell · 2017-12-01T00:34:55Z

I was thinking that the only way to change categories was with a few function calls, and hence it could be easy with e.g. a decorator - but you're right, subsetting data with a select would also cause problems. If there are too many more possibilities, then it wouldn't be worth it.

check ... that all categories are used

I don't see why you need to check. E.g. if categories are created from "astype('category')" then they're all used. Any action which could change the categories used would flip the flag, so you wouldn't use the 'quick' method anyway.

check ... that there are no missing values

Do you mean np.nan? As above, isn't that an easy fix?

check ... that it is sorted

Are the results of unique required to be sorted? If so, couldn't you sort them afterward (i.e. quite quickly)?

So I don't think something like this would be worth it (certainly from a code maintenance perspective).

If it's going to be too painful, then no, it's probably not worth it. Then again, if it's not too painful, it offers some pretty worthwhile performance gains in a large proportion of uses.
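As a small illustration of the sorting and np.nan questions raised in this comment, assuming a recent pandas release: astype('category') infers sorted categories, .unique() keeps order of appearance, and a missing value shows up in .unique() and as code -1 but not in the categories. The variable names are arbitrary.

```python
import pandas as pd

# Categories inferred by astype are sorted; unique() is not.
s = pd.Series(["c", "a", "b", "c"]).astype("category")
print(list(s.cat.categories))
print(list(s.unique()))

# With no missing values and no unused categories, sorting the unique
# values recovers the categories.
print(sorted(s.unique()) == list(s.cat.categories))

# A missing value is part of unique() but is never a category.
s_nan = pd.Series(["c", None, "a"]).astype("category")
print(len(s_nan.unique()), len(s_nan.cat.categories))

# The codes expose missing values as -1 without touching the values.
print(bool((s_nan.cat.codes == -1).any()))
```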

jreback closed this as completed Feb 10, 2016

jreback added the Performance (memory or execution speed performance) and Categorical (categorical data type) labels Feb 10, 2016

jreback reopened this Feb 10, 2016

jreback added Docs Difficulty Novice labels Feb 10, 2016

jreback added this to the 0.18.0 milestone Feb 10, 2016

jreback changed the title ~~Performance: unique and nunique of categorical series slow on large data set~~ Performance: .unique / .nunique of categorical series slow on large data set Feb 10, 2016

jreback modified the milestones: 0.18.1, 0.18.0 Feb 12, 2016

Dorozhko-Anton mentioned this issue Feb 22, 2016

Notes about unique vs cat.categories #12278 #12421

Closed

4 tasks

jreback modified the milestones: 0.18.0, 0.18.1 Feb 22, 2016

jreback closed this as completed in 6b544de Feb 23, 2016
