CLN: get_flattened_iterator #35515

mroeschke · 2020-08-02T07:07:45Z

tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff

Refactor to just a single function

…_iterator

jbrockmendel · 2020-08-02T20:56:10Z

pandas/core/sorting.py

-    # provide "flattened" iterator for multi-group setting
-    mapper = _KeyMapper(comp_ids, ngroups, levels, labels)
-    return [mapper.get_key(i) for i in range(ngroups)]
+def get_flattened_iterator(


Good cleanup here, but how about making this get_flattened_list so the calling function doesn't need to cast the result?
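For readers following along, here is a minimal pure-Python sketch of what a list-returning version does. It is not the pandas implementation: the name `flattened_keys` is hypothetical, and a plain `dict` stands in for the hash table used in the diff.

```python
from typing import Dict, List, Sequence, Tuple


def flattened_keys(
    comp_ids: Sequence[int],
    ngroups: int,
    levels: Sequence[Sequence],
    labels: Sequence[Sequence[int]],
) -> List[Tuple]:
    """Map each compressed group id to its tuple of level values."""
    tables: List[Dict[int, int]] = []
    for labs in labels:
        # Plain-dict stand-in for the hash table: compressed id -> label code.
        tables.append(dict(zip(comp_ids, labs)))
    # Build the list here so callers never need list(...) around the result.
    return [
        tuple(level[table[i]] for table, level in zip(tables, levels))
        for i in range(ngroups)
    ]


keys = flattened_keys(
    comp_ids=[0, 1, 1, 2],
    ngroups=3,
    levels=[["a", "b"], ["x", "y"]],
    labels=[[0, 0, 0, 1], [0, 1, 1, 0]],
)
print(keys)
```

Because the function returns a list, the caller can index or measure the result directly instead of wrapping an iterator.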

…_iterator

jbrockmendel · 2020-08-04T18:59:36Z

LGTM

WillAyd · 2020-08-04T23:04:27Z

pandas/core/sorting.py

+        table.map(comp_ids, labs.astype(np.int64, copy=False))
+        tables.append(table)
+    return [
+        tuple(level[table.get_item(i)] for table, level in zip(tables, levels))


Is there any performance difference in creating an intermediary list to store the result of zip(tables, levels) rather than doing it on the fly in each iteration here?

Good point. That eliminates a loop iteration and the need to store these table objects.
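One way to read this exchange is sketched below: accumulate each group's values while walking the levels once, so no list of tables is kept. It is a pure-Python illustration, not the code from the PR; `flattened_keys_single_pass` is a hypothetical name and a `dict` again stands in for the hash table.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def flattened_keys_single_pass(
    comp_ids: Sequence[int],
    ngroups: int,
    levels: Sequence[Sequence],
    labels: Sequence[Sequence[int]],
) -> List[Tuple]:
    """Collect each group's key parts while iterating over the levels once."""
    arrays: Dict[int, list] = defaultdict(list)
    for labs, level in zip(labels, levels):
        # The per-level table is used immediately and then discarded.
        table = dict(zip(comp_ids, labs))
        for i in range(ngroups):
            arrays[i].append(level[table[i]])
    return [tuple(arrays[i]) for i in range(ngroups)]


result = flattened_keys_single_pass(
    comp_ids=[0, 1, 1, 2],
    ngroups=3,
    levels=[["a", "b"], ["x", "y"]],
    labels=[[0, 0, 0, 1], [0, 1, 1, 0]],
)
print(result)
```

The trade-off is one small list per group in exchange for not holding every table until the end.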

jreback · 2020-08-05T02:09:58Z

pandas/core/sorting.py

+    tables = []
+    for labs, level in zip(labels, levels):
+        table = hashtable.Int64HashTable(ngroups)
+        table.map(comp_ids, labs.astype(np.int64, copy=False))


you could make this a list-comprehension, maybe it would be slightly less readable though
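A sketch of the readability trade-off, assuming (as in the diff) that a table is constructed and then populated in a separate step. A comprehension then needs a small helper; `build_table` is hypothetical and uses a `dict` rather than the real hash table.

```python
from typing import Dict, List, Sequence


def build_table(comp_ids: Sequence[int], labs: Sequence[int]) -> Dict[int, int]:
    """Construct and populate one table in a single call."""
    table: Dict[int, int] = {}
    table.update(zip(comp_ids, labs))
    return table


comp_ids = [0, 1, 1, 2]
labels = [[0, 0, 0, 1], [0, 1, 1, 0]]

# Explicit loop, mirroring the shape of the diff.
tables_loop: List[Dict[int, int]] = []
for labs in labels:
    tables_loop.append(build_table(comp_ids, labs))

# Equivalent list comprehension; shorter, but the two steps hide in the helper.
tables_comp = [build_table(comp_ids, labs) for labs in labels]
```

Both forms build the same tables, so the choice is mostly about which one reads better in context.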

…_iterator

WillAyd · 2020-08-05T15:14:28Z

pandas/core/sorting.py

+) -> List[Tuple]:
+    """Map compressed group id -> key tuple."""
+    comp_ids = comp_ids.astype(np.int64, copy=False)
+    arrays: Dict[int, List] = defaultdict(list)


Suggested change

-    arrays: Dict[int, List] = defaultdict(list)
+    arrays: DefaultDict[int, List] = defaultdict(list)

This exists in the typing module. Can the List be List[int]?

Can the List be List[int]?

#30539 (comment)
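A small standalone illustration of the suggested annotation, using toy integer data. Whether `List[int]` fits the real function depends on what the levels contain, which this sketch does not settle.

```python
from collections import defaultdict
from typing import DefaultDict, List

# typing.DefaultDict describes the actual runtime type more precisely than Dict.
arrays: DefaultDict[int, List[int]] = defaultdict(list)

for group_id, value in [(0, 10), (1, 20), (0, 30)]:
    # Missing keys are created on first access with an empty list.
    arrays[group_id].append(value)

print(dict(arrays))
```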

simonjayhawkins · 2020-08-06T12:54:33Z

pandas/core/sorting.py

-    mapper = _KeyMapper(comp_ids, ngroups, levels, labels)
-    return [mapper.get_key(i) for i in range(ngroups)]
+def get_flattened_list(
+    comp_ids: np.ndarray, ngroups: int, levels: List["Index"], labels: List[np.ndarray]


from https://github.com/python/typeshed/blob/master/CONTRIBUTING.md

avoid invariant collection types (List, Dict) in argument positions, in favor of covariant types like Mapping or Sequence;

in this case both levels and labels are used in zip, so Iterable is more permissive.
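A brief sketch of why `Iterable` is the more permissive choice when the arguments are only consumed by `zip`. The function `pair_up` and its data are hypothetical and unrelated to the pandas code.

```python
from typing import Iterable, List, Sequence, Tuple


def pair_up(
    levels: Iterable[Sequence[str]], labels: Iterable[Sequence[int]]
) -> List[Tuple[str, ...]]:
    """Look up each label code in its level; inputs are only iterated once."""
    return [tuple(level[code] for code in labs) for labs, level in zip(labels, levels)]


# Lists, tuples, and generators are all accepted by the Iterable annotation.
from_lists = pair_up([["a", "b"]], [[0, 1]])
from_tuples = pair_up((["a", "b"],), ([1, 0],))
from_generators = pair_up((lv for lv in [["a", "b"]]), (lb for lb in [[1, 1]]))
```

With `List[...]` in the signature, a type checker would reject the tuple and generator calls even though they work at runtime.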

jreback · 2020-08-06T15:47:29Z

Can you also run the groupby asvs (in particular the ones that exercise the full int64 cardinality) to ensure there are no perf diffs?

…_iterator

mroeschke · 2020-08-06T17:10:38Z

This small benchmark shows no performance degradation:

In [1]: df = pd.DataFrame({'a': list(range(1000)) * 2, 'b': list(range(1000)) * 2})

In [4]: %timeit df.groupby('a').sum()
1.06 ms ± 74.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) <--PR


In [4]: %timeit df.groupby('a').sum()
1.1 ms ± 71.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) <-- master

I'm getting this error when trying to run the asvs locally:

STDERR -------->
   /Users/matthewroeschke/pandas-mroeschke/asv_bench/env/a36d556a9d1611aba4092dc48036d261/lib/python3.6/site-packages/setuptools/distutils_patch.py:26: UserWarning: Distutils was imported before Setuptools. This usage is discouraged and may exhibit undesirable behaviors or errors. Please use Setuptools' objects directly or at least import Setuptools first.
     "Distutils was imported before Setuptools. This usage is discouraged "
   warning: pandas/_libs/groupby.pyx:1130:26: Unreachable code
   pandas/_libs/parsers.c:49607:52: error: code will never be executed [-Werror,-Wunreachable-code]
           } else if (PY_VERSION_HEX >= 0x030700A0 && flag == (METH_FASTCALL | METH_KEYWORDS)) {
                                                      ^~~~
   pandas/_libs/parsers.c:49607:38: note: silence by adding parentheses to mark code as explicitly dead
           } else if (PY_VERSION_HEX >= 0x030700A0 && flag == (METH_FASTCALL | METH_KEYWORDS)) {
                                        ^
                                        /* DISABLES CODE */ ( )
   1 error generated.
   pandas/_libs/ops_dispatch.c:3877:52: error: code will never be executed [-Werror,-Wunreachable-code]
           } else if (PY_VERSION_HEX >= 0x030700A0 && flag == (METH_FASTCALL | METH_KEYWORDS)) {
                                                      ^~~~
   pandas/_libs/ops_dispatch.c:3877:38: note: silence by adding parentheses to mark code as explicitly dead
           } else if (PY_VERSION_HEX >= 0x030700A0 && flag == (METH_FASTCALL | METH_KEYWORDS)) {
                                        ^
                                        /* DISABLES CODE */ ( )
   1 error generated.
   pandas/_libs/tslibs/offsets.c:89935:52: error: code will never be executed [-Werror,-Wunreachable-code]
           } else if (PY_VERSION_HEX >= 0x030700A0 && flag == (METH_FASTCALL | METH_KEYWORDS)) {
                                                      ^~~~
   pandas/_libs/tslibs/offsets.c:89935:38: note: silence by adding parentheses to mark code as explicitly dead
           } else if (PY_VERSION_HEX >= 0x030700A0 && flag == (METH_FASTCALL | METH_KEYWORDS)) {
                                        ^
                                        /* DISABLES CODE */ ( )
   1 error generated.
   pandas/_libs/tslibs/parsing.c:39517:52: error: code will never be executed [-Werror,-Wunreachable-code]
           } else if (PY_VERSION_HEX >= 0x030700A0 && flag == (METH_FASTCALL | METH_KEYWORDS)) {
                                                      ^~~~
   pandas/_libs/tslibs/parsing.c:39517:38: note: silence by adding parentheses to mark code as explicitly dead
           } else if (PY_VERSION_HEX >= 0x030700A0 && flag == (METH_FASTCALL | METH_KEYWORDS)) {
                                        ^
                                        /* DISABLES CODE */ ( )
   1 error generated.
   pandas/_libs/tslibs/period.c:44813:52: error: code will never be executed [-Werror,-Wunreachable-code]
           } else if (PY_VERSION_HEX >= 0x030700A0 && flag == (METH_FASTCALL | METH_KEYWORDS)) {
                                                      ^~~~
   pandas/_libs/tslibs/period.c:44813:38: note: silence by adding parentheses to mark code as explicitly dead
           } else if (PY_VERSION_HEX >= 0x030700A0 && flag == (METH_FASTCALL | METH_KEYWORDS)) {
                                        ^
                                        /* DISABLES CODE */ ( )
   1 error generated.
   pandas/_libs/tslibs/timedeltas.c:43722:52: error: code will never be executed [-Werror,-Wunreachable-code]
           } else if (PY_VERSION_HEX >= 0x030700A0 && flag == (METH_FASTCALL | METH_KEYWORDS)) {
                                                      ^~~~
   pandas/_libs/tslibs/timedeltas.c:43722:38: note: silence by adding parentheses to mark code as explicitly dead
           } else if (PY_VERSION_HEX >= 0x030700A0 && flag == (METH_FASTCALL | METH_KEYWORDS)) {
                                        ^
                                        /* DISABLES CODE */ ( )
   1 error generated.
   pandas/_libs/tslibs/timestamps.c:41544:52: error: code will never be executed [-Werror,-Wunreachable-code]
           } else if (PY_VERSION_HEX >= 0x030700A0 && flag == (METH_FASTCALL | METH_KEYWORDS)) {
                                                      ^~~~
   pandas/_libs/tslibs/timestamps.c:41544:38: note: silence by adding parentheses to mark code as explicitly dead
           } else if (PY_VERSION_HEX >= 0x030700A0 && flag == (METH_FASTCALL | METH_KEYWORDS)) {
                                        ^
                                        /* DISABLES CODE */ ( )
   1 error generated.
   error: command 'gcc' failed with exit status 1

·· Failed to build the project and import the benchmark suite.

WillAyd · 2020-08-06T17:13:04Z

If on a Mac I think you need to change the Python version in the ASV conf for it to compile. I think there’s an open issue to update that


…_iterator

jreback · 2020-08-06T21:51:27Z

thanks @mroeschke

Matt Roeschke added 2 commits August 1, 2020 23:27

CLN: Simplify get_flattened_iterator

2f5c450

Yield correct value

071378d

mroeschke added the Clean label Aug 2, 2020

mroeschke added this to the 1.2 milestone Aug 2, 2020

Matt Roeschke added 2 commits August 2, 2020 00:36

isort

b3af159

Merge remote-tracking branch 'upstream/master' into cln/get_flattened…

1c581e6

…_iterator

jbrockmendel reviewed Aug 2, 2020


Matt Roeschke added 3 commits August 2, 2020 14:34

Rename to get_flattened_list

1cb2fed

Merge remote-tracking branch 'upstream/master' into cln/get_flattened…

dd8263c

…_iterator

Merge remote-tracking branch 'upstream/master' into cln/get_flattened…

a09af1b

…_iterator

WillAyd reviewed Aug 4, 2020


jreback reviewed Aug 5, 2020


Matt Roeschke added 3 commits August 4, 2020 21:01

store intermediate arrays for performance

52b938c

Merge remote-tracking branch 'upstream/master' into cln/get_flattened…

775ea23

…_iterator

typing

f889efd

WillAyd reviewed Aug 5, 2020


Matt Roeschke added 4 commits August 5, 2020 09:45

Add better typing

8e83f5a

type levels

0ce1136

Change noqa code

4584c6d

Add another noqa

68273ab

simonjayhawkins reviewed Aug 6, 2020


Matt Roeschke added 2 commits August 6, 2020 09:21

use iterable instead of list

7793b9d

Merge remote-tracking branch 'upstream/master' into cln/get_flattened…

58ddb7e

…_iterator

Merge remote-tracking branch 'upstream/master' into cln/get_flattened…

eb2be1b

…_iterator

jreback merged commit 25ff715 into pandas-dev:master Aug 6, 2020

mroeschke deleted the cln/get_flattened_iterator branch August 6, 2020 23:00
