Skip to content

CLN: get_flattened_iterator #35515

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 17 commits into from
Aug 6, 2020

Conversation

mroeschke
Copy link
Member

  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff

Refactor to just a single function

@mroeschke mroeschke added the Clean label Aug 2, 2020
@mroeschke mroeschke added this to the 1.2 milestone Aug 2, 2020
# provide "flattened" iterator for multi-group setting
mapper = _KeyMapper(comp_ids, ngroups, levels, labels)
return [mapper.get_key(i) for i in range(ngroups)]
def get_flattened_iterator(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good cleanup here, but how about make this get_flattened_list and not need to cast in the calling function?

@jbrockmendel
Copy link
Member

LGTM

table.map(comp_ids, labs.astype(np.int64, copy=False))
tables.append(table)
return [
tuple(level[table.get_item(i)] for table, level in zip(tables, levels))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any performance difference in creating an intermediary list to store the result of zip(tables, levels) rather than doing it only the fly in each iteration here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Eliminates a loop iteration and storing these table objects

tables = []
for labs, level in zip(labels, levels):
table = hashtable.Int64HashTable(ngroups)
table.map(comp_ids, labs.astype(np.int64, copy=False))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could make this a list-comprehension, maybe it would be slightly less readable though

) -> List[Tuple]:
"""Map compressed group id -> key tuple."""
comp_ids = comp_ids.astype(np.int64, copy=False)
arrays: Dict[int, List] = defaultdict(list)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
arrays: Dict[int, List] = defaultdict(list)
arrays: DefaultDict[int, List] = defaultdict(list)

This exists in the typing module. Can the List be List[int]?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the List be List[int]?

#30539 (comment)

mapper = _KeyMapper(comp_ids, ngroups, levels, labels)
return [mapper.get_key(i) for i in range(ngroups)]
def get_flattened_list(
comp_ids: np.ndarray, ngroups: int, levels: List["Index"], labels: List[np.ndarray]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from https://github.com/python/typeshed/blob/master/CONTRIBUTING.md

avoid invariant collection types (List, Dict) in argument positions, in favor of covariant types like Mapping or Sequence;

in this case both levels and labels are used in zip, so Iterable is more permissive.

@jreback
Copy link
Contributor

jreback commented Aug 6, 2020

can you also run the groupby asvs (in particular the ones that exercise the full int64 cardinaility) to assure no perf diffs.

@mroeschke
Copy link
Member Author

This small benchmark shows no performance degradation:

In [1]: df = pd.DataFrame({'a': list(range(1000)) * 2, 'b': list(range(1000)) * 2})

In [4]: %timeit df.groupby('a').sum()
1.06 ms ± 74.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) <--PR


In [4]: %timeit df.groupby('a').sum()
1.1 ms ± 71.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) <-- master

I'm getting this error when trying to run the asvs locally

STDERR -------->
   /Users/matthewroeschke/pandas-mroeschke/asv_bench/env/a36d556a9d1611aba4092dc48036d261/lib/python3.6/site-packages/setuptools/distutils_patch.py:26: UserWarning: Distutils was imported before Setuptools. This usage is discouraged and may exhibit undesirable behaviors or errors. Please use Setuptools' objects directly or at least import Setuptools first.
     "Distutils was imported before Setuptools. This usage is discouraged "
   warning: pandas/_libs/groupby.pyx:1130:26: Unreachable code
   pandas/_libs/parsers.c:49607:52: error: code will never be executed [-Werror,-Wunreachable-code]
           } else if (PY_VERSION_HEX >= 0x030700A0 && flag == (METH_FASTCALL | METH_KEYWORDS)) {
                                                      ^~~~
   pandas/_libs/parsers.c:49607:38: note: silence by adding parentheses to mark code as explicitly dead
           } else if (PY_VERSION_HEX >= 0x030700A0 && flag == (METH_FASTCALL | METH_KEYWORDS)) {
                                        ^
                                        /* DISABLES CODE */ ( )
   1 error generated.
   pandas/_libs/ops_dispatch.c:3877:52: error: code will never be executed [-Werror,-Wunreachable-code]
           } else if (PY_VERSION_HEX >= 0x030700A0 && flag == (METH_FASTCALL | METH_KEYWORDS)) {
                                                      ^~~~
   pandas/_libs/ops_dispatch.c:3877:38: note: silence by adding parentheses to mark code as explicitly dead
           } else if (PY_VERSION_HEX >= 0x030700A0 && flag == (METH_FASTCALL | METH_KEYWORDS)) {
                                        ^
                                        /* DISABLES CODE */ ( )
   1 error generated.
   pandas/_libs/tslibs/offsets.c:89935:52: error: code will never be executed [-Werror,-Wunreachable-code]
           } else if (PY_VERSION_HEX >= 0x030700A0 && flag == (METH_FASTCALL | METH_KEYWORDS)) {
                                                      ^~~~
   pandas/_libs/tslibs/offsets.c:89935:38: note: silence by adding parentheses to mark code as explicitly dead
           } else if (PY_VERSION_HEX >= 0x030700A0 && flag == (METH_FASTCALL | METH_KEYWORDS)) {
                                        ^
                                        /* DISABLES CODE */ ( )
   1 error generated.
   pandas/_libs/tslibs/parsing.c:39517:52: error: code will never be executed [-Werror,-Wunreachable-code]
           } else if (PY_VERSION_HEX >= 0x030700A0 && flag == (METH_FASTCALL | METH_KEYWORDS)) {
                                                      ^~~~
   pandas/_libs/tslibs/parsing.c:39517:38: note: silence by adding parentheses to mark code as explicitly dead
           } else if (PY_VERSION_HEX >= 0x030700A0 && flag == (METH_FASTCALL | METH_KEYWORDS)) {
                                        ^
                                        /* DISABLES CODE */ ( )
   1 error generated.
   pandas/_libs/tslibs/period.c:44813:52: error: code will never be executed [-Werror,-Wunreachable-code]
           } else if (PY_VERSION_HEX >= 0x030700A0 && flag == (METH_FASTCALL | METH_KEYWORDS)) {
                                                      ^~~~
   pandas/_libs/tslibs/period.c:44813:38: note: silence by adding parentheses to mark code as explicitly dead
           } else if (PY_VERSION_HEX >= 0x030700A0 && flag == (METH_FASTCALL | METH_KEYWORDS)) {
                                        ^
                                        /* DISABLES CODE */ ( )
   1 error generated.
   pandas/_libs/tslibs/timedeltas.c:43722:52: error: code will never be executed [-Werror,-Wunreachable-code]
           } else if (PY_VERSION_HEX >= 0x030700A0 && flag == (METH_FASTCALL | METH_KEYWORDS)) {
                                                      ^~~~
   pandas/_libs/tslibs/timedeltas.c:43722:38: note: silence by adding parentheses to mark code as explicitly dead
           } else if (PY_VERSION_HEX >= 0x030700A0 && flag == (METH_FASTCALL | METH_KEYWORDS)) {
                                        ^
                                        /* DISABLES CODE */ ( )
   1 error generated.
   pandas/_libs/tslibs/timestamps.c:41544:52: error: code will never be executed [-Werror,-Wunreachable-code]
           } else if (PY_VERSION_HEX >= 0x030700A0 && flag == (METH_FASTCALL | METH_KEYWORDS)) {
                                                      ^~~~
   pandas/_libs/tslibs/timestamps.c:41544:38: note: silence by adding parentheses to mark code as explicitly dead
           } else if (PY_VERSION_HEX >= 0x030700A0 && flag == (METH_FASTCALL | METH_KEYWORDS)) {
                                        ^
                                        /* DISABLES CODE */ ( )
   1 error generated.
   error: command 'gcc' failed with exit status 1

·· Failed to build the project and import the benchmark suite.

@WillAyd
Copy link
Member

WillAyd commented Aug 6, 2020 via email

@jreback jreback merged commit 25ff715 into pandas-dev:master Aug 6, 2020
@jreback
Copy link
Contributor

jreback commented Aug 6, 2020

thanks @mroeschke

@mroeschke mroeschke deleted the cln/get_flattened_iterator branch August 6, 2020 23:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants