
BUG: passing DataFrame to make_block silently raises #28275


Closed
jbrockmendel wants to merge 13 commits

Conversation

jbrockmendel (Member)

ATM, in the except NotImplementedError: branch of _cython_agg_blocks, we take a different approach to the operation. If that approach succeeds, we end up passing a DataFrame to make_block on L189, which will raise ValueError. As a result, we fall back to python-space for the entire operation, which presumably entails a performance hit.

This fixes the incorrect passing of a DataFrame, but the fix is kind of kludgy. Suggestions welcome on how to improve it.
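One possible shape for such a fix, sketched with a hypothetical helper name (_unwrap_agg_result is not the PR's actual diff): if the per-block fallback hands back a DataFrame, unwrap it to an array before building the new block, since make_block expects array-like values and raises ValueError on a DataFrame.

import numpy as np
import pandas as pd

def _unwrap_agg_result(result):
    # Hypothetical helper, not the PR's actual code: coerce a DataFrame
    # produced by the fallback into a 1D array that make_block accepts.
    if isinstance(result, pd.DataFrame):
        # The fallback aggregated a single block, so this sketch assumes
        # the result frame has exactly one column.
        assert result.shape[1] == 1
        result = result.iloc[:, 0]
    return np.asarray(result)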

@WillAyd (Member) commented Sep 4, 2019

Hmm, so you are saying this improves mixed-block performance where some blocks can be cythonized, right?

This adds a little bit of logic, so it would be good to see that measured somehow to understand if it's worth it.
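One way to measure that, sketched in asv's benchmark style (a hypothetical benchmark, not one from pandas' existing asv suite): a frame with many numpy-backed blocks plus a single extension-array block, aggregated per group.

import numpy as np
import pandas as pd

class MixedBlockGroupbyAgg:
    # Hypothetical asv-style benchmark: the cython path handles the float
    # blocks, while the Categorical block exercises the fallback branch.
    def setup(self):
        n = 100_000
        rng = np.random.default_rng(42)
        data = {f"col{i}": rng.standard_normal(n) for i in range(10)}
        data["cat"] = pd.Categorical(rng.choice(["a", "b", "c"], n))
        data["key"] = rng.integers(0, 100, n)
        self.df = pd.DataFrame(data)

    def time_first(self):
        self.df.groupby("key").first()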

@jbrockmendel (Member, Author)

Hmm so you are saying this improves mixed block performance where some can be cythonized right?

I'm having a hard time cooking up an example, but roughly what I have in mind is a case with

  1. a bunch of non-EA blocks and then
  2. one EA block such that
    a) NotImplementedError is raised at L164 (L160 in PR) and
    b) the fallback on L179 (L180 in PR) goes through OK, so that
    c) in master, we get a ValueError when passing result to block.make_block on L180

In such a case we would end up falling back to the non-cython implementations for all blocks instead of just the one relevant EA block.
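For concreteness, a minimal sketch of that kind of frame (an illustration of the described setup, not a confirmed reproducer):

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "key": ["a", "a", "b", "b"],
        # Plain numpy-backed float64 columns (non-EA blocks)...
        "x": np.arange(4, dtype="float64"),
        "y": np.arange(4, dtype="float64") * 2,
        # ...plus one EA block whose cython aggregation may raise
        # NotImplementedError and take the fallback path.
        "cat": pd.Categorical(["lo", "hi", "lo", "hi"]),
    }
)

# In master, a ValueError from passing the fallback's DataFrame result to
# make_block would push all blocks, not just "cat", into python-space.
grouped = df.groupby("key")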

try:
    result = s.aggregate(lambda x: alt(x, axis=self.axis))
except TypeError:
    # we may have an exception in trying to aggregate
    # continue and exclude the block
    deleted_items.append(locs)
    continue

if is_object_dtype(block.dtype) and how in ["prod", "cumprod", "sum"]:
Contributor

A way around this is to have .aggregate raise different exceptions for flow control (make some private ones that inherit from NotImplementedError); we do this in the cython aggregator now (well, we raise NotImplementedError).
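A minimal sketch of that suggestion, with a hypothetical private class name (this is not pandas' actual internals):

import numpy as np

class _CythonAggNotImplemented(NotImplementedError):
    # Hypothetical private sentinel used purely for internal flow control;
    # callers can catch it without also swallowing third-party
    # NotImplementedErrors.
    pass

def _cython_agg(values, how):
    # Stand-in for the fast path: refuse object dtype up front.
    if values.dtype == object:
        raise _CythonAggNotImplemented(how)
    return getattr(np, how)(values)

def agg(values, how):
    try:
        return _cython_agg(values, how)
    except _CythonAggNotImplemented:
        # Only our own sentinel selects the slow path; a NotImplementedError
        # raised inside numpy or user code propagates unchanged.
        py_fallback = {"sum": sum, "min": min, "max": max}
        return py_fallback[how](values.tolist())

Since the sentinel subclasses NotImplementedError, existing broad handlers keep working while new code can opt into the narrower catch.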

Member Author

Yeah, distinguishing between our NotImplementedError and numpy's is basically what #28198 does (albeit only in one place). I think the groupby catching/raising logic is complicated enough that we need to do a few passes clearing up what gets caught where, then go through and do the catching in fewer places.

@jbrockmendel (Member, Author)

Closing to clear the queue. Will revisit.

jbrockmendel deleted the gbtry2 branch on April 5, 2020.