Skip to content

ENH: Optimize putmask implementation for CoW #51268

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Feb 10, 2023

Conversation

phofl
Copy link
Member

@phofl phofl commented Feb 9, 2023

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a EABackedBlock.setitem as well, that probably needs to be updated to get the same signature as Block.setitem?

@@ -974,10 +974,19 @@ def setitem(self, indexer, value) -> Block:
# checking lib.is_scalar here fails on
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't comment on it, but a few lines above we have:

        except LossySetitemError:
            # current dtype cannot store value, coerce to common dtype
            nb = self.coerce_to_target_dtype(value)
            return nb.setitem(indexer, value)

Is it theoretically possible that the dtype cannot store the value, but that the astype operation can still be done with a view? And should we in that case pass through using_cow=using_cow to the recursive setitem call?

I can't directly think of a case (maybe an EA that is backed by object array but can't hold the object you are setting?), but maybe the safest option is to just pass it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm I was thinking about it this as well, but I could not think of a case where we would get a LossySetitemError and then would not make a copy in the astype. Can add it if you prefer anyways


if using_cow and self.refs.has_reference():
values = values.copy()
self = type(self)(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems we have a mix of using type(self)(..) and self.make_block_same_class(..). The latter avoids having to pass mgr_locs and ndim)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, switched.

Comment on lines +1017 to 1019
if using_cow:
return [self.copy(deep=False)]
return [self]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It also doesn't necessarily harm to always do return [self.copy(deep=False)] to avoid the extra if branch?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so, but it could also mess up something I am not thinking about right now. I‘d like to keep as is if you don’t feel strongly about it. Could our caching mechanism be a problem (only theoretically, because we should clear the cache when getting here I think), but there might be edge cases we are missing right now

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's fine to leave as is for now. It will also be an easy clean-up once CoW is enforced.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep exactly my thinking as well

putmask_without_repeat(values.T, mask, casted)
if using_cow:
return [self.copy(deep=False)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to shallow copy if we already did a hard copy above?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, but the intermediate block will go out of scope, so making a shallow copy here will still only have a single ref in that case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly, it does not hurt, that’s why I wanted to avoid making the check more complex

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is one typing error:

pandas/core/internals/blocks.py:1615: error: Incompatible types in assignment (expression has type "Block", variable has type "EABackedBlock") [assignment]

@phofl phofl added this to the 2.0 milestone Feb 10, 2023
@jorisvandenbossche
Copy link
Member

The failing test in the CoW build is the flaky test_file_descriptor_leak

@phofl
Copy link
Member Author

phofl commented Feb 10, 2023

CoW failure unrelated (flaky test)

@jorisvandenbossche jorisvandenbossche merged commit fe0cc48 into pandas-dev:main Feb 10, 2023
@phofl
Copy link
Member Author

phofl commented Feb 10, 2023

:) funny timing

@phofl phofl deleted the cow_putmask branch February 10, 2023 18:35
@jorisvandenbossche
Copy link
Member

I was first! ;)

phofl added a commit to phofl/pandas that referenced this pull request Feb 10, 2023
@phofl
Copy link
Member Author

phofl commented Feb 10, 2023

Yep! was a tad to slow :(

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants