PERF: Implement DataFrame-with-scalar ops block-wise #28583
Conversation
Hello @jbrockmendel! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-10-09 23:58:19 UTC
```diff
@@ -76,6 +76,17 @@
     """


 def compat_2d(meth):
```
Presumably this is going to be applied to other methods (else you'd just do the reshaping in `_add_offset`)?
I implemented this at a stage when it was needed in several places (e.g. getting `__repr__` to work while debugging), then trimmed the usages back to the bare minimum. So we could get by without it now.
Hmm, OK good to know. Let's hear what others prefer (my slight preference is to inline it in the one place it's used).
```python
            return result.reshape(self.shape)
        return meth(self, *args, **kwargs)

    new_meth.__name__ = meth.__name__
```
Use `functools.wraps` instead?
sure
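A minimal sketch of the pattern under discussion: a `compat_2d`-style decorator that ravels a 2-D input to 1-D, calls the wrapped 1-D-only method, and restores the original shape, using `functools.wraps` as suggested. `FakeArray` is a hypothetical stand-in for the real DatetimeArray, not pandas code.

```python
import functools


class FakeArray:
    """Hypothetical stand-in for an array with shape/ravel/reshape."""

    def __init__(self, data, shape):
        self.data = list(data)
        self.shape = shape

    def ravel(self):
        return FakeArray(self.data, (len(self.data),))

    def reshape(self, shape):
        return FakeArray(self.data, shape)


def compat_2d(meth):
    @functools.wraps(meth)  # preserves __name__ etc., per the review suggestion
    def new_meth(self, *args, **kwargs):
        if len(self.shape) == 2:
            # ravel to 1-D, run the 1-D-only method, restore the 2-D shape
            result = meth(self.ravel(), *args, **kwargs)
            return result.reshape(self.shape)
        return meth(self, *args, **kwargs)

    return new_meth


@compat_2d
def double(arr):
    return FakeArray([x * 2 for x in arr.data], arr.shape)


out = double(FakeArray([1, 2, 3, 4], (2, 2)))
```

With `functools.wraps`, the manual `new_meth.__name__ = meth.__name__` assignment becomes unnecessary, and `__doc__` and friends are carried over as well.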
```diff
-    # 2-dim
+    # 2-dim allowed for ops compat
     DatetimeArray(arr.reshape(2, 2))
```
Do we care to make any assertion about the `._data` of this result?
sure, will update
```python
            assert new_vals.shape == (blk.shape[-1],)
            nb = make_block(new_vals, placement=blk.mgr_locs, ndim=2)
            new_blocks.append(nb)
        elif blk.values.ndim == 1:
```
I'm curious: what hits this case? An op on an EA that returns an ndarray? Say `DataFrame[Categorical] == 0`?
Exactly. `Block[Categorical].values == 0` needs to become a 2D `ndarray[bool]` block before we're done.
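An illustrative plain-Python sketch of the case just described (not pandas internals): a comparison on a 1-D extension-array-backed block yields a 1-D boolean result, which must be wrapped into a 2-D, one-row block before the frame can be rebuilt. The `codes` list is a hypothetical stand-in for Categorical values.

```python
# Hypothetical stand-in for Block[Categorical].values
codes = [0, 1, 0, 2]

# "values == 0" on the 1-D EA gives a 1-D boolean result
result_1d = [c == 0 for c in codes]

# Reshape (n,) -> (1, n): a 2-D block with one row, as the
# BlockManager expects for a single column's values.
block_values = [result_1d]
```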
```python
        new_vals = array_op(blk_vals, right, func, str_rep, eval_kwargs)

        # Reshape for EA Block
        if is_extension_array_dtype(new_vals.dtype):
```
For my own understanding: the `if` and the `elif` would be unnecessary if all EAs were allowed to be 2D?
Yep. Everything from 489-524 would boil down to something like:

```python
new_vals = array_op(blk.values.T, right, func, str_rep, eval_kwargs)
nb = blk.make_block(new_vals.T)
new_blocks.append(nb)
```
There are three more cases after this: Series with axis="rows", Series with axis="columns", and DataFrame. The axis="rows" case can share most of the implementation here; the other two cases, not so much. Question: do we want to do this in 4-ish steps, or iterate in this PR until we get them all implemented?
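A pure-Python sketch (illustrative names, not the real pandas internals) of the simplified transpose-op-transpose loop envisioned above, should all EAs support 2-D:

```python
def transpose(m):
    # lists-of-lists transpose, standing in for ndarray .T
    return [list(row) for row in zip(*m)]


def array_op(values, right):
    # elementwise "== right", standing in for the real array_op
    return [[v == right for v in row] for row in values]


# A 2-D block: rows correspond to frame columns
blk_values = [[1, 0, 2], [0, 2, 1]]

# Transpose so the op sees column-major data, then transpose back
new_vals = array_op(transpose(blk_values), 0)
nb = transpose(new_vals)
```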
```diff
@@ -361,7 +372,7 @@ def __init__(self, values, dtype=_NS_DTYPE, freq=None, copy=False):
             "ndarray, or Series or Index containing one of those."
         )
         raise ValueError(msg.format(type(values).__name__))
-    if values.ndim != 1:
+    if values.ndim not in [1, 2]:
```
Why was this changed? It seems to conflict with the error message directly below.
Applicable in a few places.
Yah, this is kludge-adjacent. We don't really support 2D, so we shouldn't tell users it's OK.
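A sketch of the compromise being discussed (illustrative, not the exact pandas code): internally accept both 1-D and 2-D values, while the user-facing error message continues to describe only 1-D input as supported.

```python
class Values:
    """Hypothetical stand-in for an array exposing .ndim."""

    def __init__(self, ndim):
        self.ndim = ndim


def check_ndim(values):
    # 2-D is tolerated internally for ops compat, but the message
    # deliberately only advertises 1-D to users.
    if values.ndim not in (1, 2):
        raise ValueError("Only 1-dimensional input arrays are supported.")
    return values


ok_1d = check_ndim(Values(1)).ndim
ok_2d = check_ndim(Values(2)).ndim
```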
pandas/core/frame.py (Outdated)
```diff
@@ -5290,9 +5290,11 @@ def _combine_match_columns(self, other: Series, func, level=None):
         new_data = ops.dispatch_to_series(left, right, func, axis="columns")
         return left._construct_result(right, new_data, func)

-    def _combine_const(self, other, func):
+    def _combine_const(self, other, func, str_rep=None, eval_kwargs=None):
```
Can you annotate new parameters?
good idea
```python
        mgr = left._data
        for blk in mgr.blocks:
            # Reshape for EA Block
            blk_vals = blk.values
```
Should this be using `to_numpy` instead of `.values`? Not super well versed on what types actually get preserved this way.
We specifically want the `.values` attribute, which can be either an ndarray or an EA (also, I don't think Block has `to_numpy`).

Will have a look soon.
```diff
@@ -511,7 +562,7 @@ def wrapper(self, other):
         lvalues = extract_array(self, extract_numpy=True)
         rvalues = extract_array(other, extract_numpy=True)

-        res_values = comparison_op(lvalues, rvalues, op)
+        res_values = comparison_op(lvalues, rvalues, op, None, {})
```
For readability, I'd prefer the last two arguments to be passed as keyword arguments (it's relatively self-evident what the first three arguments are for, but that's IMO not the case for the last two). Same for the calls to `comparison_op` below.
Will do.
```diff
@@ -217,7 +227,7 @@ def arithmetic_op(


 def comparison_op(
-    left: Union[np.ndarray, ABCExtensionArray], right: Any, op
+    left: Union[np.ndarray, ABCExtensionArray], right: Any, op, str_rep, eval_kwargs
```
`str_rep` and `eval_kwargs` don't seem to be used here, is that right? Could they be removed, or have default arguments of `None`?
Also, if they need to stay, could you add type hints in order to ease understanding? I assume they have types `Optional[str]` and `dict`, respectively?
Will add types and double-check whether these can be removed.
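A sketch of the signature change being discussed: give `str_rep` and `eval_kwargs` type hints and `None` defaults so call sites can omit them or pass them by keyword. The dummy body and simplified `left` type stand in for the real comparison logic; names beyond the signature are illustrative.

```python
import operator
from typing import Any, Optional


def comparison_op(left, right: Any, op,
                  str_rep: Optional[str] = None,
                  eval_kwargs: Optional[dict] = None):
    # Dummy elementwise comparison standing in for the real dispatch.
    eval_kwargs = eval_kwargs or {}
    return [op(lv, right) for lv in left]


# Call sites can now pass the extra arguments by keyword, or omit them:
res = comparison_op([1, 2, 3], 2, operator.lt, str_rep="<", eval_kwargs={})
```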
```python
    if lib.is_scalar(right) or np.ndim(right) == 0:

        new_blocks = []
```
I would rather actually do this in the BlockManager, no? (Maybe put it in internals/ops.py.) We should actually move all block-type ops there (e.g. from groupby as well). OR isolate this code in ops somewhere (maybe internals.py), so things are separated.
Generally agree on getting Block/BlockManager-touching things isolated in internals. Will give some thought to how/when to move this. ATM this is in the proof-of-concept phase while I figure out how to handle the remaining cases.
Closing to clear the queue; I'll be opening something similar (but implemented largely in internals) before long.
One of four cases we'll need to implement (the others being Series-align-index, Series-align-columns, and DataFrame).
~670x speedup on the fastest ops, ~8x on the slower end.
The 2D DatetimeArrays (DTA) and TimedeltaArrays (TDA) that get created exist only briefly, and the changes to these classes are about as minimal as I can make them.