PERF: Implement DataFrame-with-scalar ops block-wise #28583


Closed
wants to merge 36 commits

Conversation

jbrockmendel
Member

One of four cases we'll need to implement (the others being Series-align-index, Series-align-columns, and DataFrame).

~670x speedup on the fastest ops, ~8x on the slower end.

In [3]: arr = np.arange(10**5).reshape(100, 1000)
In [4]: df = pd.DataFrame(arr)
In [5]: %timeit df + 1
198 ms ± 2.17 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)      # <-- master
294 µs ± 3.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)  # <-- PR

In [8]: ts = pd.Timestamp.now("UTC")
In [9]: df2 = pd.DataFrame(arr.view("timedelta64[ns]"))
In [10]: %timeit ts - df2
319 ms ± 2.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)     # <-- master
40.2 ms ± 622 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)   # <-- PR

The 2D DTA and TDAs that get created exist only briefly, and the changes to these classes are about as minimal as I can make them.


@@ -76,6 +76,17 @@
"""


def compat_2d(meth):
Contributor

Presumably this is going to be applied to other methods (else you'd just do the reshaping in _add_offset)?

Member Author

I implemented this at a stage when it was needed in several places (for e.g. getting __repr__ to work in debugging), then trimmed the usages back to the bare minimum. So we could get by without it now.

Contributor

Hmm, OK good to know. Let's hear what others prefer (my slight preference is to inline it in the one place it's used).

return result.reshape(self.shape)
return meth(self, *args, **kwargs)

new_meth.__name__ = meth.__name__
Contributor

Use functools.wraps instead?

Member Author

sure
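For reference, a minimal sketch of what the decorator could look like with `functools.wraps` (toy class and names here are hypothetical, not the actual pandas internals):

```python
import functools

import numpy as np


def compat_2d(meth):
    # functools.wraps preserves __name__, __doc__, etc., instead of
    # copying __name__ by hand as in the original diff
    @functools.wraps(meth)
    def new_meth(self, *args, **kwargs):
        if self.ndim > 1:
            # operate on a flattened 1D view, then restore the 2D shape
            result = meth(self.ravel(), *args, **kwargs)
            return result.reshape(self.shape)
        return meth(self, *args, **kwargs)

    return new_meth


class Toy2D(np.ndarray):
    """Hypothetical 1D-only array made 2D-tolerant via the decorator."""

    @compat_2d
    def doubled(self):
        # the wrapped method only ever sees 1D input
        return self * 2


arr = np.arange(4).view(Toy2D).reshape(2, 2)
out = arr.doubled()
```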

# 2-dim
DatetimeArray(arr.reshape(2, 2))
# 2-dim allowed for ops compat
DatetimeArray(arr.reshape(2, 2))
Contributor

Do we care to make any assertion about the ._data of this result?

Member Author

sure, will update

assert new_vals.shape == (blk.shape[-1],)
nb = make_block(new_vals, placement=blk.mgr_locs, ndim=2)
new_blocks.append(nb)
elif blk.values.ndim == 1:
Contributor

I'm curious: what hits this case? An op on an EA that returns an ndarray? Say DataFrame[Categorical] == 0?

Member Author

Exactly. Block[Categorical].values == 0 needs to become a 2D ndarray[bool] block before we're done.
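As a toy illustration of that branch (stand-in values, not the real internals code): the comparison on an EA-backed block's 1D values yields a 1D boolean ndarray, which has to be reshaped to the Block's 2D layout:

```python
import numpy as np

# Stand-in for Block[Categorical].values == 0: EA-backed blocks store 1D
# values, so the comparison result is a plain 1D boolean ndarray.
ea_result = np.array([True, False, True, False])

# Consolidated blocks are 2D with shape (n_block_columns, n_rows), so the
# result is reshaped to a single row before make_block(..., ndim=2).
block_vals = ea_result.reshape(1, -1)
```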

new_vals = array_op(blk_vals, right, func, str_rep, eval_kwargs)

# Reshape for EA Block
if is_extension_array_dtype(new_vals.dtype):
Contributor

For my own understanding: the if and the elif would be unnecessary if all EAs were allowed to be 2D?

Member Author

Yep. Everything from 489-524 would boil down to something like:

new_vals = array_op(blk.values.T, right, func, str_rep, eval_kwargs)
nb = blk.make_block(new_vals.T)
new_blocks.append(nb)
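A self-contained toy version of that simplified loop, with hypothetical stand-ins for Block/BlockManager (not the real internals API):

```python
import numpy as np


class ToyBlock:
    """Hypothetical stand-in for an internals Block holding 2D values."""

    def __init__(self, values):
        self.values = values  # shape (n_block_columns, n_rows)

    def make_block(self, new_values):
        return ToyBlock(new_values)


def operate_blockwise(blocks, right, array_op):
    # apply the op column-major (hence the transposes), one block at a time
    new_blocks = []
    for blk in blocks:
        new_vals = array_op(blk.values.T, right)
        new_blocks.append(blk.make_block(new_vals.T))
    return new_blocks


blocks = [ToyBlock(np.arange(6).reshape(2, 3))]
result = operate_blockwise(blocks, 1, lambda left, right: left + right)
```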

@jbrockmendel
Member Author

There are three more cases after this: Series[axis="rows"], Series[axis="columns"], and DataFrame. The axis="rows" case can share most of the implementation here. The other two cases, not so much.

Question: do we want to do this in 4ish steps, or iterate in this PR until we get them all implemented?

@@ -361,7 +372,7 @@ def __init__(self, values, dtype=_NS_DTYPE, freq=None, copy=False):
"ndarray, or Series or Index containing one of those."
)
raise ValueError(msg.format(type(values).__name__))
if values.ndim != 1:
if values.ndim not in [1, 2]:
Member

Why was this changed? Seems to conflict with the error message directly below

Member

Applicable in a few places

Member Author

Yah, this is kludge-adjacent. We don't really support 2D, so we don't want to tell users it's OK.

@@ -5290,9 +5290,11 @@ def _combine_match_columns(self, other: Series, func, level=None):
new_data = ops.dispatch_to_series(left, right, func, axis="columns")
return left._construct_result(right, new_data, func)

def _combine_const(self, other, func):
def _combine_const(self, other, func, str_rep=None, eval_kwargs=None):
Member

Can you annotate new parameters?

Member Author

good idea

mgr = left._data
for blk in mgr.blocks:
# Reshape for EA Block
blk_vals = blk.values
Member

Should this be using to_numpy instead of .values? Not super well versed on what types actually get preserved this way

Member Author

we specifically want to get the .values attribute, which can be either an ndarray or an EA (also, I don't think Block has to_numpy)

@jreback
Contributor

jreback commented Sep 24, 2019

will have a look soon.

@@ -511,7 +562,7 @@ def wrapper(self, other):
lvalues = extract_array(self, extract_numpy=True)
rvalues = extract_array(other, extract_numpy=True)

res_values = comparison_op(lvalues, rvalues, op)
res_values = comparison_op(lvalues, rvalues, op, None, {})
Contributor

@topper-123 Oct 2, 2019

For readability, I'd prefer the last two arguments to be keyword arguments (it's relatively self-evident what the first three arguments are for, but IMO that's not the case for the last two). Same for the calls to comparison_op below.

Member Author

Will do.

@@ -217,7 +227,7 @@ def arithmetic_op(


def comparison_op(
left: Union[np.ndarray, ABCExtensionArray], right: Any, op
left: Union[np.ndarray, ABCExtensionArray], right: Any, op, str_rep, eval_kwargs
Contributor

str_rep and eval_kwargs don't seem to be used here, is that right? Could they be removed, or have default arguments of None?

Contributor

Also, if they need to stay, could you add type hints in order to ease understanding? I assume they have types Optional[str] and dict, respectively?

Member Author

Will add types, double-check whether these can be removed.

if lib.is_scalar(right) or np.ndim(right) == 0:

new_blocks = []
Contributor

I would rather do this in the block manager, no? (maybe put it in internals/ops.py) We should actually move all block-type ops there (e.g. from groupby as well). Or isolate this code in ops somewhere (maybe internals.py), so things are separated.

Member Author

Generally agree on getting Block/BlockManager-touching things isolated in internals. Will give some thought to how/when to move this. ATM this is in Proof of Concept phase while I figure out how to handle the remaining cases.

@jreback jreback added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Oct 2, 2019
@jbrockmendel
Member Author

Closing to clear the queue; I'll be opening something similar (but implemented largely in internals) before long.
