[ArrayManager] REF: Implement concat with reindexing #39612
`pandas/core/dtypes/concat.py` (hunk `@@ -5,27 +5,90 @@`, shown post-change):

```python
import numpy as np

from pandas._libs import lib
from pandas._typing import ArrayLike, DtypeObj

from pandas.core.dtypes.cast import find_common_type
from pandas.core.dtypes.common import (
    is_bool_dtype,
    is_categorical_dtype,
    is_dtype_equal,
    is_extension_array_dtype,
    is_integer_dtype,
    is_sparse,
)
from pandas.core.dtypes.generic import ABCCategoricalIndex, ABCSeries
from pandas.core.dtypes.missing import na_value_for_dtype

from pandas.core.arrays import ExtensionArray
from pandas.core.arrays.sparse import SparseArray
from pandas.core.construction import array, ensure_wrapped_if_datetimelike


class NullArrayProxy:
    """
    Proxy object for an all-NA array.

    Only stores the length of the array, and not the dtype. The dtype
    will only be known when actually concatenating (after determining the
    common dtype, for which this proxy is ignored).
    Using this object avoids that the internals/concat.py needs to determine
    the proper dtype and array type.
    """
```
Suggested change to the docstring:

```diff
-    Using this object avoids that the internals/concat.py needs to determine
-    the proper dtype and array type.
+    Using this object simplifies the problem (in internals/concat.py) of
+    determining the result dtype of a concatenation.
```
This is strictly speaking not fully correct, as internals/concat.py doesn't determine any result dtype itself. It just passes the arrays to dtypes/concat.py::concat_arrays, which determines the dtype. Was there something unclear in the original sentence that I can try to improve otherwise?
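To make the proxy idea concrete, here is a hedged, simplified sketch (the class and the `concat_with_proxies` helper are illustrative only, not the PR's actual `NullArrayProxy`/`concat_arrays` implementation, and it assumes a float-compatible common dtype): the proxy records only a length, the concat step finds the common dtype from the real arrays alone, and only then expands each proxy to an all-NaN array of that dtype.

```python
import numpy as np


# Illustrative sketch only -- not the PR's actual code.
class NullArrayProxy:
    """Placeholder for an all-NA column: stores only its length, no dtype."""

    def __init__(self, n: int) -> None:
        self.n = n


def concat_with_proxies(arrays):
    # The common dtype is determined from the real arrays only;
    # proxies are ignored at this step.
    real = [a for a in arrays if not isinstance(a, NullArrayProxy)]
    common = np.result_type(*real) if real else np.dtype(object)
    # Only now is each proxy expanded to an all-NaN array of the common dtype.
    expanded = [
        np.full(a.n, np.nan, dtype=common)
        if isinstance(a, NullArrayProxy)
        else a.astype(common)
        for a in arrays
    ]
    return np.concatenate(expanded)


result = concat_with_proxies([np.array([1.0, 2.0]), NullArrayProxy(2)])
# result is [1.0, 2.0, nan, nan] with dtype float64
```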
can you type this List[Any]?
Yes, that should be fine.
if we have different implementations for the array vs block manager, then let's push this code down into those areas (so it's localized). OK for a followup.
the problem is the logic is too scattered and very hard to grok where things are happening.
this looks really similar to the non-AM code. Can't it be shared?
See my original comment about it at #39612 (comment).

Yes, it's really similar, but it's also slightly different in many places.

When I tried this originally, I only checked for the null proxy object where needed (giving lots of if/elses and making it hard to read). But of course I could also simply always check for it. For the non-ArrayManager cases this proxy object will never be present, so apart from doing a bunch of unnecessary checks, it shouldn't matter.

If you are OK with complicating the base concat_compat with a bunch of handling for the proxy objects (even when not needed), I am also certainly fine with combining both and not writing the separate concat_arrays.
is this behavior consistent between AM and BM?
Hmm, not sure anymore where I saw this, as for both BlockManager/ArrayManager this gives object dtype with both Series and DataFrame (at least with the latest version of this PR). So I will remove the comment.
Ah, found my notes about this: this is not directly related to Block vs ArrayManager, but something I noticed here that is inconsistent between Series vs DataFrame and empty vs non-empty.

For Series with non-empty data we actually coerce to numeric, while if we have empty bool + float, we get object dtype:

```python
In [7]: pd.concat([pd.Series([True], dtype=bool), pd.Series([1.0], dtype=float)])
Out[7]:
0    1.0
0    1.0
dtype: float64

In [8]: pd.concat([pd.Series([], dtype=bool), pd.Series([], dtype=float)])
Out[8]: Series([], dtype: object)
```

DataFrame, by contrast, uses object dtype in both cases. (I will open a separate issue about it.)

But the reason this came up specifically for ArrayManager is that we now follow the same rules for Series/DataFrame, and thus also get numeric for non-empty in the DataFrame case.
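As a side note, the non-empty coercion described here is consistent with NumPy's own promotion rules for value concatenation, which can be checked without any pandas-specific assumptions:

```python
import numpy as np

# bool + float arrays: NumPy promotes to the common numeric dtype (float64),
# matching the non-empty Series behaviour shown in the In [7] example
out = np.concatenate([np.array([True]), np.array([1.0])])
# out -> array([1., 1.]) with dtype float64
```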
Opened #39817 about this
see my comments above; the non-internals code should ideally not have these hacks here. This makes it really hard to follow anything. Can you do something about this?
Shall I simply cut and paste this function to internals/concat.py, so that all is "solved"? (Then the null array proxy is not used outside of internals.)

(Note that my question is slightly ironic: it doesn't change anything fundamentally, it's just semantics about what is considered "internal". But I don't really understand what the problem here is.)
`pandas/core/internals/array_manager.py`:

```diff
@@ -18,6 +18,7 @@
     is_extension_array_dtype,
     is_numeric_dtype,
 )
+from pandas.core.dtypes.concat import NullArrayProxy
 from pandas.core.dtypes.dtypes import ExtensionDtype, PandasDtype
 from pandas.core.dtypes.generic import ABCDataFrame, ABCSeries
 from pandas.core.dtypes.missing import isna
```
```diff
@@ -725,10 +726,20 @@ def reindex_indexer(
         # ignored keywords
         consolidate: bool = True,
         only_slice: bool = False,
+        # ArrayManager specific keywords
+        do_integrity_check=True,
+        use_na_proxy=False,
     ) -> T:
         axis = self._normalize_axis(axis)
         return self._reindex_indexer(
-            new_axis, indexer, axis, fill_value, allow_dups, copy
+            new_axis,
+            indexer,
+            axis,
+            fill_value,
+            allow_dups,
+            copy,
+            do_integrity_check,
+            use_na_proxy,
         )

     def _reindex_indexer(
```
```diff
@@ -739,6 +750,8 @@ def _reindex_indexer(
         fill_value=None,
         allow_dups: bool = False,
         copy: bool = True,
+        do_integrity_check=True,
+        use_na_proxy=False,
     ) -> T:
         """
         Parameters
```
```diff
@@ -773,7 +786,9 @@ def _reindex_indexer(
         new_arrays = []
         for i in indexer:
             if i == -1:
-                arr = self._make_na_array(fill_value=fill_value)
+                arr = self._make_na_array(
+                    fill_value=fill_value, use_na_proxy=use_na_proxy
+                )
             else:
                 arr = self.arrays[i]
             new_arrays.append(arr)
```
```diff
@@ -793,7 +808,7 @@ def _reindex_indexer(
         new_axes = list(self._axes)
         new_axes[axis] = new_axis

-        return type(self)(new_arrays, new_axes)
+        return type(self)(new_arrays, new_axes, do_integrity_check=do_integrity_check)

     def take(self, indexer, axis: int = 1, verify: bool = True, convert: bool = True):
         """
```
```diff
@@ -820,10 +835,11 @@ def take(self, indexer, axis: int = 1, verify: bool = True, convert: bool = True
             new_axis=new_labels, indexer=indexer, axis=axis, allow_dups=True
         )

-    def _make_na_array(self, fill_value=None):
+    def _make_na_array(self, fill_value=None, use_na_proxy=False):
+        if use_na_proxy:
+            return NullArrayProxy(self.shape_proper[0])
+
         if fill_value is None:
             fill_value = np.nan

         dtype, fill_value = infer_dtype_from_scalar(fill_value)
         values = np.empty(self.shape_proper[0], dtype=dtype)
         values.fill(fill_value)
```

do we need ArrayLike to include NullArrayProxy?

is the rest of this method ndarray-only?

Yes
`pandas/core/internals/concat.py`:

```diff
@@ -23,7 +23,7 @@
     is_sparse,
     is_timedelta64_dtype,
 )
-from pandas.core.dtypes.concat import concat_compat
+from pandas.core.dtypes.concat import concat_arrays, concat_compat
 from pandas.core.dtypes.missing import isna_all

 import pandas.core.algorithms as algos
```
```diff
@@ -37,6 +37,51 @@
 from pandas.core.arrays.sparse.dtype import SparseDtype


+def concatenate_array_managers(
+    mgrs_indexers, axes: List[Index], concat_axis: int, copy: bool
+) -> Manager:
+    """
+    Concatenate array managers into one.
+
+    Parameters
+    ----------
+    mgrs_indexers : list of (ArrayManager, {axis: indexer,...}) tuples
+    axes : list of Index
+    concat_axis : int
+    copy : bool
+
+    Returns
+    -------
+    ArrayManager
+    """
+    # reindex all arrays
+    mgrs = []
+    for mgr, indexers in mgrs_indexers:
+        for ax, indexer in indexers.items():
+            mgr = mgr.reindex_indexer(
+                axes[ax],
+                indexer,
+                axis=ax,
+                allow_dups=True,
+                do_integrity_check=False,
+                use_na_proxy=True,
+            )
+        mgrs.append(mgr)
+
+    # concatting along the rows -> concat the reindexed arrays
+    if concat_axis == 1:
+        arrays = [
+            concat_arrays([mgrs[i].arrays[j] for i in range(len(mgrs))])
+            for j in range(len(mgrs[0].arrays))
+        ]
+        return ArrayManager(arrays, [axes[1], axes[0]], do_integrity_check=False)
+    # concatting along the columns -> combine reindexed arrays in a single manager
+    else:
+        assert concat_axis == 0
+        arrays = list(itertools.chain.from_iterable([mgr.arrays for mgr in mgrs]))
+        return ArrayManager(arrays, [axes[1], axes[0]], do_integrity_check=False)
+
+
 def concatenate_block_managers(
     mgrs_indexers, axes: List[Index], concat_axis: int, copy: bool
 ) -> Manager:
```

can you now remove concat_compat?

and reading your comment below, why is this not in array manager if it's only used there?

Because this is the code for concatting managers, which for BlockManager also resides in internals/concat.py?
```diff
@@ -55,19 +100,7 @@ def concatenate_block_managers(
     BlockManager
     """
     if isinstance(mgrs_indexers[0][0], ArrayManager):
-
-        if concat_axis == 1:
-            # TODO for now only fastpath without indexers
-            mgrs = [t[0] for t in mgrs_indexers]
-            arrays = [
-                concat_compat([mgrs[i].arrays[j] for i in range(len(mgrs))], axis=0)
-                for j in range(len(mgrs[0].arrays))
-            ]
-            return ArrayManager(arrays, [axes[1], axes[0]])
-        elif concat_axis == 0:
-            mgrs = [t[0] for t in mgrs_indexers]
-            arrays = list(itertools.chain.from_iterable([mgr.arrays for mgr in mgrs]))
-            return ArrayManager(arrays, [axes[1], axes[0]])
+        return concatenate_array_managers(mgrs_indexers, axes, concat_axis, copy)

     concat_plans = [
         _get_mgr_concatenation_plan(mgr, indexers) for mgr, indexers in mgrs_indexers
```
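The two branches of `concatenate_array_managers` can be mimicked with plain lists of 1-D NumPy arrays. The sketch below is a hedged illustration (the `concat_managers` helper is made up for this example, not pandas code): along the rows, the j-th column of every manager is concatenated; along the columns, the managers' column lists are simply chained together.

```python
import itertools

import numpy as np


def concat_managers(columns_per_mgr, concat_axis):
    # columns_per_mgr: one list of 1-D arrays (columns) per "manager"
    if concat_axis == 1:
        # stack rows: concatenate the j-th column across all managers
        ncols = len(columns_per_mgr[0])
        return [
            np.concatenate([cols[j] for cols in columns_per_mgr])
            for j in range(ncols)
        ]
    # concat_axis == 0: add columns by chaining the column lists together
    return list(itertools.chain.from_iterable(columns_per_mgr))


m1 = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
m2 = [np.array([5.0, 6.0]), np.array([7.0, 8.0])]
rows = concat_managers([m1, m2], concat_axis=1)  # 2 columns of length 4
cols = concat_managers([m1, m2], concat_axis=0)  # 4 columns of length 2
```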
another option if we don't want a whole new thing would be to use np.empty(shape, dtype="V") (which has obj.nbytes = 0)

That's indeed an option as well. It wouldn't really simplify the code, as we would still need all of what is now in to_array and all the checking for this object/dtype in concat_arrays, but it would indeed avoid the custom class.

(Personally, I find the custom class a bit more explicit than using a dtype we otherwise don't use, so I would choose that, but I am fine either way.)