Skip to content

PERF: PeriodIndex.dropna, difference, take #23095

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
15 changes: 15 additions & 0 deletions asv_bench/benchmarks/period.py
Original file line number Diff line number Diff line change
@@ -96,6 +96,8 @@ def time_value_counts(self, typ):
class Indexing(object):

goal_time = 0.2
pi_with_nas = PeriodIndex(['1985Q1', 'NaT', '1985Q2'] * 1000, freq='Q')
pi_diff = PeriodIndex(['1985Q1', 'NaT', '1985Q3'] * 1000, freq='Q')
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

setup gets called with every round of time_foo. Doing this here is analogous to calling setup_class once.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

setup_cache pickles and then loads in each step of the loop


def setup(self):
self.index = PeriodIndex(start='1985', periods=1000, freq='D')
@@ -121,4 +123,17 @@ def time_intersection(self):
self.index[:750].intersection(self.index[250:])

def time_unique(self):
# GH#23083
self.index.unique()

def time_dropna(self):
# GH#23095
self.pi_with_nas.dropna()

def time_difference(self):
# GH#23095
self.pi_with_nas.difference(self.pi_diff)

def time_symmetric_difference(self):
# GH#23095
self.pi_with_nas.symmetric_difference(self.pi_diff)
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v0.24.0.txt
Original file line number Diff line number Diff line change
@@ -701,7 +701,7 @@ Performance Improvements
(:issue:`21372`)
- Improved the performance of :func:`pandas.get_dummies` with ``sparse=True`` (:issue:`21997`)
- Improved performance of :func:`IndexEngine.get_indexer_non_unique` for sorted, non-unique indexes (:issue:`9466`)
- Improved performance of :func:`PeriodIndex.unique` (:issue:`23083`)
- Improved performance of :func:`PeriodIndex.unique`, :func:`PeriodIndex.difference`, :func:`PeriodIndex.symmetric_difference`, and :func:`PeriodIndex.dropna`, (:issue:`23083`, :issue:`23095`)


.. _whatsnew_0240.docs:
32 changes: 23 additions & 9 deletions pandas/core/indexes/base.py
Original file line number Diff line number Diff line change
@@ -519,7 +519,7 @@ def _simple_new(cls, values, name=None, dtype=None, **kwargs):
@Appender(_index_shared_docs['_shallow_copy'])
def _shallow_copy(self, values=None, **kwargs):
if values is None:
values = self.values
values = self._take_values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain this change? What has take to do with shallow_copy?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is about passing i8 instead of object to _simple_new

attributes = self._get_attributes_dict()
attributes.update(kwargs)
if not len(values) and 'dtype' not in kwargs:
@@ -763,6 +763,17 @@ def get_values(self):
"""
return self.values

@property
def _take_values(self):
"""
Values to use in `take` operation, suitable to being passed back to
_shallow_copy/_simple_new.
"""
if is_period_dtype(self):
# Avoid casting to object-dtype
return self._ndarray_values
return self.values
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomAugspurger @jorisvandenbossche is there a better name for this that might fit with EA naming conventions?

@jschendel should IntervalIndex be special-cased here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once we have all EAs, this will always dispatch to the EA, and then we don't need a _take_values anymore?
(if that is the case, I wouldn't care too much about the name now)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jschendel should IntervalIndex be special-cased here?

The implementation should be good as-is for IntervalIndex; IntervalIndex.values is an IntervalArray, which has a special-cased take method. IntervalIndex._shallow_copy/_simple_new should accept an IntervalArray (which we'd get from take), so should be good on that front.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is basically .asi8 on Indexes now, yes? Is there a reason we are not aligning on that? or changing .asi8 to this?


@Appender(IndexOpsMixin.memory_usage.__doc__)
def memory_usage(self, deep=False):
result = super(Index, self).memory_usage(deep=deep)
@@ -2158,7 +2169,7 @@ def take(self, indices, axis=0, allow_fill=True,
if allow_fill and fill_value is not None:
msg = 'Unable to fill values because {0} cannot contain NA'
raise ValueError(msg.format(self.__class__.__name__))
taken = self.values.take(indices)
taken = self._take_values.take(indices)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PeriodIndex can hold NA values, so how is influencing PeriodIndex?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_take_values will be iNaT in locations where values are NaT, after passing to shallow_copy/simple_new, we get NaT back

return self._shallow_copy(taken)

def _assert_take_fillable(self, values, indices, allow_fill=True,
@@ -2929,7 +2940,8 @@ def difference(self, other):
self._assert_can_do_setop(other)

if self.equals(other):
return self._shallow_copy([])
# pass an empty array with the appropriate dtype
return self._shallow_copy(self[:0])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you add a test for this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or this was needed due to the change in shallow_copy?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just because _shallow_copy calls _simple_new, and being "simple" should not be getting a list.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I did not add a test, but can confirm this line is hit in the tests.


other, result_name = self._convert_can_do_setop(other)

@@ -2940,12 +2952,14 @@ def difference(self, other):

label_diff = np.setdiff1d(np.arange(this.size), indexer,
assume_unique=True)
the_diff = this.values.take(label_diff)
the_diff = this._take_values.take(label_diff)
try:
the_diff = sorting.safe_sort(the_diff)
except TypeError:
pass

if is_period_dtype(this):
return this._shallow_copy(the_diff, name=result_name)
return this._shallow_copy(the_diff, name=result_name, freq=None)

def symmetric_difference(self, other, result_name=None):
@@ -2994,11 +3008,11 @@ def symmetric_difference(self, other, result_name=None):
common_indexer = indexer.take((indexer != -1).nonzero()[0])
left_indexer = np.setdiff1d(np.arange(this.size), common_indexer,
assume_unique=True)
left_diff = this.values.take(left_indexer)
left_diff = this._take_values.take(left_indexer)

# {other} minus {this}
right_indexer = (indexer == -1).nonzero()[0]
right_diff = other.values.take(right_indexer)
right_diff = other._take_values.take(right_indexer)

the_diff = _concat._concat_compat([left_diff, right_diff])
try:
@@ -3008,7 +3022,7 @@ def symmetric_difference(self, other, result_name=None):

attribs = self._get_attributes_dict()
attribs['name'] = result_name
if 'freq' in attribs:
if 'freq' in attribs and not is_period_dtype(self):
attribs['freq'] = None
return self._shallow_copy_with_infer(the_diff, **attribs)

@@ -3028,7 +3042,7 @@ def _get_unique_index(self, dropna=False):
if self.is_unique and not dropna:
return self

values = self.values
values = self._take_values

if not self.is_unique:
values = self.unique()
@@ -4678,7 +4692,7 @@ def dropna(self, how='any'):
raise ValueError("invalid how option: {0}".format(how))

if self.hasnans:
return self._shallow_copy(self.values[~self._isnan])
return self._shallow_copy(self._take_values[~self._isnan])
return self._shallow_copy()

def _evaluate_with_timedelta_like(self, other, op):
10 changes: 9 additions & 1 deletion pandas/core/indexes/multi.py
Original file line number Diff line number Diff line change
@@ -17,6 +17,7 @@
from pandas.core.dtypes.common import (
ensure_int64,
ensure_platform_int,
is_period_dtype,
is_categorical_dtype,
is_object_dtype,
is_hashable,
@@ -1036,7 +1037,14 @@ def _get_level_values(self, level, unique=False):
labels = self.labels[level]
if unique:
labels = algos.unique(labels)
filled = algos.take_1d(values._values, labels,

if is_period_dtype(values):
take_vals = values._ndarray_values
else:
# TODO: Why is this _values where elsewhere it is values?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm probably responsible for using values._values here instead of .values. AFAIK, they should be equivalent for a MultiIndex (an ndarray of tuples).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I'll clear this out in the next pass, assuming this goes through.

# if it were values here we could use _take_values
take_vals = values._values
filled = algos.take_1d(take_vals, labels,
fill_value=values._na_value)
values = values._shallow_copy(filled)
return values
6 changes: 3 additions & 3 deletions pandas/core/indexes/period.py
Original file line number Diff line number Diff line change
@@ -327,9 +327,9 @@ def __array_wrap__(self, result, context=None):
"""
if isinstance(context, tuple) and len(context) > 0:
func = context[0]
if (func is np.add):
if func is np.add:
pass
elif (func is np.subtract):
elif func is np.subtract:
name = self.name
left = context[1][0]
right = context[1][1]
@@ -350,7 +350,7 @@ def __array_wrap__(self, result, context=None):
return result
# the result is object dtype array of Period
# cannot pass _simple_new as it is
return self._shallow_copy(result, freq=self.freq, name=self.name)
return type(self)(result, freq=self.freq, name=self.name)

@property
def size(self):
3 changes: 2 additions & 1 deletion pandas/util/testing.py
Original file line number Diff line number Diff line change
@@ -824,7 +824,8 @@ def _get_ilevel_values(index, level):
# accept level number only
unique = index.levels[level]
labels = index.labels[level]
filled = take_1d(unique.values, labels, fill_value=unique._na_value)
filled = take_1d(unique._take_values, labels,
fill_value=unique._na_value)
values = unique._shallow_copy(filled, name=index.names[level])
return values