Skip to content

PERF: period factorization #14348

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 5 commits into from

Conversation

chris-b1
Copy link
Contributor

@chris-b1 chris-b1 commented Oct 5, 2016

Merged in #14419


asv

   before     after       ratio
  [c41c6511] [96b364a4]
-     2.44s    46.28ms      0.02  groupby.groupby_period.time_groupby_sum

cc @sinhrks , @bmoscon

@codecov-io
Copy link

codecov-io commented Oct 5, 2016

Current coverage is 85.26% (diff: 100%)

Merging #14348 into master will decrease coverage by <.01%

@@             master     #14348   diff @@
==========================================
  Files           140        140          
  Lines         50634      50634          
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
- Hits          43173      43172     -1   
- Misses         7461       7462     +1   
  Partials          0          0          

Powered by Codecov. Last update d98e982...a77ccc2

@jorisvandenbossche jorisvandenbossche added Performance Memory or execution speed performance Period Period data type labels Oct 5, 2016
@jorisvandenbossche jorisvandenbossche added this to the 0.19.1 milestone Oct 5, 2016
@@ -311,10 +313,7 @@ def factorize(values, sort=False, order=None, na_sentinel=-1, size_hint=None):
uniques, labels = safe_sort(uniques, labels, na_sentinel=na_sentinel,
assume_unique=True)

if is_datetimetz_type:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not obvious from the diff, but this check was redundant because the datetimetz values will get re-boxed at
https://github.com/chris-b1/pandas/blob/cf081d9f0216fe7c069369f2673ffde2db669704/pandas/core/algorithms.py#L320

# GH 14338
goal_time = 0.2

def setup(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add similar benches for dti and dti w/tz

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ideally inherit from a common class

if is_period_dtype(values):
values = PeriodIndex(values)
# period array interface goes to object so intercept
vals = values.view(np.int64)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use values.asi8

values = DatetimeIndex(values)
vals = values.asi8

if is_period_dtype(values):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make an elif

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add a comment here about what is going on (helpful to future readers)

Copy link
Contributor

@jreback jreback Oct 5, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually this might be better to do something like this:

start around line 290

if needs_i8_conversion(values):
    dtype = values.dtype
    values = values.asi8
else:
    dtype = None
    values = np.asarray(values)

then the reverse conversion is

if dtype is not None:
     uniques = uniques.astype(dtype)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did something like this, but it still ends up being fairly complex because ext dtypes need special casing, let me know if I'm missing a clearer way to do it.

@bmoscon
Copy link
Contributor

bmoscon commented Oct 5, 2016

thanks for taking care of this. It looks like this is targeted for 0.19.1; any rough idea when that will be released?

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Oct 6, 2016

Looks good to me. We only should include a whatsnew notice, depending on the outcome of #14352 (adding a whatsnew file to put it in in #14366)

@chris-b1 chris-b1 force-pushed the period-factorization branch from 855357b to a77ccc2 Compare October 12, 2016 00:14
@jorisvandenbossche
Copy link
Member

Looks good to me! @jreback

@jorisvandenbossche
Copy link
Member

@chris-b1 whoops, sorry, closed this and I don't see a reopen button ..

@chris-b1
Copy link
Contributor Author

Strange, I can't reopen either, will just make a new PR

@chris-b1 chris-b1 mentioned this pull request Oct 13, 2016
4 tasks
@chris-b1
Copy link
Contributor Author

#14419

@jorisvandenbossche
Copy link
Member

Possibly some glitches due to the repo transfer. Older PRs are also stuck on "Checking for ability to merge automatically…" (that was the reason I wanted to close/reopen this one to retrigger it)

@jorisvandenbossche jorisvandenbossche modified the milestones: No action, 0.19.1 Oct 13, 2016
@chris-b1 chris-b1 deleted the period-factorization branch November 30, 2016 01:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Memory or execution speed performance Period Period data type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PERF: Period factorization very slow in 0.19.0
5 participants