PERF: improve DTI string parse #13692

Closed
wants to merge 1 commit into from

@sinhrks sinhrks commented Jul 18, 2016

Cleaned up the DatetimeIndex constructor by removing the slower string-parsing path.

Performance Improvement

Related to #7599: the constructor now always uses to_datetime internally, since it tries some fast paths.

inp = np.array(['2011-01-01 09:00' for i in range(10000)])

# on current master
%timeit pd.DatetimeIndex(inp)
1 loops, best of 3: 3.41 s per loop
%timeit pd.to_datetime(inp)
100 loops, best of 3: 4.77 ms per loop

# after the PR
%timeit pd.DatetimeIndex(inp)
#100 loops, best of 3: 4.23 ms per loop
%timeit pd.to_datetime(inp)
#100 loops, best of 3: 4.25 ms per loop
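
For anyone without IPython at hand, the comparison above can be approximated with the standard timeit module. This is only a rough sketch: absolute numbers depend on the machine and the pandas version, and the variable names are mine, not part of the PR.

```python
import timeit

import numpy as np
import pandas as pd

# Same shape of input as the benchmark above: 10,000 identical date strings.
inp = np.array(['2011-01-01 09:00' for _ in range(10000)])

# Run each constructor a few times and keep the best run, as %timeit does.
dti_best = min(timeit.repeat(lambda: pd.DatetimeIndex(inp), number=1, repeat=3))
to_dt_best = min(timeit.repeat(lambda: pd.to_datetime(inp), number=1, repeat=3))

print(f"DatetimeIndex: {dti_best * 1e3:.2f} ms")
print(f"to_datetime:   {to_dt_best * 1e3:.2f} ms")

# Whatever the timings are, both paths should produce the same index.
same_result = pd.DatetimeIndex(inp).equals(pd.to_datetime(inp))
print(same_result)
```

With the slow path removed, the two timings should land in the same range.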

Bug Fixes

The cleanup fixes these two kinds of issues:

1. #11169 and #11287: invalid string parsing may raise TypeError

I hit the same issue on Travis and fixed it with a try-except clause (I can't reproduce it on my local Mac).

2. Index may incorrectly coerce a mismatched tz

On current master, DatetimeIndex and a plain Index behave differently.

# OK
pd.DatetimeIndex([pd.Timestamp('2011-01-01', tz='US/Eastern')], tz='US/Pacific')
# TypeError: Already tz-aware, use tz_convert to convert.

# NG: it ignores the mismatch and coerces to the passed tz
pd.Index([pd.Timestamp('2011-01-01', tz='US/Eastern')], tz='US/Pacific')
DatetimeIndex(['2010-12-31 21:00:00-08:00'], dtype='datetime64[ns, US/Pacific]', freq=None)

After the PR, both behave the same and raise an understandable error.

pd.Index([pd.Timestamp('2011-01-01', tz='US/Eastern')], tz='US/Pacific')
# TypeError: data is already tz-aware US/Eastern, unable to set specified tz: US/Pacific

pd.DatetimeIndex([pd.Timestamp('2011-01-01', tz='US/Eastern')], tz='US/Pacific')
# TypeError: data is already tz-aware US/Eastern, unable to set specified tz: US/Pacific
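
If converting the data to another zone is what is actually wanted, the explicit route is tz_convert, as the error message on master suggests. A minimal sketch follows; the variable names are illustrative, and whether the constructor itself raises on a mismatch may differ between pandas versions.

```python
import pandas as pd

# A tz-aware index in US/Eastern, built without passing a second tz.
eastern = pd.DatetimeIndex([pd.Timestamp('2011-01-01', tz='US/Eastern')])

# Explicit conversion: the same instant, expressed in US/Pacific.
pacific = eastern.tz_convert('US/Pacific')
print(pacific)

# The wall-clock time changes, the underlying instant does not.
same_instant = eastern[0] == pacific[0]
print(same_instant)
```

This makes the conversion visible at the call site instead of relying on the constructor to coerce silently.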

@sinhrks sinhrks added Datetime Datetime data dtype Performance Memory or execution speed performance Timezones Timezone data dtype labels Jul 18, 2016
@sinhrks sinhrks added this to the 0.19.0 milestone Jul 18, 2016
@@ -1046,7 +1046,12 @@ def _get_binner_for_grouping(self, obj):
    l = []
    for key, group in grouper.get_iterator(self.ax):
        l.extend([key] * len(group))
    grouper = binner.__class__(l, freq=binner.freq, name=binner.name)

Contributor

Maybe isolate this a bit with some functions, like _get_binner_for_resample does. (Maybe we should abstract this out even further and have some PeriodIndexGrouper and DatetimeIndexGrouper classes that subclass TimeGrouper, but this might require some effort.)

@codecov-io

codecov-io commented Jul 18, 2016

Current coverage is 84.54%

Merging #13692 into master will decrease coverage by <.01%

@@             master     #13692   diff @@
==========================================
  Files           141        141          
  Lines         51185      51159    -26   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
- Hits          43275      43250    -25   
+ Misses         7910       7909     -1   
  Partials          0          0          

Powered by Codecov. Last updated by 506520b...8774772

if not (is_datetime64_dtype(data) or is_datetimetz(data) or
        is_integer_dtype(data)):
    data = tools.to_datetime(data, dayfirst=dayfirst,
                             yearfirst=yearfirst)

if issubclass(data.dtype.type, np.datetime64) or is_datetimetz(data):
Contributor

any reason not to use is_datetime64_dtype here?

Member Author

In addition to normal datetime64, datetime64tz (DatetimeIndex) and int can be directly converted to DatetimeIndex.
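
To illustrate which inputs skip the to_datetime step, here is a rough sketch using only the public pandas.api.types helpers. The function needs_to_datetime is hypothetical and is not the code in this PR; it uses is_datetime64_any_dtype to cover both the naive and the tz-aware case in one check.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_integer_dtype


def needs_to_datetime(data):
    """Hypothetical helper: True if `data` would have to be parsed first."""
    # Datetime-like (naive or tz-aware) and integer data convert directly.
    return not (is_datetime64_any_dtype(data) or is_integer_dtype(data))


strings = np.array(['2011-01-01 09:00', '2011-01-02 09:00'])
naive = np.array(['2011-01-01'], dtype='datetime64[ns]')
aware = pd.DatetimeIndex(['2011-01-01'], tz='US/Eastern')
ints = np.array([0, 1], dtype='int64')

for name, data in [('strings', strings), ('naive', naive),
                   ('aware', aware), ('ints', ints)]:
    print(name, needs_to_datetime(data))
```

Only the string input is routed through parsing here; the other three are treated as already convertible.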

@jreback
Contributor

jreback commented Jul 18, 2016

@sinhrks this cleans up lots of junk!

just a couple of clarifications.

@jreback
Contributor

jreback commented Jul 19, 2016

lgtm. ready to go?

@sinhrks
Member Author

sinhrks commented Jul 19, 2016

yes, it's ready:)
