ERR: raise in tz incompat with DatetimeIndex ops #8306


Closed
3 tasks
jreback opened this issue Sep 18, 2014 · 13 comments · Fixed by #21491
Labels: API Design · Datetime (Datetime data dtype) · Error Reporting (Incorrect or improved errors from pandas) · Timezones (Timezone data dtype)

@jreback
Contributor

jreback commented Sep 18, 2014

xref #6552

idx = pd.date_range(..., tz='UTC')
idx2 = idx.reindex(pd.date_range(..., tz='US/Pacific'))
  • subtraction of datetime-like
  • intersection / union / difference
  • reindex
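
For concreteness, a runnable version of the setup above (the dates and lengths are arbitrary placeholders for the elided arguments; whether each line raises depends on the pandas version, which is exactly what this issue is about):

import pandas as pd

idx = pd.date_range('2014-01-01', periods=5, tz='UTC')
other = pd.date_range('2014-01-01', periods=5, tz='US/Pacific')

idx - other              # subtraction of datetime-like
idx.intersection(other)  # intersection / union / difference
idx.reindex(other)       # reindex against a differently-localized index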
@jreback
Contributor Author

jreback commented Sep 18, 2014

cc @immerrr

@jreback jreback added the API Design, Error Reporting, Timezones and Datetime labels Sep 18, 2014
@jreback jreback added this to the 0.15.1 milestone Sep 18, 2014
@immerrr
Contributor

immerrr commented Sep 18, 2014

I think the point of raising in such situations is to prevent guessing in the face of ambiguity.

There's an ambiguity when you do pretty much any binary operation between a naive and an aware timestamp: the naive one could correspond to any of roughly 40 different wall-clock instants and it's unclear which one is meant, so this should be reported right away, just as vanilla Python datetime does.
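
(For reference, the stdlib behavior being invoked here: mixing naive and aware datetimes raises immediately. Python 3's datetime.timezone is used below just for brevity.)

from datetime import datetime, timezone

naive = datetime(2014, 9, 18, 12, 0)
aware = datetime(2014, 9, 18, 12, 0, tzinfo=timezone.utc)

try:
    aware - naive            # comparisons behave the same way
except TypeError as exc:
    print(exc)               # can't subtract offset-naive and offset-aware datetimes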

I have doubts about "subtraction of datetime-like" though, as there's no ambiguity in subtracting two aware timestamps from different timezones. Raising could also be very annoying because, AFAIR, a DST transition effectively changes the offset, so subtracting a properly localized summer timestamp from a winter one would then raise.
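
(And the unambiguous case: two aware datetimes in different timezones denote exact instants, so the stdlib subtracts them without complaint.)

from datetime import datetime, timezone, timedelta

utc_noon = datetime(2014, 9, 18, 12, 0, tzinfo=timezone.utc)
cest_two = datetime(2014, 9, 18, 14, 0, tzinfo=timezone(timedelta(hours=2)))

print(utc_noon - cest_two)   # 0:00:00 -- same instant, different wall clocks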

"intersection / union / difference" is ambiguous w.r.t what should be the tz of the result, seems ok to raise.

As for "reindex", when I'm doing output, reindexer = index.reindex(new_index), I expect output to contain the same elements as in new_index which in case of tz-aware dt-index includes the timezone, so I don't see anything ambiguous here and the operation feels fine.

@jreback
Contributor Author

jreback commented Sep 18, 2014

The reason to raise on subtraction is that it's almost always unintended behavior; the user should have both sides in the same timezone. Yes, you can auto-convert, but I just think that's asking for trouble. Reindexing is the same thing: yes, it can work (and may even be unambiguous, setting transition times aside for a second), but it is almost never what you actually want and is likely an error.

cc @rockg
cc @sinhrks
cc @ischwabacher

@rockg
Contributor

rockg commented Sep 18, 2014

I agree that always forcing the user to convert explicitly before subtraction makes the most sense. Trying to guess what the user intends is a frustrating endeavor for both the programmer and the user.

For reindex, if the indices are the same size then I think it's fine, as it is essentially a replacement. But if you are using a method to fill in missing values, it gets hairy. Maybe we need to look at a couple of examples to get on the same page.

@ischwabacher
Contributor

My opinion made a complete about-face while I was writing this comment. I now think the reindex question is morally equivalent to this:

In [1]: import pandas as pd

In [2]: pd.Series(1, index=[1, 2, 3])
Out[2]: 
1    1
2    1
3    1
dtype: int64

In [3]: _.index
Out[3]: Int64Index([1, 2, 3], dtype='int64')

In [4]: _2.reindex([1., 2., 3.])
Out[4]: 
1    1
2    1
3    1
dtype: int64

In [5]: _.index
Out[5]: Float64Index([1.0, 2.0, 3.0], dtype='float64')

It's like any other coercion of one value to an equal value of a different type. Naive Timestamps are, of course, unequal to aware Timestamps, and the only contact between them should be via tz_localize (or tz_forget, if we ever decide to create such a thing).
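
(A small illustration of that "explicit contact": localize the naive index before mixing it with aware data; the values and dates here are arbitrary.)

import pandas as pd

aware = pd.date_range('2014-09-18', periods=3, tz='UTC')
naive = pd.date_range('2014-09-18', periods=3)

s = pd.Series([10, 20, 30], index=aware)

# Explicit: state which wall clock the naive index is in, then reindex.
s.reindex(naive.tz_localize('UTC'))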

@rockg
Contributor

rockg commented Sep 18, 2014

tz_forget was just implemented as tz_localize(None). I agree the coercion is fine, but if you have a daily series in US/Eastern and you reindex with a UTC hourly index with method='pad', what is that supposed to do? Do you see changes in the new series at hour 4/5 (DST/not DST) or hour 0?
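
(For reference, the two ways of stripping a timezone differ in what they keep — a quick sketch, assuming current pandas semantics.)

import pandas as pd

ts = pd.Timestamp('2014-09-18 12:00', tz='US/Eastern')

ts.tz_localize(None)   # 2014-09-18 12:00:00 -- drop tz, keep the wall clock
ts.tz_convert(None)    # 2014-09-18 16:00:00 -- convert to UTC first, then drop tz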

@jreback
Contributor Author

jreback commented Sep 18, 2014

I am thinking we should simply raise if it's not a completely unambiguous conversion, e.g. if both/either are UTC/naive then it's ok, otherwise raise?

Or are you ok with the fact that reindex on an Index returns a tuple of (new_index, indexer)?

[9] 'works' but is a bit odd (this is on master; I guess the suggestion is to make [9] work like [10], but with tz='US/Eastern' in the result):

In [9]: date_range('20130101',periods=5,tz='US/Eastern').reindex(date_range('20130101',periods=3))
Out[9]: 
(<class 'pandas.tseries.index.DatetimeIndex'>
 [2013-01-01, ..., 2013-01-03]
 Length: 3, Freq: D, Timezone: None, array([-1, -1, -1]))

In [10]: date_range('20130101',periods=5).reindex(date_range('20130101',periods=3))
Out[10]: 
(<class 'pandas.tseries.index.DatetimeIndex'>
 [2013-01-01, ..., 2013-01-03]
 Length: 3, Freq: D, Timezone: None, array([0, 1, 2]))

@immerrr
Contributor

immerrr commented Sep 18, 2014

Speaking of metadata and DST transitions, what happens now if there's a date range that starts in summer time and ends in winter time?

@ischwabacher
Contributor

Right, I was even paying attention to the tz_convert(None) issue. Herp derp.

I think since a DatetimeIndex is conceptually an array of Timestamps, each of which represents a precise moment in time, this conversion is pretty straightforward. I approve of this current behavior:

In [1]: import pandas as pd

In [2]: idx = pd.date_range('20131101', tz='America/Chicago', periods=7)

In [3]: newidx = pd.date_range('20131103', tz='UTC', periods=36, freq='H')

In [4]: pd.Series(range(7), index=idx).reindex(newidx, method='ffill')
Out[4]: 
2013-11-03 00:00:00+00:00    1
2013-11-03 01:00:00+00:00    1
2013-11-03 02:00:00+00:00    1
2013-11-03 03:00:00+00:00    1
2013-11-03 04:00:00+00:00    1
2013-11-03 05:00:00+00:00    2  # midnight before DST ends, offset = -0500
2013-11-03 06:00:00+00:00    2
2013-11-03 07:00:00+00:00    2
2013-11-03 08:00:00+00:00    2
2013-11-03 09:00:00+00:00    2
2013-11-03 10:00:00+00:00    2
2013-11-03 11:00:00+00:00    2
2013-11-03 12:00:00+00:00    2
2013-11-03 13:00:00+00:00    2
2013-11-03 14:00:00+00:00    2
2013-11-03 15:00:00+00:00    2
2013-11-03 16:00:00+00:00    2
2013-11-03 17:00:00+00:00    2
2013-11-03 18:00:00+00:00    2
2013-11-03 19:00:00+00:00    2
2013-11-03 20:00:00+00:00    2
2013-11-03 21:00:00+00:00    2
2013-11-03 22:00:00+00:00    2
2013-11-03 23:00:00+00:00    2
2013-11-04 00:00:00+00:00    2
2013-11-04 01:00:00+00:00    2
2013-11-04 02:00:00+00:00    2
2013-11-04 03:00:00+00:00    2
2013-11-04 04:00:00+00:00    2
2013-11-04 05:00:00+00:00    2
2013-11-04 06:00:00+00:00    3  # midnight after DST ends, offset = -0600
2013-11-04 07:00:00+00:00    3
2013-11-04 08:00:00+00:00    3
2013-11-04 09:00:00+00:00    3
2013-11-04 10:00:00+00:00    3
2013-11-04 11:00:00+00:00    3
Freq: H, dtype: int64

and disapprove of this current behavior:

In [24]: pd.Series(range(7), index=idx).reindex(newidx.tz_convert(None), method='ffill')
Out[24]: # Should raise
2013-11-03 00:00:00    1
2013-11-03 01:00:00    1
2013-11-03 02:00:00    1
2013-11-03 03:00:00    1
2013-11-03 04:00:00    1
2013-11-03 05:00:00    2
2013-11-03 06:00:00    2
2013-11-03 07:00:00    2
2013-11-03 08:00:00    2
2013-11-03 09:00:00    2
2013-11-03 10:00:00    2
2013-11-03 11:00:00    2
2013-11-03 12:00:00    2
2013-11-03 13:00:00    2
2013-11-03 14:00:00    2
2013-11-03 15:00:00    2
2013-11-03 16:00:00    2
2013-11-03 17:00:00    2
2013-11-03 18:00:00    2
2013-11-03 19:00:00    2
2013-11-03 20:00:00    2
2013-11-03 21:00:00    2
2013-11-03 22:00:00    2
2013-11-03 23:00:00    2
2013-11-04 00:00:00    2
2013-11-04 01:00:00    2
2013-11-04 02:00:00    2
2013-11-04 03:00:00    2
2013-11-04 04:00:00    2
2013-11-04 05:00:00    2
2013-11-04 06:00:00    3
2013-11-04 07:00:00    3
2013-11-04 08:00:00    3
2013-11-04 09:00:00    3
2013-11-04 10:00:00    3
2013-11-04 11:00:00    3
Freq: H, dtype: int64

Also, if we were to have the values change at midnight UTC, there might be the undesirable possibility of repeated conversions going "around the world" and ending up off by a day.

@jreback
Contributor Author

jreback commented Sep 18, 2014

@ischwabacher interesting, so you are ok with reindexing a tz-aware series with an (explicitly) UTC one, but not with a naive one. Maybe, as with tz_convert/tz_localize, we should disallow mixing tz-aware with naive (and perhaps go even further and require them to be the SAME tz). That way the rule is simple:

  • if both have same tz or are both naive -> ok
  • else raise
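
(A hypothetical sketch of that rule as a standalone check; the helper name and signature are illustrative only, not pandas API.)

import pandas as pd

def check_tz_compat(left, right):
    # Proposed rule: identical tz (including both naive) is fine, else raise.
    if str(left.tz) != str(right.tz):
        raise TypeError('incompatible timezones: %r vs %r; '
                        'tz_convert/tz_localize explicitly before combining'
                        % (left.tz, right.tz))

check_tz_compat(pd.date_range('20130101', periods=3),
                pd.date_range('20130101', periods=3))        # ok: both naive
check_tz_compat(pd.date_range('20130101', periods=3, tz='UTC'),
                pd.date_range('20130101', periods=3))        # raises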

@ischwabacher
Contributor

My stance on naive times is that they are bad and they should feel bad. They correspond to an oversimplified version of reality, and they will necessarily cause headaches if they are mixed with time-zone-aware objects. These headaches should happen right away, not in several months when a DST transition occurs and everything breaks. The naive reindex from the example interprets naive times as UTC, but the user should have to make that explicit.

On the other hand, the whole point of time zone awareness is that aware times refer to a specific point in time, and can be compared even if they're not in the same time zone. Reindexing with a DatetimeIndex in a different time zone makes sense. Subtracting two aware times in different time zones makes sense. In these cases, there is a right thing to do, and we shouldn't make the user jump through hoops in order to do it.

Taking the union / intersection of collections of times in different time zones is more problematic, but I think the stdlib precedent should be good enough for us:

In [1]: {1} & {1.0}
Out[1]: {1.0}

In [2]: {1.0} & {1}
Out[2]: {1}

In [3]: {1} | {1.0}
Out[3]: {1}

In [4]: {1.0} | {1}
Out[4]: {1.0}

EDIT: Erm, numpy sets a conflicting but ultimately reasonable precedent:

In [1]: import numpy as np

In [2]: np.intersect1d(np.arange(3, dtype=int), np.arange(1, 4, dtype=float))
Out[2]: array([ 1.,  2.])

In [3]: np.intersect1d(np.arange(3, dtype=float), np.arange(1, 4, dtype=int))
Out[3]: array([ 1.,  2.])

In [4]: np.union1d(np.arange(3, dtype='i4'), np.arange(1, 4, dtype='i2'))
Out[4]: array([0, 1, 2, 3], dtype=int32)

In [5]: np.union1d(np.arange(3, dtype='i4'), np.arange(1, 4, dtype='i8'))
Out[5]: array([0, 1, 2, 3])

Conforming to this would mean converting all the times to UTC; conforming to the stdlib would mean taking either the first or the second time zone as appropriate.
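
(Illustrating the two conventions for a union of differently-localized indexes; this is a sketch of possible semantics, not a statement of what pandas currently does.)

import pandas as pd

left = pd.date_range('2014-09-18', periods=3, tz='US/Eastern')
right = pd.date_range('2014-09-18 04:00', periods=3, tz='UTC')

# numpy-style: promote both operands to a common "super" zone (UTC)
left.tz_convert('UTC').union(right)

# stdlib-set-style: keep the left operand's timezone in the result
left.union(right.tz_convert(left.tz))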

@immerrr
Contributor

immerrr commented Sep 19, 2014

@ischwabacher thank you, I found out something new today with those set examples :)

The numpy examples are interesting, too, but all of them exercise lossless casts (well, almost: float64 has only 53 significand bits), which follow a simple set of rules that are surprisingly unambiguous, as you can see for yourself by doing something like:

In [32]: for i, c1 in enumerate(np.typecodes['All']):
   ....:     for c2 in np.typecodes['All'][i+1:]:
   ....:         if c1 != c2 and np.can_cast(c1, c2) and np.can_cast(c2, c1):
   ....:             print '{d1}({c1})<=>{d2}({c2})'.format(c1=c1, d1=np.dtype(c1), c2=c2, d2=np.dtype(c2))
   ....:
int64(l)<=>int64(q)
int64(l)<=>int64(p)
int64(q)<=>int64(p)
uint64(L)<=>uint64(Q)
uint64(L)<=>uint64(P)
uint64(Q)<=>uint64(P)
|V0(V)<=>object(O)

In numpy only type aliases are mutually castable and the rest of the relations are one-sided, which departs somewhat from our case, because tz conversions are lossless in both directions and can thus be seen as ambiguous. Declaring UTC as a "superclass" might work, but it would come as a surprise to me at least, as if I put some euros and then some pounds into my pocket and they came out as dollars instead.

@ischwabacher
Contributor

thank you, I found out something new today with those set examples :)

So did I; I didn't know what was going to come out until I tried them. :)

I like your argument about there not being distinct mutually castable types, but I think it's really important not to have friction when handling time-zone-aware objects. Then again, most of the friction I've been experiencing has been caused by my local Timestamps getting coerced into UTC datetime64s...

If numpy disallows intercoercible types because it has a clever algorithm for computing output dtype from input dtype that doesn't work in the presence of cycles in the coercion graph, and we want to appropriate this algorithm, then that would be a pretty strong argument against allowing coercion across time zones, except possibly to UTC. But maybe it's not too surprising that when you put Lire, Francs and Marks together in your pocket, they come out as Euros?

@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@jreback jreback modified the milestones: Next Major Release, 0.23.2 Jun 15, 2018