"IndexError: tuple index out of range" error with numpy array containing datetimes #15869


Closed
richard-bibb opened this issue Apr 2, 2017 · 6 comments
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Milestone

Comments

@richard-bibb

Code Sample, a copy-pastable example if possible

>>> from datetime import datetime as dt
>>> import numpy as np
>>> from pandas import DataFrame
>>> d=np.array([None, None, None, None, dt.now(), None])
>>> b = DataFrame(d)

Problem description

The above code works in an old version of pandas (0.7.3) but fails with "IndexError: tuple index out of range" in the current version (0.19.2).

Expected Output

No error, and a DataFrame containing the numpy date array.
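As a possible workaround on affected versions (a sketch, not verified against 0.19.2 specifically), converting the array explicitly before construction sidesteps the object-dtype inference path that triggers the error:

```python
from datetime import datetime as dt

import numpy as np
import pandas as pd

d = np.array([None, None, None, None, dt.now(), None])

# Coerce explicitly to datetime64; each None becomes NaT
b = pd.DataFrame(pd.to_datetime(d))
print(b.dtypes.iloc[0])
```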

Output of pd.show_versions()

>>> pandas.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 26 Stepping 5, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 20.10.1
Cython: None
numpy: 1.12.1
scipy: None
statsmodels: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None

@jreback jreback added Bug Difficulty Intermediate Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Apr 2, 2017
@jreback jreback added this to the Next Minor Release milestone Apr 2, 2017
@chrisaycock
Contributor

Interestingly, this works:

pd.DataFrame(np.array([None, None, dt.now(), None]))

But this does not:

pd.DataFrame(np.array([None, None, None, dt.now(), None]))

NumPy doesn't seem to handle the two input arrays any differently, but create_block_manager_from_blocks() definitely receives different blocks. The first is

[array([['NaT', 'NaT', '2017-04-04T13:54:03.236544000', 'NaT']], dtype='datetime64[ns]')]

while the second is

[array([None, None, None, datetime.datetime(2017, 4, 4, 13, 54, 10, 804563),
       None], dtype=object)]

I'm going to dig some more into this.
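For reference, the length dependence can be checked directly without looking at the block manager; on a version with the fix, both shapes go through the same inference path, so the resulting dtypes should agree (a quick consistency check, not the internal view):

```python
from datetime import datetime as dt

import numpy as np
import pandas as pd

short = pd.DataFrame(np.array([None, None, dt.now(), None]))
longer = pd.DataFrame(np.array([None, None, None, dt.now(), None]))

# Whatever dtype inference decides, it should not depend on array length
print(short.dtypes.iloc[0], longer.dtypes.iloc[0])
```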

@jreback
Contributor

jreback commented Apr 4, 2017

@chrisaycock here's why.

https://github.com/pandas-dev/pandas/blob/master/pandas/types/cast.py#L810

When we have a 1-d object array we want to see whether it is datetimelike or not. So I did this old thing where I would sample the first 3 elements. This is bogus now, as we can simply do this (we also need to test for timedelta as well):

In [1]: import datetime

In [2]: import numpy as np

In [3]: from pandas._libs.lib import is_datetime_array

In [4]: is_datetime_array(np.array([None, None, datetime.datetime.now()]))
Out[4]: True

In [5]: is_datetime_array(np.array([None, None, None, datetime.datetime.now()]))
Out[5]: True

It might be better to simply modify is_possible_datetimelike_array as an alternative (it is called below as well).

These routines are robust to null values, check for the requested kind (datetime- or timedelta-like), and don't scan the entire array if it is just strings (which is the point here: we are trying to see, without doing too much work, whether we can coerce this). This is called for every object array on construction, so it needs to be cheap.

This may seem overly paranoid / extreme, but remember that we can actually have mixed dtypes that get coerced, e.g.:

In [4]: Series(['NaT', None, pd.Timestamp('20130101'), datetime.datetime.now(), '20150101'])
Out[4]: 
0                          NaT
1                          NaT
2   2013-01-01 00:00:00.000000
3   2017-04-04 14:07:52.743198
4   2015-01-01 00:00:00.000000
dtype: datetime64[ns]

but we need it to be robust to non-matching types:

In [6]: Series(['NaT', None, pd.Timestamp('20130101'), datetime.datetime.now(), '20150101', np.nan, pd.Timedelta('1 day')])
Out[6]: 
0                           NaT
1                          None
2           2013-01-01 00:00:00
3    2017-04-04 14:08:34.334113
4                      20150101
5                           NaN
6               1 days 00:00:00
dtype: object

@chrisaycock
Contributor

chrisaycock commented Apr 4, 2017

@jreback Agreed that the culprit is the "quick inference" on the first three elements.

Is your proposal to modify maybe_infer_to_datetimelike() to use is_possible_datetimelike_array() instead of the logic that begins with the "sample"? Are there other gotchas given all of the conditionals below it?

The alternative is to infer_dtype() the entire array instead of the first three elements. Would the performance impact be just too much then?
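For reference, the full-array inference mentioned above is exposed publicly in recent pandas versions as pandas.api.types.infer_dtype; a small sketch of what it reports on the inputs from this issue (assuming a current pandas, not 0.19.x):

```python
from datetime import datetime, timedelta

import numpy as np
from pandas.api.types import infer_dtype

# Nulls are skipped, so leading Nones don't hide the datetime
print(infer_dtype(np.array([None, None, None, datetime.now()]), skipna=True))

# A mix of datetime-like and timedelta-like values falls back to "mixed"
print(infer_dtype(np.array([datetime.now(), timedelta(days=1)], dtype=object),
                  skipna=True))
```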

@jreback
Contributor

jreback commented Apr 4, 2017

I think maybe the best way is to trash all of the logic for the sample and below,

then modify is_possible_datetimelike_array to return 3 states (a string is prob fine):

  • datetime
  • timedelta
  • mixed

Then you know whether you should attempt to_datetime or to_timedelta, or are done.

It short-circuits so it should be performant, though maybe inside that routine you can take the first 3 valid points (non-null) and actually try to convert them; if that fails then you are done.

The issue is to try not to iterate over (and convert) all strings.
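The three-state idea could be sketched roughly like this (a hypothetical standalone helper to illustrate the classification, not the actual pandas internal; the name and signature are made up):

```python
from datetime import datetime, timedelta

def classify_datetimelike(values, sample_size=3):
    """Probe the first few non-null elements and report 'datetime',
    'timedelta', or 'mixed'.  Short-circuits as soon as two different
    kinds are seen, so it stays cheap on large arrays."""
    seen = set()
    checked = 0
    for v in values:
        if v is None:
            continue  # nulls don't count toward the sample
        if isinstance(v, timedelta):
            seen.add("timedelta")
        elif isinstance(v, datetime):
            seen.add("datetime")
        else:
            seen.add("other")
        if len(seen) > 1:
            return "mixed"  # two kinds seen: no point scanning further
        checked += 1
        if checked >= sample_size:
            break
    if seen == {"datetime"}:
        return "datetime"
    if seen == {"timedelta"}:
        return "timedelta"
    return "mixed"  # strings, all-null, or anything else
```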

@jreback
Contributor

jreback commented Apr 4, 2017

@chrisaycock this was a can of worms, see #15892

jreback added a commit to jreback/pandas that referenced this issue Apr 4, 2017
@jreback jreback closed this as completed in e0b60c0 Apr 4, 2017
@richard-bibb
Author

Thanks for the quick fix, guys!
