Code Sample, a copy-pastable example if possible
As of 0.20, DataFrame.to_records will use the unicode type for all the dtype identifiers on python 2.
In [36]: pd.DataFrame({u'c/\u03c3': [1, 2], 'c/s': [3, 4]}).to_records()
Out[36]:
rec.array([(0, 3, 1), (1, 4, 2)],
dtype=[(u'index', '<i8'), (u'c/s', '<i8'), (u'c/\u03c3', '<i8')])
This caused some issues for statsmodels, since they go to_records().dtype -> np.dtype, which doesn't like unicode identifiers on python2 (statsmodels/statsmodels#3658 (comment)).
I think the correct behavior is to just use whatever the user has. So the output from above should be
In [36]: pd.DataFrame({u'c/\u03c3': [1, 2], 'c/s': [3, 4]}).to_records()
Out[36]:
rec.array([(0, 3, 1), (1, 4, 2)],
dtype=[('index', '<i8'), ('c/s', '<i8'), (u'c/\u03c3', '<i8')])
so the python2 str column (which is actually bytes) should just be 'c/s', not u'c/s'.
The thing pandas has to decide is how to handle
- the default 'index' when df.index.name is None
- non-string columns like numbers
I think the least-surprising option there is to use str(), so on py2 that will be bytes, and on py3 it will be unicode. Not sure if it will cause problems elsewhere though.
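
To make the str() proposal concrete, here is a rough sketch. It is not the pandas implementation: the helper name field_names and its signature are hypothetical. It only shows the rule being suggested: leave string labels exactly as the user wrote them, and run the default 'index' and any non-string label through str().

```python
def field_names(index_name, columns):
    """Hypothetical helper: choose record field names for an index and columns."""
    # Native str, bytes, and unicode text all count as "already a string"
    # (on py3, str and type(u"") are the same type).
    string_types = (str, bytes, type(u""))

    if index_name is None:
        # Default index name via str(): bytes on py2, unicode on py3.
        index_name = str("index")

    labels = [index_name] + list(columns)
    # Keep user-supplied strings untouched; convert everything else with str().
    return [lab if isinstance(lab, string_types) else str(lab) for lab in labels]


# A unicode label, a native-str label, and a numeric label.
names = field_names(None, [u"c/\u03c3", "c/s", 0])
```

With this rule the u'c/\u03c3' label stays unicode, 'c/s' stays whatever the native str is, and 0 becomes str(0). Whether the resulting names are all acceptable to np.dtype on py2 is exactly the open question above.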