DataFrame.to_records dtype shouldn't use unicode for every column #16358

Closed
@TomAugspurger

Description

Code Sample, a copy-pastable example if possible

As of 0.20, DataFrame.to_records uses the unicode type for all of the dtype identifiers on Python 2.

In [36]: pd.DataFrame({u'c/\u03c3': [1, 2], 'c/s': [3, 4]}).to_records()
Out[36]:
rec.array([(0, 3, 1), (1, 4, 2)],
          dtype=[(u'index', '<i8'), (u'c/s', '<i8'), (u'c/\u03c3', '<i8')])

This caused some issues for statsmodels, since they pass to_records().dtype to np.dtype, which doesn't accept unicode identifiers on Python 2 (statsmodels/statsmodels#3658 (comment)).

I think the correct behavior is to just use whatever type the user has for each name. So the output from above should be

In [36]: pd.DataFrame({u'c/\u03c3': [1, 2], 'c/s': [3, 4]}).to_records()
Out[36]:
rec.array([(0, 3, 1), (1, 4, 2)],
          dtype=[('index', '<i8'), ('c/s', '<i8'), (u'c/\u03c3', '<i8')])

so the python2 str column (which is actually bytes) should just be 'c/s', not u'c/s'.

The thing pandas has to decide is how to handle

  1. the default 'index' when df.index.name is None
  2. non-string columns like numbers

I think the least-surprising option there is to use str(), so on py2 the result will be bytes, and on py3 it will be unicode. Not sure if it will cause problems elsewhere though.
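A rough sketch of that rule in plain Python, to make the two cases concrete. This is not the pandas implementation: `record_field_names` is a hypothetical helper invented for illustration, and it assumes the module does not use `from __future__ import unicode_literals` (otherwise the `'index'` literal would itself be unicode on py2).

```python
def record_field_names(columns, index_name=None):
    """Hypothetical helper: choose dtype field names for to_records-style output.

    String names pass through unchanged, so the user's own type is kept
    (bytes-str or unicode on Python 2, str on Python 3). Anything else
    goes through str().
    """
    # Python 2 has two string types; Python 3 has no `unicode` name.
    try:
        string_types = (str, unicode)  # noqa: F821 (only defined on Python 2)
    except NameError:
        string_types = (str,)

    def normalize(name):
        if isinstance(name, string_types):
            return name  # keep whatever the user gave us
        return str(name)  # native str: bytes on py2, unicode on py3

    # Case 1: an unnamed index falls back to the native-str literal 'index'.
    index_field = normalize('index' if index_name is None else index_name)
    # Case 2: non-string column labels such as numbers are stringified.
    return [index_field] + [normalize(col) for col in columns]


names = record_field_names(['c/s', u'c/\u03c3', 0, 1.5])
print(names)
```

With this rule, `'c/s'` stays a native str, `u'c/\u03c3'` stays unicode, and only the default index name and the numeric labels are converted.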

xref #13462 and #11879

cc @AlexisMignon
