DOC: Added note to io.rst regarding reading in mixed dtypes #13782

Closed
wants to merge 5 commits into from
2 changes: 2 additions & 0 deletions doc/source/basics.rst
@@ -1751,6 +1751,8 @@ Convert a subset of columns to a specified type using :meth:`~DataFrame.astype`
dft.loc[:, ['a', 'b']] = dft.loc[:, ['a', 'b']].astype(np.uint8)
dft.dtypes

.. _basics.object_conversion:

object conversion
~~~~~~~~~~~~~~~~~

60 changes: 60 additions & 0 deletions doc/source/io.rst
@@ -435,11 +435,71 @@ individual columns:
df = pd.read_csv(StringIO(data), dtype={'b': object, 'c': np.float64})
df.dtypes

Fortunately, ``pandas`` offers more than one way to ensure that your column(s)
contain only one ``dtype``. If you're unfamiliar with these concepts, see
:ref:`here<basics.dtypes>` to learn more about dtypes, and
:ref:`here<basics.object_conversion>` to learn more about ``object`` conversion in
``pandas``.

Contributor review comment: add in here a reference to basics.dtypes. Learn more about dtypes here.


For instance, you can use the ``converters`` argument
of :func:`~pandas.read_csv`:

.. ipython:: python

   data = "col_1\n1\n2\n'A'\n4.22"
   df = pd.read_csv(StringIO(data), converters={'col_1': str})
   df
   df['col_1'].apply(type).value_counts()

Or you can use the :func:`~pandas.to_numeric` function to coerce the
dtypes after reading in the data,

.. ipython:: python

   df2 = pd.read_csv(StringIO(data))
   df2['col_1'] = pd.to_numeric(df2['col_1'], errors='coerce')
   df2
   df2['col_1'].apply(type).value_counts()

which converts every value that can be parsed as a number to a float and
leaves the values that cannot be parsed as ``NaN``.

Ultimately, how you deal with reading in columns containing mixed dtypes
depends on your specific needs. In the case above, if you want to ``NaN`` out
the data anomalies, then :func:`~pandas.to_numeric` is probably your best option.
However, if you want all the data to be coerced to a single type, whatever the
original values, then the ``converters`` argument of :func:`~pandas.read_csv`
is worth trying.
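Before choosing between the two approaches, it can help to see which entries :func:`~pandas.to_numeric` would turn into ``NaN``. The following is a minimal sketch that reuses the ``data`` string from the examples above; the names ``raw``, ``numeric``, and ``anomalies`` are illustrative only and are not part of this PR.

```python
from io import StringIO

import pandas as pd

data = "col_1\n1\n2\n'A'\n4.22"

# Read every value as a string so that nothing is lost during parsing.
raw = pd.read_csv(StringIO(data), converters={'col_1': str})

# Coerce to numbers; entries that cannot be parsed become NaN.
numeric = pd.to_numeric(raw['col_1'], errors='coerce')

# Select the original strings that failed numeric conversion.
anomalies = raw.loc[numeric.isna(), 'col_1']
print(anomalies.tolist())
```

If ``anomalies`` turns out to be empty, the column can safely be treated as numeric; otherwise the listed strings show what would be discarded by coercion.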

.. note::
   The ``dtype`` option is currently only supported by the C engine.
   Specifying ``dtype`` with an ``engine`` other than 'c' raises a
   ``ValueError``.

.. note::
   In some cases, reading in abnormal data with columns containing mixed dtypes
   will result in an inconsistent dataset. If you rely on pandas to infer the
   dtypes of your columns, the parsing engine infers the dtypes for
   different chunks of the data, rather than for the whole dataset at once.
   Consequently, you can end up with column(s) with mixed dtypes. For example,

   .. ipython:: python
      :okwarning:

      col_1 = list(range(500000)) + ['a', 'b'] + list(range(500000))
      df = pd.DataFrame({'col_1': col_1})
      df.to_csv('foo')
      mixed_df = pd.read_csv('foo')
      mixed_df['col_1'].apply(type).value_counts()
      mixed_df['col_1'].dtype

   will result in ``mixed_df`` containing ``int`` values for certain chunks
   of the column and ``str`` values for others, because the data that was read
   in had mixed dtypes. Note that the overall column will be marked with a
   ``dtype`` of ``object``, which is used for columns with mixed dtypes.

Contributor review comment, quoting "due to a problem during parsing.": explain what you actually mean here. There wasn't a problem, but the data instead had mixed dtypes.
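One way to sidestep the chunk-wise inference described in the note is to state the dtype at read time instead of leaving it to the parser. The sketch below is an illustration under assumptions, not part of this PR: it uses a small in-memory stand-in for the large file, and the names ``values``, ``csv_text``, and ``as_str`` are made up for the example.

```python
from io import StringIO

import pandas as pd

# Small stand-in for the large file: integers with two stray strings.
values = list(range(1000)) + ['a', 'b'] + list(range(1000))
csv_text = "col_1\n" + "\n".join(str(v) for v in values)

# Declare the dtype up front so every value is read as a string,
# regardless of how the parser splits the input into chunks.
as_str = pd.read_csv(StringIO(csv_text), dtype={'col_1': str})

# Every element now has the same Python type.
print(as_str['col_1'].apply(type).value_counts())
```

``read_csv`` also accepts ``low_memory=False``, which parses the file in one pass rather than in chunks, at the cost of higher memory use while parsing.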



Naming and Using Columns
''''''''''''''''''''''''

2 changes: 2 additions & 0 deletions doc/source/whatsnew/v0.19.0.txt
@@ -323,6 +323,8 @@ Other enhancements
index=['row1', 'row2'])
df.sort_values(by='row2', axis=1)

- Added documentation to :ref:`I/O<io.dtypes>` regarding the perils of reading in columns with mixed dtypes and how to handle it (:issue:`13746`)

.. _whatsnew_0190.api:

