Commit 8112ad5

DOC: Shortened note, moved alternatives to main text
1 parent ba4c2ce commit 8112ad5

File tree

1 file changed (+39, -43 lines)


doc/source/io.rst

Lines changed: 39 additions & 43 deletions
@@ -435,18 +435,48 @@ individual columns:
    df = pd.read_csv(StringIO(data), dtype={'b': object, 'c': np.float64})
    df.dtypes
 
+Fortunately, ``pandas`` offers more than one way to ensure that your column(s)
+contain only one ``dtype``. For instance, you can use the ``converters`` argument
+of :func:`~pandas.read_csv`:
+
+.. ipython:: python
+
+   data = "col_1\n1\n2\n'A'\n4.22"
+   df = pd.read_csv(StringIO(data), converters={'col_1':str})
+   df
+   df['col_1'].apply(type).value_counts()
+
+Or you can use the :func:`~pandas.to_numeric` function to coerce the
+dtypes after reading in the data,
+
+.. ipython:: python
+
+   df2 = pd.read_csv(StringIO(data))
+   df2['col_1'] = pd.to_numeric(df2['col_1'], errors='coerce')
+   df2
+   df2['col_1'].apply(type).value_counts()
+
+which would convert all valid parsing to floats, leaving the invalid parsing
+as ``NaN``.
+
+Ultimately, how you deal with reading in columns containing mixed dtypes
+depends on your specific needs. In the case above, if you wanted to ``NaN`` out
+the data anomalies, then :func:`~pandas.to_numeric` is probably your best option.
+However, if you wanted for all the data to be coerced, no matter the type, then
+using the ``converters`` argument of :func:`~pandas.read_csv` would certainly be
+worth trying.
+
 .. note::
    The ``dtype`` option is currently only supported by the C engine.
    Specifying ``dtype`` with ``engine`` other than 'c' raises a
    ``ValueError``.
 
 .. note::
-
-   Reading in data with columns containing mixed dtypes and relying
-   on ``pandas`` to infer them is not recommended. In doing so, the
-   parsing engine will infer the dtypes for different chunks of the data,
-   rather than the whole dataset at once. Consequently, you can end up with
-   column(s) with mixed dtypes. For example,
+   In some cases, reading in abnormal data with columns containing mixed dtypes
+   will result in an inconsistent dataset. If you rely on pandas to infer the
+   dtypes of your columns, the parsing engine will go and infer the dtypes for
+   different chunks of the data, rather than the whole dataset at once. Consequently,
+   you can end up with column(s) with mixed dtypes. For example,
 
 .. ipython:: python
    :okwarning:
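The ``converters`` route added in this hunk can be sanity-checked outside the doc build; this is a minimal standalone sketch (plain Python, assuming only pandas is installed) mirroring the ipython block above:

```python
from io import StringIO

import pandas as pd

# Same mixed-type input as the doc example: ints, a quoted string,
# and a float all land in one column.
data = "col_1\n1\n2\n'A'\n4.22"

# converters routes every raw value in col_1 through str, so the
# resulting column holds a single Python type instead of a mix.
df = pd.read_csv(StringIO(data), converters={'col_1': str})
print(df['col_1'].apply(type).value_counts())
```

Note that the default quotechar is ``"``, so the single-quoted ``'A'`` arrives verbatim, quotes included.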
@@ -458,45 +488,11 @@ individual columns:
    mixed_df['col_1'].dtype
 
 will result with `mixed_df` containing an ``int`` dtype for certain chunks
-of the column, and ``str`` for others due to a problem during parsing.
-It is important to note that the overall column will be marked with a
-``dtype`` of ``object``, which is used for columns with mixed dtypes.
-
-Fortunately, ``pandas`` offers a few ways to ensure that the column(s)
-contain only one ``dtype``. For instance, you could use the ``converters``
-argument of :func:`~pandas.read_csv`
-
-.. ipython:: python
-
-   fixed_df1 = pd.read_csv('foo', converters={'col_1':str})
-   fixed_df1['col_1'].apply(type).value_counts()
-
-Or you could use the :func:`~pandas.to_numeric` function to coerce the
-dtypes after reading in the data,
-
-.. ipython:: python
-   :okwarning:
-
-   fixed_df2 = pd.read_csv('foo')
-   fixed_df2['col_1'] = pd.to_numeric(fixed_df2['col_1'], errors='coerce')
-   fixed_df2['col_1'].apply(type).value_counts()
-
-which would convert all valid parsing to floats, leaving the invalid parsing
-as ``NaN``.
-
-Alternatively, you could set the ``low_memory`` argument of :func:`~pandas.read_csv`
-to ``False``. Such as,
-
-.. ipython:: python
+of the column, and ``str`` for others due to the mixed dtypes from the
+data that was read in. It is important to note that the overall column will be
+marked with a ``dtype`` of ``object``, which is used for columns with mixed dtypes.
 
-   fixed_df3 = pd.read_csv('foo', low_memory=False)
-   fixed_df3['col_1'].apply(type).value_counts()
 
-Ultimately, how you deal with reading in columns containing mixed dtypes
-depends on your specific needs. In the case above, if you wanted to ``NaN`` out
-the data anomalies, then :func:`~pandas.to_numeric` is probably your best option.
-However, if you wanted for all the data to be coerced, no matter the type, then
-using the ``converters`` argument of :func:`~pandas.read_csv` would certainly work.
 
 Naming and Using Columns
 ''''''''''''''''''''''''
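The :func:`~pandas.to_numeric` route described in the diff can likewise be checked in isolation; this sketch (again assuming only pandas, and using the same sample data as the new text rather than the removed ``'foo'`` file) shows the unparseable entry coerced to ``NaN`` while the rest become floats:

```python
from io import StringIO

import pandas as pd

data = "col_1\n1\n2\n'A'\n4.22"

# Without dtype hints the column comes back as object; to_numeric
# with errors='coerce' turns parseable values into floats and the
# unparseable "'A'" into NaN.
df2 = pd.read_csv(StringIO(data))
df2['col_1'] = pd.to_numeric(df2['col_1'], errors='coerce')
print(df2['col_1'])
```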
