From 335d043be20ef949a4dba63d74c02597a01efa66 Mon Sep 17 00:00:00 2001 From: wcwagner Date: Sun, 24 Jul 2016 22:31:49 -0400 Subject: [PATCH 1/5] DOC: Added note to io.rst regarding reading in mixed dtypes --- doc/source/io.rst | 39 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 39 insertions(+) diff --git a/doc/source/io.rst b/doc/source/io.rst index 86da2561a36be..79d7280dbb4fe 100644 --- a/doc/source/io.rst +++ b/doc/source/io.rst @@ -440,6 +440,45 @@ individual columns: Specifying ``dtype`` with ``engine`` other than 'c' raises a ``ValueError``. +.. note:: + + Reading in data with mixed dtypes and relying on ``pandas`` + to infer them is not recommended. In doing so, the parsing engine will + loop over all the dtypes, trying to convert them to an actual + type; if something breaks during that process, the engine will go to the + next ``dtype`` and the data is left modified in place. For example, + + .. ipython:: python + + from collections import Counter + df = pd.DataFrame({'col_1':range(500000) + ['a', 'b'] + range(500000)}) + df.to_csv('foo') + mixed_df = pd.read_csv('foo') + Counter(mixed_df['col_1'].apply(lambda x: type(x))) + + will result with `mixed_df` containing an ``int`` dtype for the first + 262,143 values, and ``str`` for others due to a problem during + parsing. Fortunately, ``pandas`` offers a few ways to ensure that the column(s) + contain only one ``dtype``. For instance, you could use the ``converters`` + argument of :func:`~pandas.read_csv` + + .. ipython:: python + + fixed_df1 = pd.read_csv('foo', converters={'col_1':str}) + Counter(fixed_df1['col_1'].apply(lambda x: type(x))) + + Or you could use the :func:`~pandas.to_numeric` function to coerce the + dtypes after reading in the data, + + .. 
ipython:: python + + fixed_df2 = pd.read_csv('foo') + fixed_df2['col_1'] = pd.to_numeric(fixed_df2['col_1'], errors='coerce') + Counter(fixed_df2['col_1'].apply(lambda x: type(x))) + + which would convert all valid parsing to ints, leaving the invalid parsing + as ``NaN``. + Naming and Using Columns '''''''''''''''''''''''' From b6e2b64607ee1268e4ed88b7a0fe36a51fa2f4c7 Mon Sep 17 00:00:00 2001 From: wcwagner Date: Mon, 25 Jul 2016 10:09:20 -0400 Subject: [PATCH 2/5] DOC: Switched Counter to value_counts, added low_memory alternative example, clarified type inference process --- doc/source/io.rst | 30 ++++++++++++++++++++++-------- 1 file changed, 22 insertions(+), 8 deletions(-) diff --git a/doc/source/io.rst b/doc/source/io.rst index 79d7280dbb4fe..59c1ea9384809 100644 --- a/doc/source/io.rst +++ b/doc/source/io.rst @@ -444,41 +444,55 @@ individual columns: Reading in data with mixed dtypes and relying on ``pandas`` to infer them is not recommended. In doing so, the parsing engine will - loop over all the dtypes, trying to convert them to an actual - type; if something breaks during that process, the engine will go to the - next ``dtype`` and the data is left modified in place. For example, + infer the dtypes for different chunks of the data, rather than the whole + dataset at once. Consequently, you can end up with column(s) with mixed + dtypes. For example, .. ipython:: python + :okwarning: - from collections import Counter df = pd.DataFrame({'col_1':range(500000) + ['a', 'b'] + range(500000)}) df.to_csv('foo') mixed_df = pd.read_csv('foo') - Counter(mixed_df['col_1'].apply(lambda x: type(x))) + mixed_df['col_1'].apply(lambda x: type(x)).value_counts() + mixed_df['col_1'].dtype will result with `mixed_df` containing an ``int`` dtype for the first 262,143 values, and ``str`` for others due to a problem during - parsing.
It is important to note that the overall column will be marked with a + ``dtype`` of ``object``, which is used for columns with mixed dtypes. + + Fortunately, ``pandas`` offers a few ways to ensure that the column(s) contain only one ``dtype``. For instance, you could use the ``converters`` argument of :func:`~pandas.read_csv` .. ipython:: python fixed_df1 = pd.read_csv('foo', converters={'col_1':str}) - Counter(fixed_df1['col_1'].apply(lambda x: type(x))) + fixed_df1['col_1'].apply(lambda x: type(x)).value_counts() Or you could use the :func:`~pandas.to_numeric` function to coerce the dtypes after reading in the data, .. ipython:: python + :okwarning: fixed_df2 = pd.read_csv('foo') fixed_df2['col_1'] = pd.to_numeric(fixed_df2['col_1'], errors='coerce') - Counter(fixed_df2['col_1'].apply(lambda x: type(x))) + fixed_df2['col_1'].apply(lambda x: type(x)).value_counts() which would convert all valid parsing to ints, leaving the invalid parsing as ``NaN``. + Alternatively, you could set the ``low_memory`` argument of :func:`~pandas.read_csv` + to ``False``. Such as, + + .. ipython:: python + + fixed_df3 = pd.read_csv('foo', low_memory=False) + fixed_df2['col_1'].apply(lambda x: type(x)).value_counts() + + which achieves a similar result. Naming and Using Columns '''''''''''''''''''''''' From ba4c2ced5a0544f6b5378d54b42e4cba5646825a Mon Sep 17 00:00:00 2001 From: wcwagner Date: Mon, 25 Jul 2016 20:06:31 -0400 Subject: [PATCH 3/5] DOC: Added short commentary on alternatives --- doc/source/io.rst | 33 +++++++++++++++++++-------------- 1 file changed, 19 insertions(+), 14 deletions(-) diff --git a/doc/source/io.rst b/doc/source/io.rst index 59c1ea9384809..6823bd05000fd 100644 --- a/doc/source/io.rst +++ b/doc/source/io.rst @@ -442,11 +442,11 @@ individual columns: .. note:: - Reading in data with mixed dtypes and relying on ``pandas`` - to infer them is not recommended. 
In doing so, the parsing engine will - infer the dtypes for different chunks of the data, rather than the whole - dataset at once. Consequently, you can end up with column(s) with mixed - dtypes. For example, + Reading in data with columns containing mixed dtypes and relying + on ``pandas`` to infer them is not recommended. In doing so, the + parsing engine will infer the dtypes for different chunks of the data, + rather than the whole dataset at once. Consequently, you can end up with + column(s) with mixed dtypes. For example, .. ipython:: python :okwarning: @@ -454,12 +454,12 @@ individual columns: df = pd.DataFrame({'col_1':range(500000) + ['a', 'b'] + range(500000)}) df.to_csv('foo') mixed_df = pd.read_csv('foo') - mixed_df['col_1'].apply(lambda x: type(x)).value_counts() + mixed_df['col_1'].apply(type).value_counts() mixed_df['col_1'].dtype - will result with `mixed_df` containing an ``int`` dtype for the first - 262,143 values, and ``str`` for others due to a problem during - parsing. It is important to note that the overall column will be marked with a + will result with `mixed_df` containing an ``int`` dtype for certain chunks + of the column, and ``str`` for others due to a problem during parsing. + It is important to note that the overall column will be marked with a ``dtype`` of ``object``, which is used for columns with mixed dtypes. Fortunately, ``pandas`` offers a few ways to ensure that the column(s) @@ -469,7 +469,7 @@ individual columns: .. 
ipython:: python fixed_df1 = pd.read_csv('foo', converters={'col_1':str}) - fixed_df1['col_1'].apply(lambda x: type(x)).value_counts() + fixed_df1['col_1'].apply(type).value_counts() Or you could use the :func:`~pandas.to_numeric` function to coerce the dtypes after reading in the data, @@ -479,9 +479,9 @@ individual columns: fixed_df2 = pd.read_csv('foo') fixed_df2['col_1'] = pd.to_numeric(fixed_df2['col_1'], errors='coerce') - fixed_df2['col_1'].apply(lambda x: type(x)).value_counts() + fixed_df2['col_1'].apply(type).value_counts() - which would convert all valid parsing to ints, leaving the invalid parsing + which would convert all valid parsing to floats, leaving the invalid parsing as ``NaN``. Alternatively, you could set the ``low_memory`` argument of :func:`~pandas.read_csv` @@ -490,9 +490,14 @@ individual columns: .. ipython:: python fixed_df3 = pd.read_csv('foo', low_memory=False) - fixed_df2['col_1'].apply(lambda x: type(x)).value_counts() + fixed_df3['col_1'].apply(type).value_counts() + + Ultimately, how you deal with reading in columns containing mixed dtypes + depends on your specific needs. In the case above, if you wanted to ``NaN`` out + the data anomalies, then :func:`~pandas.to_numeric` is probably your best option. + However, if you wanted for all the data to be coerced, no matter the type, then + using the ``converters`` argument of :func:`~pandas.read_csv` would certainly work. - which achieves a similar result. 
Naming and Using Columns '''''''''''''''''''''''' From 8112ad5eb6098c3a140ae8badb6e0da8e8c3ae50 Mon Sep 17 00:00:00 2001 From: wcwagner Date: Tue, 26 Jul 2016 11:31:46 -0400 Subject: [PATCH 4/5] DOC: Shortened note, moved alternatives to main text --- doc/source/io.rst | 82 ++++++++++++++++++++++------------------------- 1 file changed, 39 insertions(+), 43 deletions(-) diff --git a/doc/source/io.rst b/doc/source/io.rst index 6823bd05000fd..cc5b17fcd1464 100644 --- a/doc/source/io.rst +++ b/doc/source/io.rst @@ -435,18 +435,48 @@ individual columns: df = pd.read_csv(StringIO(data), dtype={'b': object, 'c': np.float64}) df.dtypes +Fortunately, ``pandas`` offers more than one way to ensure that your column(s) +contain only one ``dtype``. For instance, you can use the ``converters`` argument +of :func:`~pandas.read_csv`: + +.. ipython:: python + + data = "col_1\n1\n2\n'A'\n4.22" + df = pd.read_csv(StringIO(data), converters={'col_1':str}) + df + df['col_1'].apply(type).value_counts() + +Or you can use the :func:`~pandas.to_numeric` function to coerce the +dtypes after reading in the data, + +.. ipython:: python + + df2 = pd.read_csv(StringIO(data)) + df2['col_1'] = pd.to_numeric(df2['col_1'], errors='coerce') + df2 + df2['col_1'].apply(type).value_counts() + +which would convert all valid parsing to floats, leaving the invalid parsing +as ``NaN``. + +Ultimately, how you deal with reading in columns containing mixed dtypes +depends on your specific needs. In the case above, if you wanted to ``NaN`` out +the data anomalies, then :func:`~pandas.to_numeric` is probably your best option. +However, if you wanted all of the data to be coerced, no matter the type, then +using the ``converters`` argument of :func:`~pandas.read_csv` would certainly be +worth trying. + .. note:: The ``dtype`` option is currently only supported by the C engine. Specifying ``dtype`` with ``engine`` other than 'c' raises a ``ValueError``. .. 
note:: - - Reading in data with columns containing mixed dtypes and relying - on ``pandas`` to infer them is not recommended. In doing so, the - parsing engine will infer the dtypes for different chunks of the data, - rather than the whole dataset at once. Consequently, you can end up with - column(s) with mixed dtypes. For example, + In some cases, reading in abnormal data with columns containing mixed dtypes + will result in an inconsistent dataset. If you rely on pandas to infer the + dtypes of your columns, the parsing engine will infer the dtypes for + different chunks of the data, rather than the whole dataset at once. Consequently, + you can end up with column(s) with mixed dtypes. For example, .. ipython:: python :okwarning: @@ -458,45 +488,11 @@ individual columns: mixed_df['col_1'].dtype will result with `mixed_df` containing an ``int`` dtype for certain chunks - of the column, and ``str`` for others due to a problem during parsing. - It is important to note that the overall column will be marked with a - ``dtype`` of ``object``, which is used for columns with mixed dtypes. - - Fortunately, ``pandas`` offers a few ways to ensure that the column(s) - contain only one ``dtype``. For instance, you could use the ``converters`` - argument of :func:`~pandas.read_csv` - - .. ipython:: python - - fixed_df1 = pd.read_csv('foo', converters={'col_1':str}) - fixed_df1['col_1'].apply(type).value_counts() - - Or you could use the :func:`~pandas.to_numeric` function to coerce the - dtypes after reading in the data, - - .. ipython:: python - :okwarning: - - fixed_df2 = pd.read_csv('foo') - fixed_df2['col_1'] = pd.to_numeric(fixed_df2['col_1'], errors='coerce') - fixed_df2['col_1'].apply(type).value_counts() - - which would convert all valid parsing to floats, leaving the invalid parsing - as ``NaN``. - - Alternatively, you could set the ``low_memory`` argument of :func:`~pandas.read_csv` - to ``False``. Such as, - - .. 
ipython:: python + of the column, and ``str`` for others due to the mixed dtypes from the + data that was read in. It is important to note that the overall column will be + marked with a ``dtype`` of ``object``, which is used for columns with mixed dtypes. - fixed_df3 = pd.read_csv('foo', low_memory=False) - fixed_df3['col_1'].apply(type).value_counts() - Ultimately, how you deal with reading in columns containing mixed dtypes - depends on your specific needs. In the case above, if you wanted to ``NaN`` out - the data anomalies, then :func:`~pandas.to_numeric` is probably your best option. - However, if you wanted for all the data to be coerced, no matter the type, then - using the ``converters`` argument of :func:`~pandas.read_csv` would certainly work. Naming and Using Columns '''''''''''''''''''''''' From 7400607fa5d9ccef19a826d4581c65d393b4e237 Mon Sep 17 00:00:00 2001 From: wcwagner Date: Tue, 26 Jul 2016 20:29:45 -0400 Subject: [PATCH 5/5] DOC: Added refs to basics.dtypes and basics.object_conversion, added whatsnew entry --- doc/source/basics.rst | 2 ++ doc/source/io.rst | 8 +++++++- doc/source/whatsnew/v0.19.0.txt | 2 ++ 3 files changed, 11 insertions(+), 1 deletion(-) diff --git a/doc/source/basics.rst b/doc/source/basics.rst index 63a7c8fded2db..1f670fb7fb593 100644 --- a/doc/source/basics.rst +++ b/doc/source/basics.rst @@ -1751,6 +1751,8 @@ Convert a subset of columns to a specified type using :meth:`~DataFrame.astype` dft.loc[:, ['a', 'b']] = dft.loc[:, ['a', 'b']].astype(np.uint8) dft.dtypes +.. _basics.object_conversion: + object conversion ~~~~~~~~~~~~~~~~~ diff --git a/doc/source/io.rst b/doc/source/io.rst index cc5b17fcd1464..e3b03b5a39b37 100644 --- a/doc/source/io.rst +++ b/doc/source/io.rst @@ -436,7 +436,13 @@ individual columns: df.dtypes Fortunately, ``pandas`` offers more than one way to ensure that your column(s) -contain only one ``dtype``. For instance, you can use the ``converters`` argument +contain only one ``dtype``. 
If you're unfamiliar with these concepts, you can +see :ref:`here <basics.dtypes>` to learn more about dtypes, and +:ref:`here <basics.object_conversion>` to learn more about ``object`` conversion in +``pandas``. + + +For instance, you can use the ``converters`` argument of :func:`~pandas.read_csv`: .. ipython:: python diff --git a/doc/source/whatsnew/v0.19.0.txt b/doc/source/whatsnew/v0.19.0.txt index 06625e09d70a1..86d60ca48ea6e 100644 --- a/doc/source/whatsnew/v0.19.0.txt +++ b/doc/source/whatsnew/v0.19.0.txt @@ -323,6 +323,8 @@ Other enhancements index=['row1', 'row2']) df.sort_values(by='row2', axis=1) +- Added documentation to :ref:`I/O <io>` regarding the perils of reading in columns with mixed dtypes and how to handle it (:issue:`13746`) + .. _whatsnew_0190.api:
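The behaviour this patch series documents can be sketched end to end with a small, self-contained script (not part of the patches themselves). It reuses the `col_1` CSV payload from PATCH 4/5; `io.StringIO` stands in for the docs' `StringIO` import, and variable names mirror the examples above:

```python
import io

import pandas as pd

# CSV whose single column mixes integers, a quoted string, and a float,
# mirroring the example used in the patched io.rst.
data = "col_1\n1\n2\n'A'\n4.22"

# Default read: the column cannot be parsed as numeric, so pandas
# falls back to an object dtype holding the raw values.
mixed_df = pd.read_csv(io.StringIO(data))
print(mixed_df["col_1"].dtype)  # object

# Remedy 1: the `converters` argument forces every value to str.
fixed_df1 = pd.read_csv(io.StringIO(data), converters={"col_1": str})
print(fixed_df1["col_1"].apply(type).value_counts())

# Remedy 2: pd.to_numeric with errors='coerce' converts the valid
# parses to floats and turns the invalid entry ('A') into NaN.
fixed_df2 = pd.read_csv(io.StringIO(data))
fixed_df2["col_1"] = pd.to_numeric(fixed_df2["col_1"], errors="coerce")
print(fixed_df2["col_1"].dtype)  # float64
```

As the note in PATCH 4/5 explains, on files large enough to be parsed in several chunks the default read can instead yield a column whose values alternate between int and str, while the overall column is still reported with a ``dtype`` of ``object``.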