From 335d043be20ef949a4dba63d74c02597a01efa66 Mon Sep 17 00:00:00 2001 From: wcwagner Date: Sun, 24 Jul 2016 22:31:49 -0400 Subject: [PATCH 1/5] DOC: Added note to io.rst regarding reading in mixed dtypes --- doc/source/io.rst | 39 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 39 insertions(+) diff --git a/doc/source/io.rst b/doc/source/io.rst index 86da2561a36be..79d7280dbb4fe 100644 --- a/doc/source/io.rst +++ b/doc/source/io.rst @@ -440,6 +440,45 @@ individual columns: Specifying ``dtype`` with ``engine`` other than 'c' raises a ``ValueError``. +.. note:: + + Reading in data with mixed dtypes and relying on ``pandas`` + to infer them is not recommended. In doing so, the parsing engine will + loop over all the dtypes, trying to convert them to an actual + type; if something breaks during that process, the engine will go to the + next ``dtype`` and the data is left modified in place. For example, + + .. ipython:: python + + from collections import Counter + df = pd.DataFrame({'col_1':range(500000) + ['a', 'b'] + range(500000)}) + df.to_csv('foo') + mixed_df = pd.read_csv('foo') + Counter(mixed_df['col_1'].apply(lambda x: type(x))) + + will result with `mixed_df` containing an ``int`` dtype for the first + 262,143 values, and ``str`` for others due to a problem during + parsing. Fortunately, ``pandas`` offers a few ways to ensure that the column(s) + contain only one ``dtype``. For instance, you could use the ``converters`` + argument of :func:`~pandas.read_csv` + + .. ipython:: python + + fixed_df1 = pd.read_csv('foo', converters={'col_1':str}) + Counter(fixed_df1['col_1'].apply(lambda x: type(x))) + + Or you could use the :func:`~pandas.to_numeric` function to coerce the + dtypes after reading in the data, + + .. 
ipython:: python + + fixed_df2 = pd.read_csv('foo') + fixed_df2['col_1'] = pd.to_numeric(fixed_df2['col_1'], errors='coerce') + Counter(fixed_df2['col_1'].apply(lambda x: type(x))) + + which would convert all valid parsing to ints, leaving the invalid parsing + as ``NaN``. + Naming and Using Columns '''''''''''''''''''''''' From b6e2b64607ee1268e4ed88b7a0fe36a51fa2f4c7 Mon Sep 17 00:00:00 2001 From: wcwagner Date: Mon, 25 Jul 2016 10:09:20 -0400 Subject: [PATCH 2/5] DOC: Switched Counter to value_counts, added low_memory alternative example, clarified type inference process --- doc/source/io.rst | 30 ++++++++++++++++++++++-------- 1 file changed, 22 insertions(+), 8 deletions(-) diff --git a/doc/source/io.rst b/doc/source/io.rst index 79d7280dbb4fe..59c1ea9384809 100644 --- a/doc/source/io.rst +++ b/doc/source/io.rst @@ -444,41 +444,55 @@ individual columns: Reading in data with mixed dtypes and relying on ``pandas`` to infer them is not recommended. In doing so, the parsing engine will - loop over all the dtypes, trying to convert them to an actual - type; if something breaks during that process, the engine will go to the - next ``dtype`` and the data is left modified in place. For example, + infer the dtypes for different chunks of the data, rather than the whole + dataset at once. Consequently, you can end up with column(s) with mixed + dtypes. For example, .. ipython:: python + :okwarning: - from collections import Counter df = pd.DataFrame({'col_1':range(500000) + ['a', 'b'] + range(500000)}) df.to_csv('foo') mixed_df = pd.read_csv('foo') - Counter(mixed_df['col_1'].apply(lambda x: type(x))) + mixed_df['col_1'].apply(lambda x: type(x)).value_counts() + mixed_df['col_1'].dtype will result with `mixed_df` containing an ``int`` dtype for the first 262,143 values, and ``str`` for others due to a problem during - parsing.
It is important to note that the overall column will be marked with a + ``dtype`` of ``object``, which is used for columns with mixed dtypes. + + Fortunately, ``pandas`` offers a few ways to ensure that the column(s) contain only one ``dtype``. For instance, you could use the ``converters`` argument of :func:`~pandas.read_csv` .. ipython:: python fixed_df1 = pd.read_csv('foo', converters={'col_1':str}) - Counter(fixed_df1['col_1'].apply(lambda x: type(x))) + fixed_df1['col_1'].apply(lambda x: type(x)).value_counts() Or you could use the :func:`~pandas.to_numeric` function to coerce the dtypes after reading in the data, .. ipython:: python + :okwarning: fixed_df2 = pd.read_csv('foo') fixed_df2['col_1'] = pd.to_numeric(fixed_df2['col_1'], errors='coerce') - Counter(fixed_df2['col_1'].apply(lambda x: type(x))) + fixed_df2['col_1'].apply(lambda x: type(x)).value_counts() which would convert all valid parsing to ints, leaving the invalid parsing as ``NaN``. + Alternatively, you could set the ``low_memory`` argument of :func:`~pandas.read_csv` + to ``False``. Such as, + + .. ipython:: python + + fixed_df3 = pd.read_csv('foo', low_memory=False) + fixed_df2['col_1'].apply(lambda x: type(x)).value_counts() + + which achieves a similar result. Naming and Using Columns '''''''''''''''''''''''' From ba4c2ced5a0544f6b5378d54b42e4cba5646825a Mon Sep 17 00:00:00 2001 From: wcwagner Date: Mon, 25 Jul 2016 20:06:31 -0400 Subject: [PATCH 3/5] DOC: Added short commentary on alternatives --- doc/source/io.rst | 33 +++++++++++++++++++-------------- 1 file changed, 19 insertions(+), 14 deletions(-) diff --git a/doc/source/io.rst b/doc/source/io.rst index 59c1ea9384809..6823bd05000fd 100644 --- a/doc/source/io.rst +++ b/doc/source/io.rst @@ -442,11 +442,11 @@ individual columns: .. note:: - Reading in data with mixed dtypes and relying on ``pandas`` - to infer them is not recommended. 
In doing so, the parsing engine will - infer the dtypes for different chunks of the data, rather than the whole - dataset at once. Consequently, you can end up with column(s) with mixed - dtypes. For example, + Reading in data with columns containing mixed dtypes and relying + on ``pandas`` to infer them is not recommended. In doing so, the + parsing engine will infer the dtypes for different chunks of the data, + rather than the whole dataset at once. Consequently, you can end up with + column(s) with mixed dtypes. For example, .. ipython:: python :okwarning: @@ -454,12 +454,12 @@ individual columns: df = pd.DataFrame({'col_1':range(500000) + ['a', 'b'] + range(500000)}) df.to_csv('foo') mixed_df = pd.read_csv('foo') - mixed_df['col_1'].apply(lambda x: type(x)).value_counts() + mixed_df['col_1'].apply(type).value_counts() mixed_df['col_1'].dtype - will result with `mixed_df` containing an ``int`` dtype for the first - 262,143 values, and ``str`` for others due to a problem during - parsing. It is important to note that the overall column will be marked with a + will result with `mixed_df` containing an ``int`` dtype for certain chunks + of the column, and ``str`` for others due to a problem during parsing. + It is important to note that the overall column will be marked with a ``dtype`` of ``object``, which is used for columns with mixed dtypes. Fortunately, ``pandas`` offers a few ways to ensure that the column(s) @@ -469,7 +469,7 @@ individual columns: .. 
ipython:: python fixed_df1 = pd.read_csv('foo', converters={'col_1':str}) - fixed_df1['col_1'].apply(lambda x: type(x)).value_counts() + fixed_df1['col_1'].apply(type).value_counts() Or you could use the :func:`~pandas.to_numeric` function to coerce the dtypes after reading in the data, @@ -479,9 +479,9 @@ individual columns: fixed_df2 = pd.read_csv('foo') fixed_df2['col_1'] = pd.to_numeric(fixed_df2['col_1'], errors='coerce') - fixed_df2['col_1'].apply(lambda x: type(x)).value_counts() + fixed_df2['col_1'].apply(type).value_counts() - which would convert all valid parsing to ints, leaving the invalid parsing + which would convert all valid parsing to floats, leaving the invalid parsing as ``NaN``. Alternatively, you could set the ``low_memory`` argument of :func:`~pandas.read_csv` @@ -490,9 +490,14 @@ individual columns: .. ipython:: python fixed_df3 = pd.read_csv('foo', low_memory=False) - fixed_df2['col_1'].apply(lambda x: type(x)).value_counts() + fixed_df3['col_1'].apply(type).value_counts() + + Ultimately, how you deal with reading in columns containing mixed dtypes + depends on your specific needs. In the case above, if you wanted to ``NaN`` out + the data anomalies, then :func:`~pandas.to_numeric` is probably your best option. + However, if you wanted for all the data to be coerced, no matter the type, then + using the ``converters`` argument of :func:`~pandas.read_csv` would certainly work. - which achieves a similar result. 
Naming and Using Columns '''''''''''''''''''''''' From 8112ad5eb6098c3a140ae8badb6e0da8e8c3ae50 Mon Sep 17 00:00:00 2001 From: wcwagner Date: Tue, 26 Jul 2016 11:31:46 -0400 Subject: [PATCH 4/5] DOC: Shortened note, moved alternatives to main text --- doc/source/io.rst | 82 ++++++++++++++++++++++------------------------- 1 file changed, 39 insertions(+), 43 deletions(-) diff --git a/doc/source/io.rst b/doc/source/io.rst index 6823bd05000fd..cc5b17fcd1464 100644 --- a/doc/source/io.rst +++ b/doc/source/io.rst @@ -435,18 +435,48 @@ individual columns: df = pd.read_csv(StringIO(data), dtype={'b': object, 'c': np.float64}) df.dtypes +Fortunately, ``pandas`` offers more than one way to ensure that your column(s) +contain only one ``dtype``. For instance, you can use the ``converters`` argument +of :func:`~pandas.read_csv`: + +.. ipython:: python + + data = "col_1\n1\n2\n'A'\n4.22" + df = pd.read_csv(StringIO(data), converters={'col_1':str}) + df + df['col_1'].apply(type).value_counts() + +Or you can use the :func:`~pandas.to_numeric` function to coerce the +dtypes after reading in the data, + +.. ipython:: python + + df2 = pd.read_csv(StringIO(data)) + df2['col_1'] = pd.to_numeric(df2['col_1'], errors='coerce') + df2 + df2['col_1'].apply(type).value_counts() + +which would convert all valid parsing to floats, leaving the invalid parsing +as ``NaN``. + +Ultimately, how you deal with reading in columns containing mixed dtypes +depends on your specific needs. In the case above, if you wanted to ``NaN`` out +the data anomalies, then :func:`~pandas.to_numeric` is probably your best option. +However, if you wanted all of the data to be coerced, no matter the type, then +using the ``converters`` argument of :func:`~pandas.read_csv` would certainly be +worth trying. + .. note:: The ``dtype`` option is currently only supported by the C engine. Specifying ``dtype`` with ``engine`` other than 'c' raises a ``ValueError``. .. 
note:: - - Reading in data with columns containing mixed dtypes and relying - on ``pandas`` to infer them is not recommended. In doing so, the - parsing engine will infer the dtypes for different chunks of the data, - rather than the whole dataset at once. Consequently, you can end up with - column(s) with mixed dtypes. For example, + In some cases, reading in abnormal data with columns containing mixed dtypes + will result in an inconsistent dataset. If you rely on pandas to infer the + dtypes of your columns, the parsing engine will infer the dtypes for + different chunks of the data, rather than the whole dataset at once. Consequently, + you can end up with column(s) with mixed dtypes. For example, .. ipython:: python :okwarning: @@ -458,45 +488,11 @@ individual columns: mixed_df['col_1'].dtype will result with `mixed_df` containing an ``int`` dtype for certain chunks - of the column, and ``str`` for others due to a problem during parsing. - It is important to note that the overall column will be marked with a - ``dtype`` of ``object``, which is used for columns with mixed dtypes. - - Fortunately, ``pandas`` offers a few ways to ensure that the column(s) - contain only one ``dtype``. For instance, you could use the ``converters`` - argument of :func:`~pandas.read_csv` - - .. ipython:: python - - fixed_df1 = pd.read_csv('foo', converters={'col_1':str}) - fixed_df1['col_1'].apply(type).value_counts() - - Or you could use the :func:`~pandas.to_numeric` function to coerce the - dtypes after reading in the data, - - .. ipython:: python - :okwarning: - - fixed_df2 = pd.read_csv('foo') - fixed_df2['col_1'] = pd.to_numeric(fixed_df2['col_1'], errors='coerce') - fixed_df2['col_1'].apply(type).value_counts() - - which would convert all valid parsing to floats, leaving the invalid parsing - as ``NaN``. - - Alternatively, you could set the ``low_memory`` argument of :func:`~pandas.read_csv` - to ``False``. Such as, - - .. 
ipython:: python + of the column, and ``str`` for others due to the mixed dtypes from the + data that was read in. It is important to note that the overall column will be + marked with a ``dtype`` of ``object``, which is used for columns with mixed dtypes. - fixed_df3 = pd.read_csv('foo', low_memory=False) - fixed_df3['col_1'].apply(type).value_counts() - Ultimately, how you deal with reading in columns containing mixed dtypes - depends on your specific needs. In the case above, if you wanted to ``NaN`` out - the data anomalies, then :func:`~pandas.to_numeric` is probably your best option. - However, if you wanted for all the data to be coerced, no matter the type, then - using the ``converters`` argument of :func:`~pandas.read_csv` would certainly work. Naming and Using Columns '''''''''''''''''''''''' From 7400607fa5d9ccef19a826d4581c65d393b4e237 Mon Sep 17 00:00:00 2001 From: wcwagner Date: Tue, 26 Jul 2016 20:29:45 -0400 Subject: [PATCH 5/5] DOC: Added refs to basics.dtypes and basics.object_conversion, added whatsnew entry --- doc/source/basics.rst | 2 ++ doc/source/io.rst | 8 +++++++- doc/source/whatsnew/v0.19.0.txt | 2 ++ 3 files changed, 11 insertions(+), 1 deletion(-) diff --git a/doc/source/basics.rst b/doc/source/basics.rst index 63a7c8fded2db..1f670fb7fb593 100644 --- a/doc/source/basics.rst +++ b/doc/source/basics.rst @@ -1751,6 +1751,8 @@ Convert a subset of columns to a specified type using :meth:`~DataFrame.astype` dft.loc[:, ['a', 'b']] = dft.loc[:, ['a', 'b']].astype(np.uint8) dft.dtypes +.. _basics.object_conversion: + object conversion ~~~~~~~~~~~~~~~~~ diff --git a/doc/source/io.rst b/doc/source/io.rst index cc5b17fcd1464..e3b03b5a39b37 100644 --- a/doc/source/io.rst +++ b/doc/source/io.rst @@ -436,7 +436,13 @@ individual columns: df.dtypes Fortunately, ``pandas`` offers more than one way to ensure that your column(s) -contain only one ``dtype``. For instance, you can use the ``converters`` argument +contain only one ``dtype``. 
If you're unfamiliar with these concepts, you can +see :ref:`here <basics.dtypes>` to learn more about dtypes, and +:ref:`here <basics.object_conversion>` to learn more about ``object`` conversion in +``pandas``. + + +For instance, you can use the ``converters`` argument of :func:`~pandas.read_csv`: .. ipython:: python diff --git a/doc/source/whatsnew/v0.19.0.txt b/doc/source/whatsnew/v0.19.0.txt index 06625e09d70a1..86d60ca48ea6e 100644 --- a/doc/source/whatsnew/v0.19.0.txt +++ b/doc/source/whatsnew/v0.19.0.txt @@ -323,6 +323,8 @@ Other enhancements index=['row1', 'row2']) df.sort_values(by='row2', axis=1) +- Added documentation to :ref:`I/O <io>` regarding the perils of reading in columns with mixed dtypes and how to handle it (:issue:`13746`) + .. _whatsnew_0190.api:
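The behaviour this patch series documents can be sketched end to end with a small, self-contained script (not part of the patches themselves). It reuses the `col_1` CSV payload from PATCH 4/5; `io.StringIO` stands in for the docs' `StringIO` import, and variable names mirror the examples above:

```python
import io

import pandas as pd

# CSV whose single column mixes integers, a quoted string, and a float,
# mirroring the example used in the patched io.rst.
data = "col_1\n1\n2\n'A'\n4.22"

# Default read: the column cannot be parsed as numeric, so pandas
# falls back to an object dtype holding the raw values.
mixed_df = pd.read_csv(io.StringIO(data))
print(mixed_df["col_1"].dtype)  # object

# Remedy 1: the `converters` argument forces every value to str.
fixed_df1 = pd.read_csv(io.StringIO(data), converters={"col_1": str})
print(fixed_df1["col_1"].apply(type).value_counts())

# Remedy 2: pd.to_numeric with errors='coerce' converts the valid
# parses to floats and turns the invalid entry ('A') into NaN.
fixed_df2 = pd.read_csv(io.StringIO(data))
fixed_df2["col_1"] = pd.to_numeric(fixed_df2["col_1"], errors="coerce")
print(fixed_df2["col_1"].dtype)  # float64
```

As the note in PATCH 4/5 explains, on files large enough to be parsed in several chunks the default read can instead yield a column whose values alternate between int and str, while the overall column is still reported with a ``dtype`` of ``object``.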