@@ -435,18 +435,48 @@ individual columns:
   df = pd.read_csv(StringIO(data), dtype={'b': object, 'c': np.float64})
   df.dtypes

+ Fortunately, ``pandas`` offers more than one way to ensure that your column(s)
+ contain only one ``dtype``. For instance, you can use the ``converters`` argument
+ of :func:`~pandas.read_csv`:
+
+ .. ipython:: python
+
+    data = "col_1\n1\n2\n'A'\n4.22"
+    df = pd.read_csv(StringIO(data), converters={'col_1': str})
+    df
+    df['col_1'].apply(type).value_counts()
+
+ Or you can use the :func:`~pandas.to_numeric` function to coerce the
+ dtypes after reading in the data,
+
+ .. ipython:: python
+
+    df2 = pd.read_csv(StringIO(data))
+    df2['col_1'] = pd.to_numeric(df2['col_1'], errors='coerce')
+    df2
+    df2['col_1'].apply(type).value_counts()
+
+ which would convert all valid parsing to floats, leaving the invalid parsing
+ as ``NaN``.
+
+ Ultimately, how you deal with reading in columns containing mixed dtypes
+ depends on your specific needs. In the case above, if you wanted to ``NaN`` out
+ the data anomalies, then :func:`~pandas.to_numeric` is probably your best option.
+ However, if you wanted all of the data to be coerced, no matter the type, then
+ using the ``converters`` argument of :func:`~pandas.read_csv` would certainly be
+ worth trying.
+

.. note::

   The ``dtype`` option is currently only supported by the C engine.
   Specifying ``dtype`` with ``engine`` other than 'c' raises a
   ``ValueError``.

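A hedged sketch of the note above (the data and column names are invented for illustration): the ``dtype`` mapping works with the C engine, which can also be requested explicitly via ``engine='c'``:

```python
from io import StringIO

import numpy as np
import pandas as pd

data = "a,b\n1,x\n2,y"

# the C engine (pandas' default parser) supports per-column dtypes;
# per the note, other engines would raise ValueError here at the time
# this doc was written
df = pd.read_csv(StringIO(data), engine='c', dtype={'a': np.float64, 'b': object})
print(df.dtypes)
```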
.. note::
-
-    Reading in data with columns containing mixed dtypes and relying
-    on ``pandas`` to infer them is not recommended. In doing so, the
-    parsing engine will infer the dtypes for different chunks of the data,
-    rather than the whole dataset at once. Consequently, you can end up with
-    column(s) with mixed dtypes. For example,
+    In some cases, reading in abnormal data with columns containing mixed dtypes
+    will result in an inconsistent dataset. If you rely on pandas to infer the
+    dtypes of your columns, the parsing engine will infer the dtypes for
+    different chunks of the data, rather than the whole dataset at once. Consequently,
+    you can end up with column(s) with mixed dtypes. For example,

    .. ipython:: python
       :okwarning:
@@ -458,45 +488,11 @@ individual columns:
       mixed_df['col_1'].dtype

    will result with ``mixed_df`` containing an ``int`` dtype for certain chunks
-    of the column, and ``str`` for others due to a problem during parsing.
-    It is important to note that the overall column will be marked with a
-    ``dtype`` of ``object``, which is used for columns with mixed dtypes.
-
-    Fortunately, ``pandas`` offers a few ways to ensure that the column(s)
-    contain only one ``dtype``. For instance, you could use the ``converters``
-    argument of :func:`~pandas.read_csv`
-
-    .. ipython:: python
-
-       fixed_df1 = pd.read_csv('foo', converters={'col_1': str})
-       fixed_df1['col_1'].apply(type).value_counts()
-
-    Or you could use the :func:`~pandas.to_numeric` function to coerce the
-    dtypes after reading in the data,
-
-    .. ipython:: python
-       :okwarning:
-
-       fixed_df2 = pd.read_csv('foo')
-       fixed_df2['col_1'] = pd.to_numeric(fixed_df2['col_1'], errors='coerce')
-       fixed_df2['col_1'].apply(type).value_counts()
-
-    which would convert all valid parsing to floats, leaving the invalid parsing
-    as ``NaN``.
-
-    Alternatively, you could set the ``low_memory`` argument of :func:`~pandas.read_csv`
-    to ``False``. Such as,
-
-    .. ipython:: python
-
-       fixed_df3 = pd.read_csv('foo', low_memory=False)
-       fixed_df3['col_1'].apply(type).value_counts()
-
-    Ultimately, how you deal with reading in columns containing mixed dtypes
-    depends on your specific needs. In the case above, if you wanted to ``NaN`` out
-    the data anomalies, then :func:`~pandas.to_numeric` is probably your best option.
-    However, if you wanted for all the data to be coerced, no matter the type, then
-    using the ``converters`` argument of :func:`~pandas.read_csv` would certainly work.
+    of the column, and ``str`` for others due to the mixed dtypes from the
+    data that was read in. It is important to note that the overall column will be
+    marked with a ``dtype`` of ``object``, which is used for columns with mixed dtypes.
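Since the file setup behind ``mixed_df`` falls outside this hunk, here is a self-contained sketch of the chunked-inference behaviour the note describes (the 500,000-row sample data is an assumption; suppressing the ``DtypeWarning`` stands in for the ``:okwarning:`` flag used in the doc):

```python
import warnings
from io import StringIO

import pandas as pd

# 500,000 integer rows followed by one string row: under the default
# chunked (low_memory) parsing, the C engine infers dtypes per chunk
data = "col_1\n" + "\n".join(["1"] * 500000 + ["a"])

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # pandas may emit a DtypeWarning here
    mixed_df = pd.read_csv(StringIO(data))

# the column as a whole is reported as object, the dtype used for mixed data
print(mixed_df["col_1"].dtype)
print(mixed_df["col_1"].apply(type).value_counts())
```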

Naming and Using Columns
''''''''''''''''''''''''