
read_csv() option to parse numeric columns to np.float32 or np.float16 #2511

@dragoljub

Description


I noticed in the documentation:

"
Specifying column data types
Starting with v0.10, you can indicate the data type for the whole DataFrame or individual columns:
"

This is indeed a great feature, especially for numeric object codes that would otherwise be parsed as integers with their leading zeros removed.

From the example I was hoping I could specify a reduced-precision int or float for each column, but alas they are upcast to the 64-bit versions. :/ Except...

In [1]:

import numpy as np
import pandas as pd
import StringIO

data = 'a,b,c\n1,2,3\n4,5,6\n7,8,9'

df = pd.read_csv(StringIO.StringIO(data), dtype={'a': object, 'b': np.float32, 'c': np.int16})

df.dtypes

Out[1]: <-- upcast to 64 bit

a object
b float64
c int64

The upcasting occurs even if I explicitly try to cast a column to a 32-bit float/int afterward.

In [2]:

df['b'] = df['b'].astype(np.float32)
type(df['b'][0])

Out[2]: <-- upcast to 64 bit even with an explicit column cast

numpy.float64

However, if I start with the object dtype and then explicitly cast the column to float16/float32, everything seems to work.

In [3]:

data = 'a,b,c\nCat,2.3456789,3\nDog,5,6\nHat,8,9'

df = pd.read_csv(StringIO.StringIO(data), dtype={'a': object, 'b': object, 'c': np.int16})

df.dtypes

Out[3]:

a object
b object
c int64 <-- upcasting during read_csv

In [4]:

print type(df['b'][0])

df['b'][0]

<type 'str'> <-- here we have an unparsed string object

Out[4]:
'2.3456789'

In [5]:

df['b'] = df['b'].astype(np.float16)  # <-- explicitly cast the object column to float16

In [6]:

print type(df['b'][0])

print df['b'][0]

print df.dtypes

<type 'numpy.float16'>
2.3457
a object
b float16 <-- Yay 16 bit!
c int64

<-- The object column is correctly cast to float16, with the expected truncation of the data value.

Now my next question is: does this have any potentially bad memory implications? When converting many object columns to np.float32 or np.float16, does pandas properly allocate memory for the narrower values? I'm assuming an entirely new column is created and the old object column is simply freed from memory. This may be a good workaround for me, since I often have low-resolution real and integer columns with several million rows, which add extra overhead for storing/reading/parsing/writing in a 64-bit-wide format.
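For reference, here is a minimal sketch of the object-then-downcast workaround with a rough check of the per-column buffer sizes (just my own sketch, not documented pandas behavior; .values.nbytes counts only the underlying array buffer, so for the object column it measures the 8-byte pointers, not the strings themselves):

import numpy as np
import pandas as pd
import StringIO

data = 'a,b,c\nCat,2.3456789,3\nDog,5,6\nHat,8,9'

# Read the numeric column as object first, then cast it down explicitly.
df = pd.read_csv(StringIO.StringIO(data), dtype={'a': object, 'b': object, 'c': np.int16})
print df['b'].values.nbytes   # 8 bytes per row of object pointers (on a 64-bit build)

# astype builds a brand-new float32 array; the old object column is replaced in
# the DataFrame and, I assume, garbage-collected by Python afterwards.
df['b'] = df['b'].astype(np.float32)
print df['b'].values.nbytes   # 4 bytes per row once downcast
print df.dtypes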

Thanks for any input,
-Gagi

Metadata

Labels: Enhancement, IO Data