
read_csv() option to parse numeric columns to np.float32 or np.float16 #2511

@dragoljub

Description


I noticed in the documentation:

"
Specifying column data types
Starting with v0.10, you can indicate the data type for the whole DataFrame or individual columns:
"

This is indeed a great feature, especially for numeric object codes that would otherwise be parsed as integers with their leading zeros removed.

From the example I was hoping I could specify a reduced-precision int or float for each column, but alas they are upcast to the 64-bit versions. :/ Except...

In [1]:

import numpy as np
import pandas as pd
import StringIO

data = 'a,b,c\n1,2,3\n4,5,6\n7,8,9'

df = pd.read_csv(StringIO.StringIO(data), dtype={'a': object, 'b': np.float32, 'c': np.int16})

df.dtypes

Out[1]: <-- upcast to 64 bit

a object
b float64
c int64

The upcasting occurs even if I explicitly try to cast a column to a 32-bit float/int afterward.

In [2]:

df['b'] = df['b'].astype(np.float32)
type(df['b'][0])

Out[2]: <-- upcast to 64 bit even with an explicit column cast

numpy.float64

However, if I start with the object dtype and then explicitly cast the column to float16/float32, everything seems to work.

In [3]:

data = 'a,b,c\nCat,2.3456789,3\nDog,5,6\nHat,8,9'

df = pd.read_csv(StringIO.StringIO(data), dtype={'a': object, 'b': object, 'c': np.int16})

df.dtypes

Out[3]:

a object
b object
c int64 <-- upcasting during read_csv

In [4]:

print type(df['b'][0])

df['b'][0]

<type 'str'> <-- here we have an unparsed string object

Out[4]:
'2.3456789'

In [5]:

df['b'] = df['b'].astype(np.float16)  # <-- explicitly cast the object column to float16

In [6]:

print type(df['b'][0])

print df['b'][0]

print df.dtypes

<type 'numpy.float16'>
2.3457
a object
b float16 <-- Yay 16 bit!
c int64

<-- The object column is correctly cast to float16, with the expected truncation of the data value.

Now my next question is: does this have any potentially bad memory implications? When converting many object columns to np.float32 or np.float16, does pandas properly allocate memory for the narrower values? I'm assuming an entirely new column is created and the old object column is simply freed from memory. This may be a good workaround for me, since I often have low-resolution real and integer columns with several million rows, which add extra overhead for storing/reading/parsing/writing in a 64-bit-wide format.
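For reference, here is a minimal sketch of the object-then-downcast workaround with a rough check of the per-column buffer sizes (just my own sketch, not documented pandas behavior; .values.nbytes counts only the underlying array buffer, so for the object column it measures the 8-byte pointers, not the strings themselves):

import numpy as np
import pandas as pd
import StringIO

data = 'a,b,c\nCat,2.3456789,3\nDog,5,6\nHat,8,9'

# Read the numeric column as object first, then cast it down explicitly.
df = pd.read_csv(StringIO.StringIO(data), dtype={'a': object, 'b': object, 'c': np.int16})
print df['b'].values.nbytes   # 8 bytes per row of object pointers (on a 64-bit build)

# astype builds a brand-new float32 array; the old object column is replaced in
# the DataFrame and, I assume, garbage-collected by Python afterwards.
df['b'] = df['b'].astype(np.float32)
print df['b'].values.nbytes   # 4 bytes per row once downcast
print df.dtypes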

Thanks for any input,
-Gagi

Metadata

Labels: Enhancement, IO Data