truncation issue with pd.read_csv #7072

Closed
nnaman opened this issue May 8, 2014 · 5 comments · Fixed by #30554
Labels: good first issue, Needs Tests (unit test(s) needed to prevent regressions)

@nnaman

nnaman commented May 8, 2014

Hi, thank you very much for the awesome pandas tool. I would like to report an issue I noticed and ask for guidance on how to remedy it. When I store a long integer in a file using to_csv, the data is stored correctly, but when I read it back using pd.read_csv, the last three digits change. When I save it again using to_csv (without any edits), the numbers in the resulting CSV file differ from those in the original CSV file. I've illustrated the problem below (notice how 4321113141090630389 becomes 4321113141090630400 and 4321583677327450765 becomes 4321583677327450880):

original CSV file created by pd.to_csv:

grep -e 321583677327450 -e 321113141090630 orig.piece
orig.piece:1,1;0;0;0;1;1;3844;3844;3844;1;1;1;1;1;1;0;0;1;1;0;0,,,4321583677327450765
orig.piece:5,1;0;0;0;1;1;843;843;843;1;1;1;1;1;1;0;0;1;1;0;0,64.0,;,4321113141090630389

import pandas as pd
import numpy as np

orig = pd.read_csv('orig.piece')
orig.dtypes
Unnamed: 0 int64
aa object
act float64
...
...
s_act float64
dtype: object
orig['s_act'].head(6)
0 NaN
1 4.321584e+18
2 4.321974e+18
3 4.321494e+18
4 4.321283e+18
5 4.321113e+18
Name: s_act, dtype: float64

orig['s_act'].fillna(0).astype(int).head(6)
0 0
1 4321583677327450880
2 4321973950881710336
3 4321493786516159488
4 4321282586859217408
5 4321113141090630400

orig.to_csv('convert.piece')

grep -e 321583677327450 -e 321113141090630 orig.piece convert.piece
orig.piece:1,1;0;0;0;1;1;3844;3844;3844;1;1;1;1;1;1;0;0;1;1;0;0,,,4321583677327450765
orig.piece:5,1;0;0;0;1;1;843;843;843;1;1;1;1;1;1;0;0;1;1;0;0,64.0,;,4321113141090630389
convert.piece:1,1;0;0;0;1;1;3844;3844;3844;1;1;1;1;1;1;0;0;1;1;0;0,,,4.321583677327451e+18
convert.piece:5,1;0;0;0;1;1;843;843;843;1;1;1;1;1;1;0;0;1;1;0;0,64.0,;,4.3211131410906304e+18

Could you please help me understand why read_csv jumbles the last three digits? It's not even a rounding issue; the digits are totally different (for example, 4321583677327450765 becomes 4321583677327450880 above). Is it because scientific notation gets in the way? How can we disable it and have pandas treat this data as just object/string or plain integer/float?
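
A short sketch of the likely mechanism behind the changed digits, assuming the column was parsed as float64 (as the dtype listing above shows). float64 carries a 53-bit significand, so near 4.3e18 adjacent representable values are 512 apart and most 19-digit integers cannot be stored exactly. The snippet uses only the standard library (`math.ulp` needs Python 3.9+). The exact digits a particular pandas version produced may differ from what plain `float()` yields, since that depends on the parser's string-to-float routine.

```python
import math

# The two integers from the report.
originals = [4321583677327450765, 4321113141090630389]

# Integers are exactly representable in float64 only up to this magnitude.
FLOAT64_EXACT_LIMIT = 2**53

for value in originals:
    as_float = float(value)        # nearest float64 to the integer
    back = int(as_float)           # exact integer value of that float64
    spacing = math.ulp(as_float)   # gap between neighbouring float64 values here
    print(f"{value} -> {back} (spacing {spacing:.0f}, "
          f"beyond 2**53: {value > FLOAT64_EXACT_LIMIT})")
```

So the change is a rounding effect after all: the value snaps to the nearest representable float64, which at this magnitude can be a couple of hundred units away.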

@nnaman
Author

nnaman commented May 8, 2014

By the way, when I use orig = pd.read_csv('orig.piece', dtype=str), the problem goes away. But is there any downside to this? Also, this sounds like a workaround and not a fix.
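
A minimal sketch of that workaround, narrowed to the affected column so the other columns keep normal type inference. The two-column `CSV_TEXT` is a hypothetical stand-in for `orig.piece`, with one missing `s_act` value as in the report. The main downside is that the column holds strings, so arithmetic or numeric comparison needs an explicit conversion first.

```python
import io

import pandas as pd

# Hypothetical stand-in for orig.piece: the first data row has no s_act value.
CSV_TEXT = "id,s_act\n0,\n1,4321583677327450765\n5,4321113141090630389\n"

# Read only the wide-integer column as text.
df = pd.read_csv(io.StringIO(CSV_TEXT), dtype={"s_act": str})

# Writing the frame back out reproduces the original digits.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
round_trip_lines = buffer.getvalue().splitlines()

# When numbers are needed, Python ints keep every digit.
exact_values = [int(text) for text in df["s_act"].dropna()]
print(round_trip_lines)
print(exact_values)
```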

@jreback
Contributor

jreback commented May 8, 2014

This is the user's responsibility; reading in as dtype=object (this is what dtype=str does) is the appropriate method. That said, if you wanted to do a PR that adds some sort of warning and does not impact performance, I think that would be OK.
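
For illustration only, a user-side check in the spirit of the suggested warning. `warn_on_wide_integers` is a hypothetical helper, not a pandas function, and because it scans the whole file up front it does not meet the "no performance impact" condition. It flags columns containing integer literals beyond 2**53, which is conservative: such a column is only at risk if it ends up parsed as float64.

```python
import csv
import io
import re
import warnings

FLOAT64_EXACT_LIMIT = 2**53  # integers up to this magnitude are exact in float64
INTEGER_LITERAL = re.compile(r"[+-]?[0-9]+")


def warn_on_wide_integers(csv_text, delimiter=","):
    """Hypothetical pre-read check: return the positions of columns that
    contain integer literals too wide to survive a float64 conversion."""
    risky = set()
    for row in csv.reader(io.StringIO(csv_text), delimiter=delimiter):
        for position, field in enumerate(row):
            text = field.strip()
            if INTEGER_LITERAL.fullmatch(text) and abs(int(text)) > FLOAT64_EXACT_LIMIT:
                risky.add(position)
    if risky:
        warnings.warn(
            f"columns {sorted(risky)} hold integers beyond 2**53; "
            "consider reading them with dtype=str"
        )
    return sorted(risky)


# The two rows quoted in this thread.
sample = (
    "1,1;0;0;0;1;1;3844;3844;3844;1;1;1;1;1;1;0;0;1;1;0;0,,,4321583677327450765\n"
    "5,1;0;0;0;1;1;843;843;843;1;1;1;1;1;1;0;0;1;1;0;0,64.0,;,4321113141090630389\n"
)
print(warn_on_wide_integers(sample))
```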

@jreback
Contributor

jreback commented May 8, 2014

related #2511

@jreback added this to the Someday milestone on May 8, 2014

@kokes
Contributor

kokes commented Oct 2, 2018

I can't reproduce this anymore, using pandas 0.23.4

In [9]: df=pd.read_csv('dt.csv', header=None)

In [10]: df[4]
Out[10]: 
0    4321583677327450765
1    4321113141090630389
Name: 4, dtype: int64

In [11]: with open('dt.csv') as f:
    ...:     print(f.read())
    ...:     
1,1;0;0;0;1;1;3844;3844;3844;1;1;1;1;1;1;0;0;1;1;0;0,,,4321583677327450765
5,1;0;0;0;1;1;843;843;843;1;1;1;1;1;1;0;0;1;1;0;0,64.0,;,4321113141090630389

@mroeschke
Member

Could use a regression test.
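
A sketch of what such a regression test might look like, built from the two rows and the output shown in the pandas 0.23.4 comment above. The test name is made up for illustration, and this is not necessarily the test that was eventually merged.

```python
import io

import pandas as pd


def test_read_csv_preserves_large_integers():
    # Rows copied from the reproduction attempt in this thread.
    data = (
        "1,1;0;0;0;1;1;3844;3844;3844;1;1;1;1;1;1;0;0;1;1;0;0,,,4321583677327450765\n"
        "5,1;0;0;0;1;1;843;843;843;1;1;1;1;1;1;0;0;1;1;0;0,64.0,;,4321113141090630389\n"
    )
    result = pd.read_csv(io.StringIO(data), header=None)

    # The last column should stay int64 and keep every digit.
    assert result[4].dtype == "int64"
    assert result[4].tolist() == [4321583677327450765, 4321113141090630389]
```

Note that this covers only the case without missing values in the integer column, which is what the reproduction above exercised.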

@mroeschke added the good first issue and Needs Tests labels and removed the Dtype Conversions, IO CSV, and Numeric Operations labels on Oct 20, 2019
@simonjayhawkins modified the milestones: Someday, 1.0 on Dec 30, 2019