DataFrame.to_csv(quoting=csv.QUOTE_NONNUMERIC) quotes numeric values #12922

Closed
batterseapower opened this issue Apr 19, 2016 · 4 comments
Labels
Bug IO CSV read_csv, to_csv Output-Formatting __repr__ of pandas objects, to_string
Comments

@batterseapower
Contributor

Failing test

def test_pandas():
    import tempfile
    import csv
    import pandas as pd
    import numpy as np

    df = pd.DataFrame.from_dict({'column': [1.0, 2.0]})
    assert df['column'].dtype == np.dtype('float')

    with tempfile.TemporaryFile() as f:
        df.to_csv(f, quoting=csv.QUOTE_NONNUMERIC, index=False)

        f.seek(0)
        lines = f.read().splitlines()
        assert lines[0] == '"column"'
        assert not lines[1].startswith('"') # <--- THIS FAILS
        assert [1, 2] == map(float, lines[1:])

The issue is that the floats are being output wrapped with quotes, even though I requested QUOTE_NONNUMERIC.

The problem is that pandas.core.internals.FloatBlock.to_native_types (and by extension pandas.formats.format.FloatArrayFormatter.get_result_as_array) unconditionally formats the float array to a str array, which is then passed unchanged to the csv module and hence will be wrapped in quotes by that code.
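The mechanism described above can be reproduced with the standard library alone. The following is a minimal sketch, not pandas code: `write_rows` is a hypothetical helper, and the rows stand in for what pandas hands to the csv writer. With `csv.QUOTE_NONNUMERIC`, the writer decides what to quote from the Python type of each field, so a float that has already been converted to `str` is quoted like any other string.

```python
import csv
import io


def write_rows(rows):
    """Hypothetical helper: write rows with QUOTE_NONNUMERIC and return the text."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


# Real floats reach the writer: only the header string is quoted.
as_floats = write_rows([["column"], [1.0], [2.0]])

# Floats pre-formatted as strings: every field is quoted.
as_strings = write_rows([["column"], ["1.0"], ["2.0"]])

print(as_floats)
print(as_strings)
```

The second call corresponds to the behaviour reported in this issue: the values are numeric in the DataFrame, but they are no longer numeric by the time the csv module sees them.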

I'm not 100% sure but the fix may be to have FloatBlock.to_native_types check if quoting is set, and if so to skip using the FloatArrayFormatter? I say this because pandas.indexes.base.Index._format_native_types already has a special case along these lines. This does seem a bit dirty though!

Here is an awful monkeypatch that works around the problem:

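# Written against pandas 0.18.0. FloatBlock is a pandas-internal class,
# so this patch may break on other versions.
import pandas as pd

orig_to_native_types = pd.core.internals.FloatBlock.to_native_types

def to_native_types(self, *args, **kwargs):
    # When a quoting mode is requested, return the raw float values so the
    # csv writer can tell numeric fields from strings itself.
    # Note: this also bypasses na_rep and float_format for float columns.
    if kwargs.get('quoting'):
        values = self.values
        slicer = kwargs.get('slicer')
        if slicer is not None:
            values = values[:, slicer]
        return values

    # Otherwise fall back to the original string formatting.
    return orig_to_native_types(self, *args, **kwargs)

pd.core.internals.FloatBlock.to_native_types = to_native_types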

output of pd.show_versions()

commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 42 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.0
nose: None
pip: 8.1.1
setuptools: 7.0
Cython: 0.20.1
numpy: 1.11.0
scipy: 0.13.3
statsmodels: None
xarray: None
IPython: 3.2.1
sphinx: None
patsy: 0.3.0
dateutil: 2.5.2
pytz: 2016.3
blosc: None
bottleneck: 1.0.0
tables: None
numexpr: None
matplotlib: 1.3.1
openpyxl: 2.0.4
xlrd: 0.9.2
xlwt: None
xlsxwriter: None
lxml: 3.3.2
bs4: 4.2.0
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.11
pymysql: None
psycopg2: None
jinja2: 2.7.2
boto: None
@jreback
Contributor

jreback commented Apr 19, 2016

This is probably only minimally tested now. The data is written through the csv writer, which gets passed the quoting option. I think that needs to be turned off, since we do all of the quoting and formatting before passing the data to the writer (maybe not ALL, and that's the rub: some cases may be relying on the csv writer actually quoting things).
All values are passed as formatted strings (for example, we do this for floats to provide NaN formatting, specific format strings and such).

@jreback jreback added Bug Output-Formatting __repr__ of pandas objects, to_string IO CSV read_csv, to_csv Difficulty Intermediate labels Apr 19, 2016
@jreback jreback added this to the Next Major Release milestone Apr 19, 2016
@jreback jreback modified the milestones: 0.18.2, Next Major Release Jun 14, 2016
gfyoung added a commit to forking-repos/pandas that referenced this issue Jun 15, 2016
Float values were being quoted despite the quoting spec.
Bug traced to the float formatting that was unconditionally
casting all floats to string. Unconditional casting traced
back to commit 2d51b33 (gh-12194) via bisection. This commit
undoes some of those changes to rectify the behaviour.

Closes gh-12922.

[ci skip]
@k-dahl

k-dahl commented Sep 9, 2016

This problem still occurs if you use a float format, e.g.:

df.to_csv(f, quoting=csv.QUOTE_NONNUMERIC, index=False, float_format='%.2f')

Result:

"column"
"1.00"
"2.00"

Edit: It also appears to be doing the same to NaN values even without the float_format.
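Both observations follow from the same rule in the csv module: once a value has been turned into a string, `csv.QUOTE_NONNUMERIC` quotes it. The sketch below uses only the standard library; `render` is a hypothetical stand-in for the formatting step, and it assumes (as the thread suggests) that a float format yields a string and that NaN is replaced by a string placeholder such as `''` before the writer sees it. It is not pandas' actual implementation.

```python
import csv
import io
import math


def render(values, float_format=None, na_rep=""):
    """Hypothetical stand-in: format one column, then write it with QUOTE_NONNUMERIC."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    for value in values:
        if math.isnan(value):
            cell = na_rep                # assumed: NaN becomes a string placeholder
        elif float_format is not None:
            cell = float_format % value  # "%.2f" % 1.0 is the str "1.00"
        else:
            cell = value                 # still a float, so the writer leaves it unquoted
        writer.writerow([cell])
    return buf.getvalue()


print(render([1.0, 2.0]))                       # unquoted floats
print(render([1.0, 2.0], float_format="%.2f"))  # quoted strings
print(render([1.0, float("nan")]))              # the NaN row is written as ""
```

Under these assumptions, the quoted `"1.00"` and the quoted empty field for NaN are both what the csv writer is expected to produce for string input.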

@jreback
Contributor

jreback commented Sep 9, 2016

@blitzd the issue here is that the second you apply a float format, the value is a string, so this behavior is correct. That said, I think we could document this. Can you open a new issue for that?

@k-dahl

k-dahl commented Sep 9, 2016

@jreback I can see that being the case for a format string that would make the value non-numeric. I would argue that 1.00 is still a numeric value, though.

Also - any thoughts on the NaN bit? That occurs regardless of the float_format.

How hard would it be to let the caller explicitly define which columns are quoted? Or is this already possible?

I will add a new issue with reference.

Edit: Re: 'how hard would it be', a bit of a hackish method but it works for me:

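import csv

import pandas as pd

def write_csv(df, filename, quote_header=True, unquoted_cols=None, index=False, float_format='%.2f'):
    # Avoid a mutable default argument.
    unquoted_cols = [] if unquoted_cols is None else unquoted_cols

    working_df = df.copy()
    # Quote every column by hand unless it is listed in unquoted_cols; nulls become "".
    for col in working_df.columns:
        if col not in unquoted_cols:
            working_df[col] = working_df[col].apply(lambda x: '""' if pd.isnull(x) else '"{}"'.format(x))
    if quote_header:
        # Wrap the header names in quotes as well.
        rename_dict = {}
        for col in working_df.columns:
            rename_dict[col] = '"{}"'.format(col)

        working_df.rename(columns=rename_dict, inplace=True)

    # QUOTE_NONE stops the csv writer from adding quotes of its own;
    # float_format then only affects the columns left unquoted.
    working_df.to_csv(filename, quoting=csv.QUOTE_NONE, index=index, float_format=float_format)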
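The same idea of choosing quoted columns explicitly can be sketched without pandas. This is an illustrative helper only: `format_row` and its parameters are hypothetical names, and the sketch does not handle delimiters or newlines inside unquoted fields. Unlike the helper above, it doubles embedded quote characters, which is the usual CSV convention for a quote inside a quoted field.

```python
def format_row(row, quoted_idx, float_format="%.2f"):
    """Hypothetical helper: build one CSV line, quoting only the chosen positions."""
    fields = []
    for i, value in enumerate(row):
        if value is None:
            text = ""
        elif isinstance(value, float):
            text = float_format % value
        else:
            text = str(value)
        if i in quoted_idx:
            # Double any embedded quotes, then wrap the field in quotes.
            text = '"' + text.replace('"', '""') + '"'
        fields.append(text)
    return ",".join(fields)


# Quote columns 0 and 2, and leave the float in column 1 unquoted.
line = format_row(["a,b", 1.5, 'say "hi"'], quoted_idx={0, 2})
print(line)
```

Keeping the quoting decision per column, rather than per Python type, is what lets a formatted float such as `1.50` stay unquoted here.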
