DataFrame.to_csv(quoting=csv.QUOTE_NONNUMERIC) quotes numeric values #12922

batterseapower · 2016-04-19T07:10:14Z

Failing test

def test_pandas():
    import tempfile
    import csv
    import pandas as pd
    import numpy as np

    df = pd.DataFrame.from_dict({'column': [1.0, 2.0]})
    assert df['column'].dtype == np.dtype('float')

    with tempfile.TemporaryFile() as f:
        df.to_csv(f, quoting=csv.QUOTE_NONNUMERIC, index=False)

        f.seek(0)
        lines = f.read().splitlines()
        assert lines[0] == '"column"'
        assert not lines[1].startswith('"') # <--- THIS FAILS
        assert [1, 2] == map(float, lines[1:])

The issue is that the floats are being output wrapped with quotes, even though I requested QUOTE_NONNUMERIC.

The problem is that pandas.core.internals.FloatBlock.to_native_types (and by extension pandas.formats.format.FloatArrayFormatter.get_result_as_array) unconditionally formats the float array to a str array, which is then passed unchanged to the csv module and hence will be wrapped in quotes by that code.
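To illustrate the mechanism described above without touching pandas internals, here is a minimal sketch using only the standard-library `csv` module (Python 3). The helper name `write_row` is hypothetical and exists only for this illustration; the point is that `QUOTE_NONNUMERIC` decides on the Python type of each field, so a float that has already been turned into a string is quoted.

```python
import csv
import io


def write_row(values):
    """Write one row with QUOTE_NONNUMERIC and return the resulting text."""
    buf = io.StringIO()
    csv.writer(buf, quoting=csv.QUOTE_NONNUMERIC).writerow(values)
    return buf.getvalue()


# A real float is left unquoted by the csv module...
as_float = write_row([1.0])
# ...but the same value pre-formatted as a string is treated as non-numeric.
as_text = write_row([str(1.0)])

print(repr(as_float))
print(repr(as_text))
```

This is why a block that hands a str array to the writer ends up with every value wrapped in quotes.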

I'm not 100% sure but the fix may be to have FloatBlock.to_native_types check if quoting is set, and if so to skip using the FloatArrayFormatter? I say this because pandas.indexes.base.Index._format_native_types already has a special case along these lines. This does seem a bit dirty though!

Here is an awful monkeypatch that works around the problem:

orig_to_native_types = pd.core.internals.FloatBlock.to_native_types
def to_native_types(self, *args, **kwargs):
    if kwargs.get('quoting'):
        values = self.values
        slicer = kwargs.get('slicer')
        if slicer is not None:
            values = values[:, slicer]

        return values

    res = orig_to_native_types(self, *args, **kwargs)
    print 'FloatBlock.to_native_types', args, kwargs, '=', res
    return res
pd.core.internals.FloatBlock.to_native_types = to_native_types

output of `pd.show_versions()`

commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 42 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.0
nose: None
pip: 8.1.1
setuptools: 7.0
Cython: 0.20.1
numpy: 1.11.0
scipy: 0.13.3
statsmodels: None
xarray: None
IPython: 3.2.1
sphinx: None
patsy: 0.3.0
dateutil: 2.5.2
pytz: 2016.3
blosc: None
bottleneck: 1.0.0
tables: None
numexpr: None
matplotlib: 1.3.1
openpyxl: 2.0.4
xlrd: 0.9.2
xlwt: None
xlsxwriter: None
lxml: 3.3.2
bs4: 4.2.0
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.11
pymysql: None
psycopg2: None
jinja2: 2.7.2
boto: None


jreback · 2016-04-19T12:15:40Z

This is probably only minimally tested now. The data is written through the csv writer, which gets passed the quoting. I think that needs to be turned off, as we do all quoting formatting before passing it to the writer (maybe not ALL, and that's the rub: some cases may be relying on the csv writer actually quoting things).
All things are passed as formatted strings (e.g. we do this for floats to provide NaN formatting, specific format strings and such).

Float values were being quoted despite the quoting spec. Bug traced to the float formatting that was unconditionally casting all floats to string. Unconditional casting traced back to commit 2d51b33 (gh-12194) via bisection. This commit undoes some of those changes to rectify the behaviour. Closes gh-12922. [ci skip]

k-dahl · 2016-09-09T15:51:47Z

This problem is still occurring if you are using a float format, i.e.:

df.to_csv(f, quoting=csv.QUOTE_NONNUMERIC, index=False, float_format='%.2f')

Result:

"column"
"1.00"
"2.00"

Edit: It also appears to be doing the same to NaN values even without the float_format.

jreback · 2016-09-09T19:54:36Z

@blitzd the issue here is that the second you apply a float format, the value is now a string. So this is correct. That said, I think we could document this. Can you open a new issue for that?
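A small sketch of that point, again using only the standard-library `csv` module (Python 3) rather than pandas itself. It assumes, as the comment above describes, that a value formatted with `'%.2f'` reaches the writer as a string. The empty-string row is an assumption about the NaN case: `to_csv` defaults `na_rep` to `''`, but this thread does not confirm that this is the exact path NaN values take.

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_NONNUMERIC)

value = 1.0
missing = float('nan')

# Raw float next to the same value after float_format-style formatting.
writer.writerow([value, '%.2f' % value])
# Raw NaN next to an empty-string placeholder (assumed stand-in for na_rep).
writer.writerow([missing, ''])

lines = buf.getvalue().splitlines()
for line in lines:
    print(line)
```

In both rows only the second field is quoted, because the writer sees a `str` there and a `float` in the first field.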

k-dahl · 2016-09-09T20:23:14Z

@jreback I can see that being the case for a format string that would make the value non-numeric. I would argue that 1.00 is still a numeric value though.

Also - any thoughts on the NaN bit? That occurs regardless of the float_format.

How hard would it be to have it where you could explicitly define the columns to be quoted? Or is this already possible?

I will add a new issue with reference.

Edit: Re: 'how hard would it be', a bit of a hackish method but it works for me:

def write_csv(df, filename, quote_header=True, unquoted_cols=(), index=False, float_format='%.2f'):
    import csv  # needed for csv.QUOTE_NONE below
    import pandas as pd

    working_df = df.copy()
    for col in working_df.columns:
        if col not in unquoted_cols:
            # Wrap each value in literal quotes by hand; nulls become an empty quoted field.
            working_df[col] = working_df[col].apply(lambda x: '""' if pd.isnull(x) else '"{}"'.format(x))
    if quote_header:
        rename_dict = {}
        for col in working_df.columns:
            rename_dict[col] = '"{}"'.format(col)

        working_df.rename(columns=rename_dict, inplace=True)

    # QUOTE_NONE stops the csv writer from adding any quotes of its own.
    working_df.to_csv(filename, quoting=csv.QUOTE_NONE, index=index, float_format=float_format)
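For comparison, here is a hedged alternative sketch of per-column quoting that avoids hand-built quote characters. It uses only the standard-library `csv` module (Python 3), not pandas, and the helper name `write_rows` and its parameters are hypothetical. The idea is to keep `QUOTE_NONNUMERIC` and control quoting through the type of each field: columns to be quoted are converted to `str`, the rest are passed as numbers. It does not apply a float format, and it assumes every unquoted column holds values that `float()` accepts.

```python
import csv
import io


def write_rows(fileobj, header, rows, quoted_cols):
    """Hypothetical helper: quote only the columns named in quoted_cols."""
    writer = csv.writer(fileobj, quoting=csv.QUOTE_NONNUMERIC)
    writer.writerow(header)  # header names are strings, so they are quoted
    for row in rows:
        out = []
        for name, value in zip(header, row):
            if name in quoted_cols:
                out.append(str(value))    # strings are quoted by the writer
            else:
                out.append(float(value))  # numbers are written bare
        writer.writerow(out)


buf = io.StringIO()
write_rows(buf, ['id', 'price'], [(1, 2.5), (2, 3.0)], quoted_cols={'id'})
result = buf.getvalue().splitlines()
for line in result:
    print(line)
```

With a DataFrame, the rows could presumably be supplied via `df.itertuples(index=False)` and the header via `list(df.columns)`. Because the writer does the quoting itself, embedded quote characters and delimiters are escaped by the `csv` module rather than by string formatting.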

jreback added the Bug, Output-Formatting, IO CSV, and Difficulty Intermediate labels Apr 19, 2016

jreback added this to the Next Major Release milestone Apr 19, 2016

jreback mentioned this issue May 23, 2016

DataFrame.to_csv(quoting=csv.QUOTE_NONNUMERIC) now also quotes numeric values #13259

Closed

gfyoung mentioned this issue Jun 10, 2016

BUG: Fix csv.QUOTE_NONNUMERIC quoting in to_csv #13418

Closed

jreback modified the milestones: 0.18.2, Next Major Release Jun 14, 2016

jreback closed this as completed in d814f43 Jun 16, 2016

This was referenced Aug 28, 2016

Performance regression in DataFrame.to_csv #14110

Closed

Saving CSV with backslashed-escaping is not idempotent. #14122

Open

k-dahl mentioned this issue Sep 9, 2016

Re: DataFrame.to_csv(quoting=csv.QUOTE_NONNUMERIC) quotes numeric values #14195

Closed
