Representing missing values in string arrays on disk #1647

shoyer · 2017-10-23T05:01:10Z

This came up as part of my clean-up of serializing unicode strings in #1648.

There are two ways to represent strings in netCDF files.

1. As character arrays (NC_CHAR), supported by both netCDF3 and netCDF4.
2. As variable length unicode strings (NC_STRING), only supported by netCDF4/HDF5.
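
To make the character-array layout concrete, here is a minimal pure-Python sketch. It is not xarray's or netCDF's actual code: `to_char_array` and `from_char_array` are hypothetical helpers, plain lists stand in for arrays, and UTF-8 encoding with null-byte padding is an assumption made for illustration.

```python
def to_char_array(strings, encoding="utf-8"):
    """Encode strings into fixed-width rows of single bytes (an NC_CHAR-like layout).

    Hypothetical helper for illustration; shorter strings are padded with null bytes.
    """
    encoded = [s.encode(encoding) for s in strings]
    width = max((len(b) for b in encoded), default=0)
    return [
        [b[i:i + 1] if i < len(b) else b"\x00" for i in range(width)]
        for b in encoded
    ]


def from_char_array(rows, encoding="utf-8"):
    """Join each row of single bytes, strip the padding, and decode to str."""
    return [b"".join(row).rstrip(b"\x00").decode(encoding) for row in rows]


chars = to_char_array(["bar", "é", ""])
# Every row has the same width; the empty string becomes a row of pure padding.
restored = from_char_array(chars)
```

In this layout an empty string is indistinguishable from pure padding, and every other byte value is potentially valid data, so there is no spare pattern that could mean "missing" without an explicit choice.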

Currently, by default (if no _FillValue is set) we replace missing values (NaN) with an empty string when writing data to disk.

For character arrays, we could use the normal _FillValue mechanism to set a fill value when writing and decode it back to NaN when the data is read from disk. In fact, this already works for dtype=bytes (though it isn't documented):

In [10]: ds = xr.Dataset({'foo': ('x', np.array([b'bar', np.nan], dtype=object), {}, {'_FillValue': b''})})

In [11]: ds
Out[11]:
<xarray.Dataset>
Dimensions:  (x: 2)
Dimensions without coordinates: x
Data variables:
    foo      (x) object b'bar' nan

In [12]: ds.to_netcdf('foobar.nc')

In [13]: xr.open_dataset('foobar.nc').load()
Out[13]:
<xarray.Dataset>
Dimensions:  (x: 2)
Dimensions without coordinates: x
Data variables:
    foo      (x) object b'bar' nan

For variable length strings, it currently isn't possible to set a fill value. So there's no good way to indicate missing values, though this may change in the future depending on the resolution of the netCDF-python issue.

It would obviously be nice to always automatically round-trip missing values, both for strings and bytes. I see two possible ways to do this:

1. Require setting an explicit _FillValue when a string array contains missing values, by raising an error if this isn't done. We need an explicit choice because there aren't any extra unused characters left over, at least for character arrays. (NetCDF explicitly allows arbitrary bytes to be stored in NC_CHAR, even though this maps to an HDF5 fixed-width string with ASCII encoding.)
   For variable length strings, we could potentially set a non-character unicode symbol like U+FFFF, but again that isn't supported yet.
2. Treat empty strings as equivalent to a missing value (NaN). This has the advantage of not requiring an explicit choice of _FillValue, so we don't need to wait for any netCDF4 issues to be resolved. However, this does mean that empty strings would not round-trip. Still, given the relative prevalence of missing values vs empty strings in xarray/pandas, it's probably the lesser evil to not preserve empty strings.
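
The trade-off between the two options can be sketched in plain Python. This is only an illustration under stated assumptions, not xarray's implementation: the function names are hypothetical, lists stand in for object arrays, and float NaN is used as the missing-value marker.

```python
import math


def is_missing(value):
    """True for float NaN, the missing-value marker used in this sketch."""
    return isinstance(value, float) and math.isnan(value)


def encode_option1(values, fill_value=None):
    """Option 1: refuse to write missing values unless a fill value is given."""
    if any(is_missing(v) for v in values):
        if fill_value is None:
            raise ValueError("missing values found; set an explicit _FillValue")
        return [fill_value if is_missing(v) else v for v in values]
    return list(values)


def encode_option2(values):
    """Option 2: write missing values as empty strings."""
    return ["" if is_missing(v) else v for v in values]


def decode_option2(values):
    """Option 2: read empty strings back as missing (NaN)."""
    return [math.nan if v == "" else v for v in values]


data = ["bar", "", math.nan]
roundtripped = decode_option2(encode_option2(data))
# The original empty string comes back as NaN, so it does not round-trip.
```

Option 1 keeps empty strings intact at the cost of an error whenever no fill value was chosen; option 2 never fails but merges empty strings into NaN.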

The default option is to adopt neither of these, and keep the current behavior where missing values are written as empty strings and not decoded at all.

Any opinions? I am leaning towards option (2).


shoyer · 2017-10-23T07:48:56Z

It occurs to me that yet another option is to avoid using _FillValue:

Instead of using _FillValue, set the missing_value attribute, which is not directly used by netCDF libraries. We could thus choose to let missing_value be a unicode string, and compare against it to find missing values after decoding back into unicode. In contrast, _FillValue is required to be a valid scalar for the encoded netCDF variable, which means a single-byte character when using character encoding. It's pretty awkward for the valid choices of _FillValue to change based on which netCDF library / version is being used, so I'd really like to avoid that. (In the current version of #1648, I resolve this issue by encoding unicode fill values into bytes, but this means you need to know that your unicode character encodes into a single byte.)
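
A small sketch of why this matters, assuming UTF-8 encoding. The helper names are hypothetical and this is not the code from #1648: it only shows that a unicode marker such as U+FFFF does not fit in one byte, while a missing_value comparison done after decoding has no such constraint.

```python
import math


def fits_char_fill_value(fill, encoding="utf-8"):
    """True if `fill` encodes to exactly one byte, as a character-array fill value must."""
    return len(fill.encode(encoding)) == 1


def decode_with_missing_value(raw_items, missing_value, encoding="utf-8"):
    """Decode bytes to str first, then compare against a unicode missing_value."""
    decoded = [raw.decode(encoding) for raw in raw_items]
    return [math.nan if s == missing_value else s for s in decoded]


# U+FFFF needs three bytes in UTF-8, so it cannot serve as a one-byte fill value.
ok_ascii = fits_char_fill_value("?")
ok_ffff = fits_char_fill_value("\uffff")

# Comparing after decoding works regardless of how many bytes the marker takes.
stored = ["bar".encode("utf-8"), "\uffff".encode("utf-8")]
values = decode_with_missing_value(stored, missing_value="\uffff")
```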

stale · 2019-09-23T07:58:53Z

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically

kmuehlbauer · 2024-02-06T13:03:40Z

This should be possible now:

import numpy as np
import xarray as xr

ds = xr.Dataset({'foo': ('x', np.array(['bar', np.nan], dtype=object), {}, {'_FillValue': 'baz'})})
ds.to_netcdf('foobar.nc')
print(xr.open_dataset('foobar.nc').load())

<xarray.Dataset>
Dimensions:  (x: 2)
Dimensions without coordinates: x
Data variables:
    foo      (x) object 'bar' nan

shoyer added the topic-backends and design question labels Oct 23, 2017

shoyer mentioned this issue Oct 23, 2017

Roundtrip unicode strings even when written as character arrays #1648

Merged

4 tasks

delgadom mentioned this issue Dec 29, 2017

Handle _FillValue in variable-length unicode string variables #1802

Closed

5 tasks

stale bot added the stale label Sep 23, 2019

dcherian removed the stale label Sep 23, 2019

benbovy mentioned this issue Apr 28, 2021

Attributes encoding compatibility between backends #5226

Open

kmuehlbauer mentioned this issue Mar 24, 2023

cf-coding #7654

Closed

4 tasks

aburrell mentioned this issue Apr 4, 2023

xarray does not support use of '_FillValue' for unicode strings pysat/pysat#1102

Open

kmuehlbauer mentioned this issue Sep 20, 2023

preserve vlen string dtypes, allow vlen string fill_values #7869

Merged

6 tasks

kmuehlbauer closed this as completed Feb 6, 2024
