Representing missing values in string arrays on disk #1647
It occurs to me that yet another option is to avoid using
This should be possible now:

    import numpy as np
    import xarray as xr

    ds = xr.Dataset({'foo': ('x', np.array(['bar', np.nan], dtype=object), {}, {'_FillValue': 'baz'})})
    ds.to_netcdf('foobar.nc')
    print(xr.open_dataset('foobar.nc').load())

    <xarray.Dataset>
    Dimensions:  (x: 2)
    Dimensions without coordinates: x
    Data variables:
        foo      (x) object 'bar' nan
This came up as part of my clean-up of serializing unicode strings in #1648.
There are two ways to represent strings in netCDF files:

- As character arrays (NC_CHAR), supported by both netCDF3 and netCDF4.
- As variable-length strings (NC_STRING), only supported by netCDF4/HDF5.

Currently, by default (if no _FillValue is set), we replace missing values (NaN) with an empty string when writing data to disk.

For character arrays, we could use the normal _FillValue mechanism to set a fill value and decode it when data is read back from disk. In fact, this already works for dtype=bytes (though it isn't documented).

For variable-length strings, it currently isn't possible to set a fill value, so there's no good way to indicate missing values, though this may change in the future depending on the resolution of the netCDF4-python issue.
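
To make the fill-value mechanism for character arrays concrete, here is a minimal sketch in plain NumPy. It is not xarray's encoding code: the helper names (is_missing, encode_with_fill, decode_with_fill) and the fill value b'baz' are hypothetical and chosen only for illustration.

```python
import numpy as np


def is_missing(values):
    """Boolean mask of entries that are None or a float NaN."""
    flat = [v is None or (isinstance(v, float) and v != v) for v in values.ravel()]
    return np.array(flat, dtype=bool).reshape(values.shape)


def encode_with_fill(values, fill_value):
    """Swap missing entries for fill_value and return a fixed-width bytes array."""
    values = np.array(values, dtype=object)  # copy, so the input is not modified
    values[is_missing(values)] = fill_value
    return values.astype(np.bytes_)


def decode_with_fill(encoded, fill_value):
    """Turn entries equal to fill_value back into NaN in an object array."""
    decoded = encoded.astype(object)
    decoded[encoded == fill_value] = np.nan
    return decoded


# Hypothetical usage: b'baz' stands in for the variable's _FillValue.
encoded = encode_with_fill([b'bar', np.nan], b'baz')
decoded = decode_with_fill(encoded, b'baz')
```

Note that a genuine b'baz' in the data would also be decoded as missing, which is why the fill value has to be chosen explicitly for each variable.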
It would obviously be nice to always automatically round-trip missing values, both for strings and bytes. I see two possible ways to do this:

1. Require an explicit _FillValue when a string contains missing values, by raising an error if this isn't done. We need an explicit choice because there aren't any extra unused characters left over, at least for character arrays. (NetCDF explicitly allows arbitrary bytes to be stored in NC_CHAR, even though this maps to an HDF5 fixed-width string with ASCII encoding.) For variable-length strings, we could potentially set a non-character unicode symbol like U+FFFF, but again that isn't supported yet.
2. Treat empty strings as equivalent to missing values. This doesn't require an explicit choice of _FillValue, so we don't need to wait for any netCDF4 issues to be resolved. However, this does mean that empty strings would not round-trip. Still, given the relative prevalence of missing values vs. empty strings in xarray/pandas, it's probably the lesser evil to not preserve empty strings.

The default option is to adopt neither of these, and keep the current behavior where missing values are written as empty strings and not decoded at all.
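
The difference between the two options can be sketched in plain NumPy as follows. This is only an illustration under assumptions, not a proposed implementation: the function names (encode_option1, encode_option2, decode_option2) and the error message are hypothetical.

```python
import numpy as np


def _missing_mask(values):
    """Boolean mask of entries that are None or a float NaN."""
    flat = [v is None or (isinstance(v, float) and v != v) for v in values.ravel()]
    return np.array(flat, dtype=bool).reshape(values.shape)


def encode_option1(values, fill_value=None):
    """Option (1): refuse to write missing values without an explicit fill value."""
    values = np.array(values, dtype=object)
    mask = _missing_mask(values)
    if mask.any():
        if fill_value is None:
            raise ValueError(
                "string data contains missing values; set an explicit _FillValue"
            )
        values[mask] = fill_value
    return values


def encode_option2(values):
    """Option (2), writing: missing values become empty strings (current behavior)."""
    values = np.array(values, dtype=object)
    values[_missing_mask(values)] = ""
    return values


def decode_option2(values):
    """Option (2), reading: every empty string is decoded as NaN."""
    values = np.array(values, dtype=object)
    empty = [isinstance(v, str) and v == "" for v in values.ravel()]
    values[np.array(empty, dtype=bool).reshape(values.shape)] = np.nan
    return values


# With option (2), the genuine empty string and the NaN both come back as NaN.
round_tripped = decode_option2(encode_option2(["bar", "", np.nan]))
```

Under option (1) the caller has to pick a value that never occurs in the data. Under option (2) no choice is needed, at the cost of losing the distinction between empty and missing.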
Any opinions? I am leaning towards option (2).