netCDF string types
We have several options for storing strings in netCDF files:
- `NC_CHAR`: netCDF's legacy character type. The closest match is NumPy's `'S1'` dtype. In principle, it's supposed to be able to store arbitrary bytes. On HDF5, it uses a UTF-8 encoded string with a fixed size of 1 (but note that HDF5 does not complain about storing arbitrary bytes).
- `NC_STRING`: netCDF's newer variable-length string type. It's only available on netCDF4 (not netCDF3). It corresponds to an HDF5 variable-length string with UTF-8 encoding.
- `NC_CHAR` with an `_Encoding` attribute: xarray and netCDF4-Python support an ad-hoc convention for storing unicode strings in `NC_CHAR` data-types, by adding the attribute `{'_Encoding': 'UTF-8'}`. The data is still stored as fixed-width strings, but xarray (and netCDF4-Python) can decode them as unicode.
`NC_STRING` would seem like a clear win in cases where it's supported, but as @crusaderky points out in #2040, it actually results in much larger netCDF files in many cases than using character arrays, which are more easily compressed. Nonetheless, we currently default to storing unicode strings as `NC_STRING`, because it's the most portable option -- every tool that handles HDF5 and netCDF4 should be able to read it properly as unicode strings.
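To make the trade-off concrete, here's a minimal sketch (file names and data are arbitrary; assumes xarray and netCDF4 are installed) that writes the same repetitive unicode data both ways and compares file sizes:

```python
import os

import numpy as np
import xarray as xr

# Repetitive unicode data, where fixed-width character arrays compress well.
ds = xr.Dataset({'data': ('x', np.array([u'abc'] * 100000))})

# Current default for unicode on netCDF4: variable-length NC_STRING.
ds.to_netcdf('vlen.nc', format='NETCDF4')

# Fixed-width NC_CHAR (dtype='S1'), with zlib compression enabled.
ds.to_netcdf('char.nc', format='NETCDF4',
             encoding={'data': {'dtype': 'S1', 'zlib': True}})

print('NC_STRING:', os.path.getsize('vlen.nc'), 'bytes')
print('NC_CHAR:  ', os.path.getsize('char.nc'), 'bytes')
```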
NumPy/Python string types
On the Python side, our options are perhaps even more confusing:
- NumPy's `dtype=np.string_` corresponds to fixed-length bytes. This is the default dtype for strings on Python 2, because on Python 2 strings are the same as bytes.
- NumPy's `dtype=np.unicode_` corresponds to fixed-length unicode. This is the default dtype for strings on Python 3, because on Python 3 strings are the same as unicode.
- Strings are also commonly stored in numpy arrays with `dtype=np.object_`, as arrays of either `bytes` or `unicode` objects. This is a pragmatic choice, because otherwise NumPy has no support for variable-length strings. We also use this (like pandas) to mark missing values with `np.nan`.
Like pandas, we are pretty liberal about converting back and forth between fixed-length (`np.string_`/`np.unicode_`) and variable-length (object dtype) representations of strings as necessary. This works pretty well, though converting from object arrays in particular has downsides, since it cannot be done lazily with dask.
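For concreteness, here's a small sketch of the three NumPy representations:

```python
import numpy as np

# Fixed-length bytes (the default for str on Python 2):
fixed_bytes = np.array([b'abc'])    # dtype('S3')

# Fixed-length unicode (the default for str on Python 3):
fixed_unicode = np.array([u'abc'])  # dtype('<U3')

# Object dtype: variable-length strings, with np.nan marking missing values:
var_length = np.array([u'abc', u'de', np.nan], dtype=object)

print(fixed_bytes.dtype, fixed_unicode.dtype, var_length.dtype)
```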
Current behavior of xarray
Currently, xarray uses the same behavior on Python 2/3. The priority was faithfully round-tripping data from a particular version of Python to netCDF and back, which the current serialization behavior achieves:
| Python version | NetCDF version | NumPy datatype | NetCDF datatype |
|---|---|---|---|
| Python 2 | NETCDF3 | np.string_ / str | NC_CHAR |
| Python 2 | NETCDF4 | np.string_ / str | NC_CHAR |
| Python 3 | NETCDF3 | np.string_ / bytes | NC_CHAR |
| Python 3 | NETCDF4 | np.string_ / bytes | NC_CHAR |
| Python 2 | NETCDF3 | np.unicode_ / unicode | NC_CHAR with UTF-8 encoding |
| Python 2 | NETCDF4 | np.unicode_ / unicode | NC_STRING |
| Python 3 | NETCDF3 | np.unicode_ / str | NC_CHAR with UTF-8 encoding |
| Python 3 | NETCDF4 | np.unicode_ / str | NC_STRING |
| Python 2 | NETCDF3 | object bytes/str | NC_CHAR |
| Python 2 | NETCDF4 | object bytes/str | NC_CHAR |
| Python 3 | NETCDF3 | object bytes | NC_CHAR |
| Python 3 | NETCDF4 | object bytes | NC_CHAR |
| Python 2 | NETCDF3 | object unicode | NC_CHAR with UTF-8 encoding |
| Python 2 | NETCDF4 | object unicode | NC_STRING |
| Python 3 | NETCDF3 | object unicode/str | NC_CHAR with UTF-8 encoding |
| Python 3 | NETCDF4 | object unicode/str | NC_STRING |
This can also be selected explicitly for most data-types by setting `dtype` in `encoding`:
- `'S1'` for NC_CHAR (with or without encoding)
- `str` for NC_STRING (though I'm not 100% sure it works properly currently when given bytes)
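For example, a sketch of both explicit choices (file names are arbitrary):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({'data': ('x', np.array([u'foo', u'bar']))})

# Store as NC_CHAR, using the _Encoding convention for unicode:
ds.to_netcdf('char.nc', encoding={'data': {'dtype': 'S1'}})

# Store as variable-length NC_STRING (netCDF4 only):
ds.to_netcdf('string.nc', format='NETCDF4',
             encoding={'data': {'dtype': str}})
```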
Script for generating table:
```python
from __future__ import print_function

import sys
import uuid

import netCDF4
import numpy as np
import xarray as xr

# Write each NumPy string representation with each netCDF format, then
# inspect the on-disk datatype with netCDF4-Python.
for dtype_name, value in [
    ('np.string_ / ' + type(b'').__name__, np.array([b'abc'])),
    ('np.unicode_ / ' + type(u'').__name__, np.array([u'abc'])),
    ('object bytes/' + type(b'').__name__, np.array([b'abc'], dtype=object)),
    ('object unicode/' + type(u'').__name__, np.array([u'abc'], dtype=object)),
]:
    for format in ['NETCDF3_64BIT', 'NETCDF4']:
        filename = str(uuid.uuid4()) + '.nc'
        xr.Dataset({'data': value}).to_netcdf(filename, format=format)
        with netCDF4.Dataset(filename) as f:
            var = f.variables['data']
            disk_dtype = var.dtype
            # The _Encoding attribute marks NC_CHAR data holding UTF-8 text.
            has_encoding = hasattr(var, '_Encoding')
            disk_dtype_name = (('NC_CHAR' if disk_dtype == 'S1' else 'NC_STRING') +
                               (' with UTF-8 encoding' if has_encoding else ''))
        # Emit one markdown table row per combination.
        print('|', 'Python %i' % sys.version_info[0],
              '|', format[:7],
              '|', dtype_name,
              '|', disk_dtype_name,
              '|')
```
Potential alternatives
The main option I'm considering is switching the default to `NC_CHAR` with UTF-8 encoding for np.string_ / str and object bytes/str on Python 2. The current behavior could be explicitly toggled by setting an encoding of `{'_Encoding': None}`.
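As a hypothetical sketch of what opting out might look like (`{'_Encoding': None}` is the toggle proposed here, not an existing API):

```python
import numpy as np
import xarray as xr

# Arbitrary (non-UTF-8) bytes; under the proposal, serializing these would
# raise an error unless _Encoding is explicitly disabled.
ds = xr.Dataset({'data': ('x', np.array([b'\xff\xfe']))})

# Hypothetical opt-out -- {'_Encoding': None} is the proposed toggle:
ds.to_netcdf('raw.nc', encoding={'data': {'dtype': 'S1', '_Encoding': None}})
```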
This would imply two changes:
- Attempting to serialize arbitrary bytes (on Python 2) would start raising an error -- anything that isn't ASCII would require explicitly disabling `_Encoding`.
- Strings read back from disk on Python 2 would come back as unicode instead of bytes.
This implicit conversion would be consistent with Python 2's general handling of bytes/unicode, and facilitate reading netCDF files on Python 3 that were written with Python 2.
The counter-argument is that it may not be worth changing this at this late point, given that we will be sunsetting Python 2 support by year's end.