Unicode strings unexpectedly transformed to byte strings upon open_dataset #4859

Closed
kripnerl opened this issue Feb 3, 2021 · 7 comments

kripnerl commented Feb 3, 2021

What happened:

Unicode coordinates are converted to bytes after a save/load round trip with the h5netcdf backend. This makes the dataset practically unusable, because bytes != str.

What you expected to happen:

Load the string as a string.

Minimal Complete Verifiable Example:

import numpy as np
import xarray as xr

# Unicode coordinate labels and some complex-valued data
coils = np.array(["A", "B", "C", "D", "E"])
data = np.array([1 + 1j, 1 + 2j, 3j, 4j, 6j])

test_ds = xr.Dataset()
test_ds.coords["coils"] = coils
test_ds["data"] = ("coils", data)
test_ds

> <xarray.Dataset>
> Dimensions:  (coils: 5)
> Coordinates:
>  * coils    (coils) <U1 'A' 'B' 'C' 'D' 'E'
> Data variables:
>    data     (coils) complex128 (1+1j) (1+2j) 3j 4j 6j



test_ds.to_netcdf("test.nc", engine="h5netcdf")
del test_ds
test_ds = xr.open_dataset("test.nc", engine="h5netcdf")
test_ds

> <xarray.Dataset>
> Dimensions:  (coils: 5)
> Coordinates:
>   * coils    (coils) object b'A' b'B' b'C' b'D' b'E'
> Data variables:
>     data     (coils) complex128 ...
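
To spell out why the bytes labels are a problem, here is a small pure-Python illustration. It does not use xarray at all; the names `saved_labels`, `loaded_labels` and `index` are made up for this sketch, and the dictionary only stands in for a label-based lookup.

```python
# Pure-Python illustration (not xarray itself): labels that come back as
# bytes no longer match the str labels used everywhere else in the code.
saved_labels = ["A", "B", "C", "D", "E"]
loaded_labels = [label.encode("utf-8") for label in saved_labels]  # simulated reload

# In Python 3, a bytes object never compares equal to a str object.
labels_match = loaded_labels[0] == saved_labels[0]

# A lookup table keyed by the reloaded labels cannot be queried with str keys.
index = {label: position for position, label in enumerate(loaded_labels)}
found_with_str = "A" in index
found_with_bytes = b"A" in index

print(labels_match, found_with_str, found_with_bytes)  # False False True
```

Any code that selects by "A" would have to be rewritten to use b"A", which is what makes the reloaded dataset awkward to work with.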

Anything else we need to know?:

The issue may be related to #1638.

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.8.6 | packaged by conda-forge | (default, Oct 7 2020, 19:08:05)
[GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-65-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: cs_CZ.UTF-8
LOCALE: cs_CZ.UTF-8
libhdf5: 1.10.5
libnetcdf: 4.6.3

xarray: 0.16.2
pandas: 1.1.5
numpy: 1.19.4
scipy: 1.5.3
netCDF4: 1.5.5
pydap: None
h5netcdf: 0.8.1
h5py: 3.1.0
Nio: None
zarr: None
cftime: 1.3.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2020.12.0
distributed: 2020.12.0
matplotlib: 3.3.3
cartopy: None
seaborn: 0.11.0
numbagg: None
pint: 0.16.1
setuptools: 49.6.0.post20201009
pip: 20.3.3
conda: None
pytest: 6.2.1
IPython: 7.19.0
sphinx: None

mathause (Collaborator) commented Feb 3, 2021

Yes, there was a behaviour change in h5py. A fix is on the way but not yet released: h5netcdf/h5netcdf#81

For the time being you can downgrade h5py: conda install h5py=2 which should fix the issue.
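
A quick way to see whether an environment is in the situation described above is to look at the installed versions. The sketch below uses only the standard library (`importlib.metadata`, Python 3.8+). The helper names are hypothetical, and the "h5py major version >= 3" rule is only a heuristic taken from this thread (h5py 3.1.0 shows the problem, h5py 2 does not); it ignores which h5netcdf release is installed, so treat it as a first hint rather than a definitive check.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def major_version(version_string):
    """Extract the leading integer of a version string such as '3.1.0'."""
    return int(version_string.split(".")[0])


def may_be_affected(h5py_version):
    """Heuristic from this thread: h5py 2 works, h5py 3.x can return bytes."""
    return h5py_version is not None and major_version(h5py_version) >= 3


# Report the packages involved in the round trip.
for name in ("h5py", "h5netcdf", "xarray"):
    print(name, installed_version(name))

print("possibly affected:", may_be_affected(installed_version("h5py")))
```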

kripnerl (Author) commented Feb 3, 2021

A possible workaround for my problem is:

# np.str_ is the same type as np.unicode_ (the alias was removed in NumPy 2.0)
test_ds.coords["coils"] = test_ds.coils.values.astype(np.str_)
test_ds
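
The same workaround can be written with an explicit decode step, which avoids relying on how `astype` treats bytes objects. This is only a sketch: `decode_bytes_array` is a hypothetical helper, and UTF-8 is an assumed encoding, not something this thread confirms.

```python
import numpy as np


def decode_bytes_array(values, encoding="utf-8"):
    """Return a unicode array, decoding bytes elements; other arrays pass through."""
    values = np.asarray(values)
    if values.dtype.kind == "S":
        # Fixed-width bytes dtype: decode element-wise.
        return np.char.decode(values, encoding)
    if values.dtype.kind == "O" and all(
        isinstance(v, (bytes, str)) for v in values.ravel()
    ):
        # Object array holding bytes (and possibly str): decode only the bytes.
        decoded = [
            v.decode(encoding) if isinstance(v, bytes) else v for v in values.ravel()
        ]
        return np.array(decoded, dtype=str).reshape(values.shape)
    # Anything else (numbers, mixed objects, already unicode) is left untouched.
    return values


loaded = np.array([b"A", b"B", b"C", b"D", b"E"], dtype=object)
fixed = decode_bytes_array(loaded)
print(fixed.dtype.kind, fixed.tolist())
```

On the dataset from the example it would be applied the same way as above: test_ds.coords["coils"] = decode_bytes_array(test_ds.coils.values).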

kripnerl (Author) commented Feb 3, 2021

> Yes, there was a behaviour change in h5py. A fix is on the way but not yet released: h5netcdf/h5netcdf#81
>
> For the time being you can downgrade h5py: conda install h5py=2 which should fix the issue.

Thank you!

kmuehlbauer (Contributor) commented Feb 3, 2021

@kripnerl The actual fix is in h5netcdf/h5netcdf#82.

@mathause Would you mind having a look at the proposed changes, also with respect to the implications for xarray?

kmuehlbauer (Contributor) commented

@kripnerl You can test with #4893. This should fix the bytes issue.

But it seems that converting to object-dtype strings is by design (shown here with engine="netcdf4").

import numpy as np
import xarray as xr

coils = np.array(["A", "B", "C", "D", "E"])
test_ds = xr.Dataset()
test_ds.coords["coils"] = coils
print(test_ds)

> <xarray.Dataset>
> Dimensions:  (coils: 5)
> Coordinates:
>  * coils    (coils) <U1 'A' 'B' 'C' 'D' 'E'
> Data variables:
>    *empty*

test_ds.to_netcdf("test_netcdf4.nc", engine="netcdf4")
del test_ds
test_ds = xr.open_dataset("test_netcdf4.nc", engine="netcdf4")
print(test_ds)

> <xarray.Dataset>
> Dimensions:  (coils: 5)
> Coordinates:
>  * coils    (coils) object 'A' 'B' 'C' 'D' 'E'
> Data variables:
>    *empty*
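
For comparison, a small NumPy-only sketch of why the object dtype from the netcdf4 round trip is harmless while the bytes result is not. The arrays are built by hand to imitate the two outcomes shown in this thread; nothing is read from a file here.

```python
import numpy as np

original = np.array(["A", "B", "C", "D", "E"])  # dtype <U1, as written to disk

# Imitates the netcdf4 result: object array holding Python str.
as_str_objects = original.astype(object)

# Imitates the h5py 3 result from the report: object array holding bytes.
as_bytes_objects = np.array([s.encode("utf-8") for s in original], dtype=object)

# str objects still compare equal to the original labels; bytes never do.
str_objects_equal = bool((as_str_objects == original).all())
bytes_objects_equal = bool((as_bytes_objects == original).any())

# Object-dtype str can be converted back to a fixed-width unicode dtype if needed.
restored = as_str_objects.astype(str)

print(str_objects_equal, bytes_objects_equal, restored.dtype == original.dtype)
```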

kripnerl (Author) commented

@kmuehlbauer Thanks a lot, I will check it ASAP. Yes, the conversion from U4 to object is, I believe, normal behaviour. In any case, it has not caused any trouble for me so far.

kmuehlbauer (Contributor) commented

@mathause This can be closed.
