CF encoding should preserve vlen dtype for empty arrays #7862

tomwhite · 2023-05-22T16:41:11Z

Closes Zarr store array dtype changes for empty object string #7328
Tests added

The idea is that it enables a workaround for #7328, where the dtype metadata can be set to a vlen string so that it doesn't get changed to float.

The line that is changed in the example in #7328 is:

ds = xr.Dataset({"a": np.array([], dtype=coding.strings.create_vlen_dtype(str))})
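For readers unfamiliar with the helper used in that line: the `metadata: {'element_type': <class 'str'>}` output quoted later in this thread suggests that `create_vlen_dtype(str)` yields an ordinary NumPy object dtype with an `element_type` entry in its metadata. A minimal NumPy-only sketch of that idea follows. The names `tagged` and `empty` are illustrative, and the dtype is built by hand here on the assumption that it matches what the xarray helper returns.

```python
import numpy as np

# Assumption: this mirrors the dtype produced by
# coding.strings.create_vlen_dtype(str); it is built by hand so that the
# sketch only depends on NumPy.
tagged = np.dtype("O", metadata={"element_type": str})

# An empty array has no elements to inspect, so the dtype metadata is the
# only remaining hint that the array is meant to hold strings.
empty = np.array([], dtype=tagged)

print(empty.dtype)                 # object
print(dict(empty.dtype.metadata))  # {'element_type': <class 'str'>}
```

The point of the workaround is that this tag survives on the in-memory array even when there is no data from which a string type could be inferred.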

kmuehlbauer · 2023-05-24T07:01:44Z

Thanks @tomwhite for the PR. I've only quickly checked the approach, which looks reasonable. But those changes have implications for several places in the backend code, which we would have to sort out.

Considering this example:

import numpy as np
import xarray as xr
print("creating dataset with empty string array")
print("-----------------------------------------")
dtype = xr.coding.strings.create_vlen_dtype(str)
ds = xr.Dataset({"a": np.array([], dtype=dtype)})
print(f"dtype: {ds['a'].dtype}")
print(f"metadata: {ds['a'].dtype.metadata}")
ds.to_netcdf("a.nc", engine="netcdf4")

print("\nncdump")
print("-------")
!ncdump a.nc

engines = ["netcdf4", "h5netcdf"]
for engine in engines:
    with xr.open_dataset("a.nc", engine=engine) as ds:
        print(f"\nloading with {engine}")
        print("-------------------")
        print(f"dtype: {ds['a'].dtype}")
        print(f"metadata: {ds['a'].dtype.metadata}")

creating dataset with empty string array
-----------------------------------------
dtype: object
metadata: {'element_type': <class 'str'>}

ncdump
-------
netcdf a {
dimensions:
	a = UNLIMITED ; // (0 currently)
variables:
	string a(a) ;
data:
}

loading with netcdf4
-------------------
dtype: object
metadata: None

loading with h5netcdf
-------------------
dtype: object
metadata: {'vlen': <class 'str'>}

Engine netcdf4 does not roundtrip here, losing the dtype metadata information. There is special casing for h5netcdf backend, though.

The cause is actually located in open_store_variable of the netcdf4 backend, where the underlying data is converted to a Variable (which does some object dtype twiddling).

Unfortunately I do not have an immediate solution here.
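The two load results above differ only in the dtype metadata: netcdf4 returns none, while h5netcdf returns a `vlen` key instead of the `element_type` key that was written. A small sketch of a check that accepts either key is shown below. The helper name `vlen_element_type` is hypothetical and is not claimed to be xarray's own implementation.

```python
import numpy as np


def vlen_element_type(dtype):
    """Return the tagged element type of an object dtype, or None (hypothetical helper)."""
    if dtype.kind != "O" or dtype.metadata is None:
        return None
    # "element_type" is the key shown when the dataset was created;
    # "vlen" is the key shown in the h5netcdf output above.
    return dtype.metadata.get("element_type", dtype.metadata.get("vlen"))


xarray_style = np.dtype("O", metadata={"element_type": str})
h5py_style = np.dtype("O", metadata={"vlen": str})
plain = np.dtype("O")  # what the netcdf4 engine returned above

for name, dt in [("xarray", xarray_style), ("h5py", h5py_style), ("plain", plain)]:
    print(name, vlen_element_type(dt))
```

With such a check, the h5netcdf result still counts as a string array, whereas the netcdf4 result is indistinguishable from a generic object array.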

tomwhite · 2023-05-24T13:23:18Z

Thanks for taking a look @kmuehlbauer and for the useful example code. I hadn't considered the netcdf cases, so thanks for pointing those out.

Engine netcdf4 does not roundtrip here, losing the dtype metadata information. There is special casing for h5netcdf backend, though.

Could netcdf4 do the same special-casing as h5netcdf?

kmuehlbauer · 2023-05-24T13:32:26Z

@tomwhite Special casing on netcdf4 backend should be possible, too.

But it might need fixing at zarr backend, too:

ds = xr.Dataset({"a": np.array([], dtype=xr.coding.strings.create_vlen_dtype(str))})
print(f"dtype: {ds['a'].dtype}")
print(f"metadata: {ds['a'].dtype.metadata}")
ds.to_zarr("a.zarr")
print("\n### Loading ###")
with xr.open_dataset("a.zarr", engine="zarr") as ds:
    print(f"dtype: {ds['a'].dtype}")
    print(f"metadata: {ds['a'].dtype.metadata}")

dtype: object
metadata: {'element_type': <class 'str'>}

### Loading ###
dtype: object
metadata: None

Could you verify the above example, please? I'm relatively new to zarr 😬

kmuehlbauer · 2023-05-24T13:52:04Z

@tomwhite I've put a commit with changes to zarr/netcdf4-backends which should preserve the dtype metadata here: https://github.com/kmuehlbauer/xarray/tree/preserve-vlen-string-dtype.

I'm not really sure if that is the right location, but since the netcdf4 backend already handles the dtype at that location, I think it will do.
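A rough sketch of what backend-side special-casing of this kind could look like is given below. It is not the code from the linked branch. It assumes that netCDF4 reports the Python type `str` for variable-length string variables and that zarr stores such strings with a filter whose `codec_id` is `"vlen-utf8"`. `FakeFilter`, `FakeZarrArray`, `netcdf4_dtype` and `zarr_dtype` are hypothetical stand-ins so that the example runs with NumPy alone.

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np

# Object dtype tagged as "variable-length str", using the metadata key
# printed when the dataset was created.
VLEN_STR = np.dtype("O", metadata={"element_type": str})


@dataclass
class FakeFilter:
    """Hypothetical stand-in for a zarr filter; only codec_id matters here."""
    codec_id: str


@dataclass
class FakeZarrArray:
    """Hypothetical stand-in for a zarr array."""
    dtype: np.dtype
    filters: Optional[list] = None


def netcdf4_dtype(reported):
    # Assumption: the backend reports `str` for vlen string variables.
    if reported is str:
        return VLEN_STR
    return np.dtype(reported)


def zarr_dtype(array):
    # Assumption: vlen strings carry a filter with codec_id "vlen-utf8".
    if any(f.codec_id == "vlen-utf8" for f in (array.filters or [])):
        return VLEN_STR
    return array.dtype


strings = FakeZarrArray(np.dtype("O"), [FakeFilter("vlen-utf8")])
numbers = FakeZarrArray(np.dtype("int32"))
print(dict(zarr_dtype(strings).metadata), zarr_dtype(numbers))
```

In both cases the stored data is untouched; only the dtype reported to the rest of the library gains the string tag.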

tomwhite · 2023-05-24T14:12:49Z

Could you verify the above example, please?

The code looks fine, and I get the same result when I run it with this PR.

Your fix in https://github.com/kmuehlbauer/xarray/tree/preserve-vlen-string-dtype changes the metadata so it is correctly preserved as metadata: {'element_type': <class 'str'>}.

I feel less qualified to evaluate the impact of the netcdf4 fix.

kmuehlbauer · 2023-05-24T14:37:58Z

Thanks for trying. I can't think of any downsides for the netcdf4-fix, as it just adds the needed metadata to the object-dtype. But you never know, so it would be good to get another set of eyes on it.

So it looks like the changes here with the fix in my branch will get your issue resolved @tomwhite, right?

I'm a bit worried that this might break other users' workflows if they depend on the current conversion to floating point for some reason. Other backends might also rely on this behaviour, especially because it has been there since the early days when xarray was known as xray.

@dcherian What would be the way to go here?

There is also a somewhat contradictory issue in #7868.

tomwhite · 2023-05-24T14:51:23Z

So it looks like the changes here with the fix in my branch will get your issue resolved @tomwhite, right?

Yes - thanks!

I'm a bit worried that this might break other users' workflows if they depend on the current conversion to floating point for some reason.

The floating point default is preserved if you do e.g. xr.Dataset({"a": np.array([], dtype=object)}). The change here will only convert to string if there is extra metadata present that says it is a string.
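The rule described here can be sketched as a small decision function. This is an illustration under stated assumptions rather than the code in xarray/conventions.py: `encoded_dtype_for_empty` is a hypothetical name, and only the `element_type` key is checked.

```python
import numpy as np


def encoded_dtype_for_empty(arr):
    """Pick the dtype an empty object array would be encoded with (hypothetical helper)."""
    if arr.dtype.kind != "O" or arr.size != 0:
        raise ValueError("this sketch only covers empty object arrays")
    meta = arr.dtype.metadata or {}
    if meta.get("element_type") is str:
        # Tagged as a vlen string: leave the dtype unchanged.
        return arr.dtype
    # Untagged: keep the long-standing floating point default.
    return np.dtype(float)


untagged = np.array([], dtype=object)
tagged = np.array([], dtype=np.dtype("O", metadata={"element_type": str}))

print(encoded_dtype_for_empty(untagged))  # float64
print(encoded_dtype_for_empty(tagged))    # object
```

Existing workflows that pass a plain `dtype=object` array therefore take the same branch as before; only arrays that explicitly carry the string tag take the new one.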

kmuehlbauer

After discussion with @tomwhite this PR should now preserve the vlen dtype for empty arrays (on encoding) and preserve the vlen dtype metadata for zarr/netcdf4 backends.

Tests would have to be added for the latter and a zarr expert might want to have another look at that part.

xarray/backends/netCDF4_.py

xarray/backends/zarr.py

xarray/conventions.py

xarray/tests/test_conventions.py

kmuehlbauer · 2023-06-01T13:06:32Z

@tomwhite I've added tests to check the backend code for vlen string dtype metadata. I also had to add a specific check for the h5py vlen string metadata. I think we've covered everything for the proposed change to allow empty vlen string dtype metadata.

I'm looking at the mypy error and do not have the slightest clue what and where to change. Any help appreciated.

tomwhite · 2023-06-02T13:44:43Z

@kmuehlbauer thanks for adding tests! I'm not sure what the mypy error is either, I'm afraid...

dcherian · 2023-06-02T20:14:33Z

xarray/tests/test_coding_strings.py:36: error: No overload variant of "dtype" matches argument types "str", "Dict[str, Type[str]]" [call-overload]

cc @Illviljan @headtr1ck

headtr1ck · 2023-06-02T20:27:46Z

This seems to be a numpy issue, mypy thinks that you cannot call np.dtype like you do.

Might be worth an issue over at numpy with the example from the test.

For now we can simply ignore this error.

kmuehlbauer · 2023-06-06T09:04:39Z

Might be worth an issue over at numpy with the example from the test.

numpy/numpy#23886

kmuehlbauer · 2023-06-06T13:30:15Z

Might be worth an issue over at numpy with the example from the test.

numpy/numpy#23886

The issue is already resolved over at numpy which is really great! It was also marked as backport. @headtr1ck How are these issues resolved currently or how do we track removing the ignore?

headtr1ck · 2023-06-06T13:31:34Z

If you want, you can leave a comment.
But the mypy CI should fail on unused ignores, so we will notice :)

kmuehlbauer · 2023-06-06T13:55:21Z

But the mypy CI should fail on unused ignores, so we will notice :)

I've added an ignore for now.
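For reference, a line of the following shape is valid at runtime even though mypy rejected it with the `call-overload` error quoted earlier. The variable name is illustrative, and the remark about unused ignores assumes the project enables mypy's `warn_unused_ignores` option.

```python
import numpy as np

# At the time of this thread, mypy reported [call-overload] for this call
# because of the NumPy type stubs. The ignore comment silences that one
# error code only.
h5py_style = np.dtype("O", metadata={"vlen": str})  # type: ignore[call-overload]

print(h5py_style.metadata["vlen"])  # <class 'str'>
```

Once the stubs accept the call, a type checker configured to warn about unused ignores would flag the comment, which is the reminder to remove it.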

jhamman

Overall, this approach seems fine to me. I left one comment that should be easy to address. Thanks @tomwhite and @kmuehlbauer for making this work.

xarray/backends/netCDF4_.py

kmuehlbauer · 2023-06-16T05:54:46Z

Thanks @tomwhite and all others for getting this in.

tomwhite · 2023-06-16T07:00:35Z

Thank you @kmuehlbauer for your work on this!

* CF encoding should preserve vlen dtype for empty arrays
* [pre-commit.ci] auto fixes from pre-commit.com hooks

  for more information, see https://pre-commit.ci
* preserve vlen string dtype in netcdf4 and zarr backends
* check for h5py-variant ("vlen") in coding.strings.check_vlen_dtype
* add test to check preserving vlen dtype for empty vlen string arrays
* ignore call_overload error for np.dtype("O", metadata={"vlen": str})
* use filter.codec_id instead of private filter._meta as suggested in review
* update comment and add whats-new.rst entry
* fix whats-new.rst
* fix whats-new.rst (missing dot)

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kai Mühlbauer <[email protected]>
Co-authored-by: Kai Mühlbauer <[email protected]>
Co-authored-by: Deepak Cherian <[email protected]>

CF encoding should preserve vlen dtype for empty arrays

74dd88c

github-actions bot added the topic-CF conventions label May 22, 2023

[pre-commit.ci] auto fixes from pre-commit.com hooks

8238000

for more information, see https://pre-commit.ci

dcherian requested a review from kmuehlbauer May 23, 2023 19:15

kmuehlbauer mentioned this pull request May 24, 2023

Zarr store array dtype changes for empty object string #7328

Closed

4 tasks

kmuehlbauer added the needs discussion label May 24, 2023

kmuehlbauer mentioned this pull request May 24, 2023

open_dataset with chunks="auto" fails when a netCDF4 variables/coordinates is encoded as NC_STRING #7868

Closed

preserve vlen string dtype in netcdf4 and zarr backends

61ada46

github-actions bot added io topic-backends topic-zarr Related to zarr storage library labels May 25, 2023

kmuehlbauer reviewed May 31, 2023

xarray/backends/netCDF4_.py

xarray/backends/zarr.py

xarray/conventions.py

xarray/tests/test_conventions.py

kmuehlbauer added 2 commits June 1, 2023 14:09

check for h5py-variant ("vlen") in coding.strings.check_vlen_dtype

b7d4cf5

add test to check preserving vlen dtype for empty vlen string arrays

a56108e

dcherian requested a review from jhamman June 2, 2023 14:35

tomwhite mentioned this pull request Jun 5, 2023

vcf_to_zarr creates zero-sized first chunk which results in incorrect dtype. sgkit-dev/sgkit#1090

Closed

kmuehlbauer added 2 commits June 6, 2023 11:13

ignore call_overload error for np.dtype("O", metadata={"vlen": str})

ee3dac5

Merge branch 'main' into preserve-vlen-dtype

d08a992

use filter.codec_id instead of private filter._meta as suggested in review

2d7ed90

jhamman reviewed Jun 11, 2023

xarray/backends/netCDF4_.py

kmuehlbauer added 4 commits June 12, 2023 08:10

update comment and add whats-new.rst entry

d0cea44

Merge remote-tracking branch 'origin/main' into preserve-vlen-dtype

e75c6ff

fix whats-new.rst

8afa51c

fix whats-new.rst (missing dot)

8428a8a

kmuehlbauer added plan to merge Final call for comments and removed needs discussion labels Jun 12, 2023

Merge branch 'main' into preserve-vlen-dtype

334802c

dcherian enabled auto-merge (squash) June 16, 2023 03:20

dcherian merged commit 0c876e4 into pydata:main Jun 16, 2023
