to_zarr raises ValueError: Invalid dtype with mode='a' (but not with mode='w') #6345

Closed
cisaacstern opened this issue Mar 9, 2022 · 6 comments · Fixed by #6476
Labels
bug, topic-zarr (Related to zarr storage library)

Comments

@cisaacstern
Contributor

cisaacstern commented Mar 9, 2022

What happened?

A dataset in which a data variable has dtype='|S35' can be written to zarr without error as follows

import xarray as xr
import numpy as np

data = np.zeros((2, 3), dtype='|S35')
ds = xr.DataArray(data, name='foo').to_dataset()
ds.to_zarr('test.zarr', mode='w')

Changing the value of mode from 'w' to 'a' raises ValueError: Invalid dtype for data variable:

!rm -rf test.zarr
ds.to_zarr('test.zarr', mode='a')
Full Traceback
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [4], in <cell line: 1>()
----> 1 ds.to_zarr('test.zarr', mode='a')

File ~/miniconda3/envs/pangeo-forge-recipes/lib/python3.9/site-packages/xarray/core/dataset.py:2036, in Dataset.to_zarr(self, store, chunk_store, mode, synchronizer, group, encoding, compute, consolidated, append_dim, region, safe_chunks, storage_options)
   2033 if encoding is None:
   2034     encoding = {}
-> 2036 return to_zarr(
   2037     self,
   2038     store=store,
   2039     chunk_store=chunk_store,
   2040     storage_options=storage_options,
   2041     mode=mode,
   2042     synchronizer=synchronizer,
   2043     group=group,
   2044     encoding=encoding,
   2045     compute=compute,
   2046     consolidated=consolidated,
   2047     append_dim=append_dim,
   2048     region=region,
   2049     safe_chunks=safe_chunks,
   2050 )

File ~/miniconda3/envs/pangeo-forge-recipes/lib/python3.9/site-packages/xarray/backends/api.py:1406, in to_zarr(dataset, store, chunk_store, mode, synchronizer, group, encoding, compute, consolidated, append_dim, region, safe_chunks, storage_options)
   1391 zstore = backends.ZarrStore.open_group(
   1392     store=mapper,
   1393     mode=mode,
   (...)
   1402     stacklevel=4,  # for Dataset.to_zarr()
   1403 )
   1405 if mode in ["a", "r+"]:
-> 1406     _validate_datatypes_for_zarr_append(dataset)
   1407     if append_dim is not None:
   1408         existing_dims = zstore.get_dimensions()

File ~/miniconda3/envs/pangeo-forge-recipes/lib/python3.9/site-packages/xarray/backends/api.py:1301, in _validate_datatypes_for_zarr_append(dataset)
   1292         raise ValueError(
   1293             "Invalid dtype for data variable: {} "
   1294             "dtype must be a subtype of number, "
   (...)
   1297             "object".format(var)
   1298         )
   1300 for k in dataset.data_vars.values():
-> 1301     check_dtype(k)

File ~/miniconda3/envs/pangeo-forge-recipes/lib/python3.9/site-packages/xarray/backends/api.py:1292, in _validate_datatypes_for_zarr_append.<locals>.check_dtype(var)
   1283 def check_dtype(var):
   1284     if (
   1285         not np.issubdtype(var.dtype, np.number)
   1286         and not np.issubdtype(var.dtype, np.datetime64)
   (...)
   1290     ):
   1291         # and not re.match('^bytes[1-9]+$', var.dtype.name)):
-> 1292         raise ValueError(
   1293             "Invalid dtype for data variable: {} "
   1294             "dtype must be a subtype of number, "
   1295             "datetime, bool, a fixed sized string, "
   1296             "a fixed size unicode string or an "
   1297             "object".format(var)
   1298         )

ValueError: Invalid dtype for data variable: <xarray.DataArray 'foo' (dim_0: 2, dim_1: 3)>
array([[b'', b'', b''],
       [b'', b'', b'']], dtype='|S35')
Dimensions without coordinates: dim_0, dim_1 dtype must be a subtype of number, datetime, bool, a fixed sized string, a fixed size unicode string or an object

What did you expect to happen?

I would expect the behavior of mode='w' and mode='a' to be consistent as regards dtypes of data variables.

Minimal Complete Verifiable Example

See What Happened? section above

Relevant log output

See What Happened? section above

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.9.10 | packaged by conda-forge | (main, Feb  1 2022, 21:28:27) 
[Clang 11.1.0 ]
python-bits: 64
OS: Darwin
OS-release: 21.0.1
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.1
libnetcdf: 4.8.1

xarray: 2022.3.0
pandas: 1.4.1
numpy: 1.22.2
scipy: 1.8.0
netCDF4: 1.5.8
pydap: installed
h5netcdf: 999
h5py: 3.6.0
Nio: None
zarr: 2.11.0
cftime: 1.6.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.2.10
cfgrib: 0.9.8.5
iris: None
bottleneck: None
dask: 2022.02.1
distributed: 2022.2.1
matplotlib: 3.5.1
cartopy: None
seaborn: None
numbagg: None
fsspec: 2022.02.0
cupy: None
pint: None
sparse: None
setuptools: 59.8.0
pip: 22.0.4
conda: None
pytest: 6.2.5
IPython: 8.1.1
sphinx: None

cc @rabernat

@cisaacstern added the bug and needs triage labels on Mar 9, 2022
@dcherian added the needs triage and topic-zarr labels and removed the needs triage label on Mar 9, 2022
@rabernat
Contributor

rabernat commented Mar 9, 2022

The relevant code is here

xarray/xarray/backends/api.py

Lines 1405 to 1406 in d293f50

if mode in ["a", "r+"]:
    _validate_datatypes_for_zarr_append(dataset)

and here

xarray/xarray/backends/api.py

Lines 1280 to 1298 in d293f50

def _validate_datatypes_for_zarr_append(dataset):
    """DataArray.name and Dataset keys must be a string or None"""

    def check_dtype(var):
        if (
            not np.issubdtype(var.dtype, np.number)
            and not np.issubdtype(var.dtype, np.datetime64)
            and not np.issubdtype(var.dtype, np.bool_)
            and not coding.strings.is_unicode_dtype(var.dtype)
            and not var.dtype == object
        ):
            # and not re.match('^bytes[1-9]+$', var.dtype.name)):
            raise ValueError(
                "Invalid dtype for data variable: {} "
                "dtype must be a subtype of number, "
                "datetime, bool, a fixed sized string, "
                "a fixed size unicode string or an "
                "object".format(var)
            )

What I don't understand is why different validation is needed for the append scenario than for the write scenario. @shoyer worked on this in #5252, so maybe he has some ideas.

@dcherian removed the needs triage label on Mar 9, 2022
@kmsampson

I just ran into this today as well. I am trying to add a dimensionless variable to an existing Zarr store, to help with CF-compliance (if exporting to netCDF), and hit the same error. The dtype of my variable is '|S1', and the error message is printed below:

ValueError: Invalid dtype for data variable: <xarray.DataArray 'crs' ()> array(b'', dtype='|S1')
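For reference, a minimal sketch of the pattern described above (the store path and variable contents here are hypothetical, not from the actual code):

import numpy as np
import xarray as xr

# An existing store with some data.
xr.Dataset({"temperature": (("y", "x"), np.zeros((2, 3)))}).to_zarr(
    "existing.zarr", mode="w"
)

# Try to add a dimensionless 'crs' container variable (common for CF
# grid_mapping metadata) with dtype '|S1' to the same store.
crs = xr.DataArray(np.array(b"", dtype="|S1"), name="crs").to_dataset()
crs.to_zarr("existing.zarr", mode="a")  # raises ValueError: Invalid dtype for data variable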

@rabernat
Contributor

Thanks for reporting this @kmsampson. My feeling is that it is a bug...which we can hopefully fix pretty easily!

@shoyer
Member

shoyer commented Mar 11, 2022

The data type restriction here seems to date back to the original PR adding support for appending. I turned up this comment that seems to summarize the motivation for this check:
#2706 (comment)

I think the original issue was that appending fixed-width string data could be a problem if its width does not match the width of the existing string dtype stored in Zarr.

This obviously doesn't apply in this case, because you are adding an entirely new variable. So I guess the check could be removed in that case.
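For concreteness, a sketch (hypothetical store path and widths) of the width-mismatch scenario that appears to have motivated the check: the on-disk array stores '|S35', while the data being appended along a dimension arrives as '|S10'. Note that the current blanket check raises before any width comparison could happen, even when the widths do match.

import numpy as np
import xarray as xr

# On-disk array with fixed-width dtype '|S35'.
xr.DataArray(
    np.zeros((2, 3), dtype="|S35"), dims=("t", "x"), name="foo"
).to_dataset().to_zarr("strings.zarr", mode="w")

# Data to append along 't' has a different fixed width, '|S10'.
narrower = xr.DataArray(
    np.zeros((1, 3), dtype="|S10"), dims=("t", "x"), name="foo"
).to_dataset()

# With the current validation this raises ValueError: Invalid dtype
# for any fixed-width string, mismatched or not.
narrower.to_zarr("strings.zarr", mode="a", append_dim="t")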

@rabernat
Contributor

It seems like what we really want to do is verify that the dtype of the data being appended matches the dtype already on disk.

@cisaacstern
Contributor Author

cisaacstern commented Mar 11, 2022

So it looks like changing

xarray/xarray/backends/api.py

Lines 1280 to 1301 in d293f50

def _validate_datatypes_for_zarr_append(dataset):
    """DataArray.name and Dataset keys must be a string or None"""

    def check_dtype(var):
        if (
            not np.issubdtype(var.dtype, np.number)
            and not np.issubdtype(var.dtype, np.datetime64)
            and not np.issubdtype(var.dtype, np.bool_)
            and not coding.strings.is_unicode_dtype(var.dtype)
            and not var.dtype == object
        ):
            # and not re.match('^bytes[1-9]+$', var.dtype.name)):
            raise ValueError(
                "Invalid dtype for data variable: {} "
                "dtype must be a subtype of number, "
                "datetime, bool, a fixed sized string, "
                "a fixed size unicode string or an "
                "object".format(var)
            )

    for k in dataset.data_vars.values():
        check_dtype(k)

to

def _validate_datatypes_for_zarr_append(store, dataset):
    """DataArray.name and Dataset keys must be a string or None"""

    def check_dtype(vname):
        store_dtype = store.get_variables()[vname].dtype
        dataset_dtype = dataset[vname].dtype
        if not store_dtype == dataset_dtype:
            raise ValueError(
                f"Mismatched dtypes for variable {vname} between Zarr store on disk "
                f"and dataset to append. Store has dtype {store_dtype} but dataset to "
                f"append has dtype {dataset_dtype}."
            )

    for vname in dataset.data_vars:
        check_dtype(vname)

could work?
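If that signature change is adopted, the call site shown in the traceback above would presumably be updated to pass the opened store along, something like (sketch only, not a final patch):

# zstore is the ZarrStore returned by backends.ZarrStore.open_group above.
if mode in ["a", "r+"]:
    _validate_datatypes_for_zarr_append(zstore, dataset)

One detail this sketch glosses over: variables not yet present in the store (like the new-variable case above) would presumably need to be skipped, since store.get_variables() would not contain them.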
