Inconclusive error messages using to_zarr with regions #5290

niowniow · 2021-05-11T15:54:39Z

What happened:
The idea is to use a xarray dataset (stored as dummy zarr file), which is subsequently filled with the region argument, as explained in the documentation. Ideally, almost nothing is stored to disk upfront.

It seems the current implementation is only designed to either store coordinates for the whole dataset and write them to disk or to write without coordinates. I failed to understand this from the documentation and tried to create a dataset without coordinates and fill it with a dataset subset with coordinates. It gave some inconclusive errors depending on the actual code example (see below).
ValueError: parameter 'value': expected array with shape (0,), got (10,) or ValueError: conflicting sizes for dimension 'x': length 10 on 'x' and length 30 on 'foo'

It might also be a bug and it should in fact be possible to add a dataset with coordinates to a dummy dataset without coordinates. Then there seems to be an issue regarding the handling of the variables during storing the region.

... or I might just have done it wrong... and I'm looking forward to suggestions.

What you expected to happen:

Either an error message telling me that that i should use coordinates during creation of the dummy dataset. Alternatively, if this is a bug and should be possible then it should just work.

Minimal Complete Verifiable Example:

import dask.array
import xarray as xr
import numpy as np

error = 1 # choose between 0 (no error), 1, 2, 3

dummies = dask.array.zeros(30, chunks=10)

# chunks in coords are not taken into account while saving!?
coord_x = dask.array.zeros(30, chunks=10) # or coord_x = np.zeros((30,))
if error == 0:
    ds = xr.Dataset({"foo": ("x", dummies)}, coords={"x":coord_x})
else:
    ds = xr.Dataset({"foo": ("x", dummies)})

print(ds)
path = "./tmp/test.zarr"
ds.to_zarr(path, mode='w', compute=False, consolidated=True)

# create a new dataset to be input into a region
ds = xr.Dataset({"foo": ('x', np.arange(10))},coords={"x":np.arange(10)})

if error == 1:
    ds.to_zarr(path, region={"x": slice(10, 20)})
    # ValueError: parameter 'value': expected array with shape (0,), got (10,)
elif error == 2:
    ds.to_zarr(path, region={"x": slice(0, 10)})
    ds.to_zarr(path, region={"x": slice(10, 20)})
    # ValueError: conflicting sizes for dimension 'x': length 10 on 'x' and length 30 on 'foo'
elif error == 3:
    ds.to_zarr(path, region={"x": slice(0, 10)})
    ds = xr.Dataset({"foo": ('x', np.arange(10))},coords={"x":np.arange(10)})
    ds.to_zarr(path, region={"x": slice(10, 20)})
    # ValueError: parameter 'value': expected array with shape (0,), got (10,)
else:
    ds.to_zarr(path, region={"x": slice(10, 20)})

ds = xr.open_zarr(path)
print('reopen',ds['x'])

Anything else we need to know?:

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.8.6 | packaged by conda-forge | (default, Oct 7 2020, 19:08:05)
[GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 4.19.0-16-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: en_US.UTF-8
libhdf5: None
libnetcdf: None

xarray: 0.18.0
pandas: 1.2.3
numpy: 1.19.2
scipy: 1.6.2
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.8.1
cftime: 1.4.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2021.04.0
distributed: None
matplotlib: 3.4.1
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 49.6.0.post20210108
pip: 21.0.1
conda: None
pytest: None
IPython: None
sphinx: None

The text was updated successfully, but these errors were encountered:

shoyer · 2021-05-11T19:45:01Z

Hi @niowniow thanks for the feedback and code example here. I've been refactoring the Zarr region re-write functionality in #5252 so your feedback is timely.

It might be worth trying the code in that PR to see if it changes anything, but in the best case I suspect it would just give you a different error message.

To clarify:

Writing to a region requires that all the desired variables already exist in the Zarr store (e.g., via to_zarr() with compute=False). The main point of writing to a region is that it should be safe to do in parallel, which is not the case for creating new variables. So I think encountering errors in this case is expected, although I agree these error messages are not very informative!
"chunks in coords are not taken into account while saving!?" is correct, but this is actually more a limitation of Xarray's current data model. Coords with the same name as their dimension are converted into an in-memory pandas.Index object -- they can't be stored as dask arrays. You actually still write the data in chunks, but you have to do it by supplying the encoding parameter to to_zarr, e.g., ds.to_zarr(path, mode='w', encoding={'x': {'chunks': 10}}, compute=False, consolidated=True).

If you have any specific suggestions for where the docs might be clarified those would certainly be appreciated!

niowniow · 2021-05-12T13:08:51Z

Thanks a lot! Very helpful comments. I'll check out your PR. If i understand it correct, zarr does some autochunking while saving coordinates even without setting specific encodings, at least for bigger coordinate arrays. I can get what I want by creating a zarr store with compute=False then deleting everything except the metadata manually on the filesystem level. Then each call to_zarr() with region results in only one coordinate chunk being created on disk. Reading with xr.open_zarr() works as expected: the coordinate contains nan except for the region written before. The (potentially very large) coordinate still needs to fit in memory though... either when creating the dummy zarr store (which could be done differently) or when opening it. Is that correct? That wont work for my use case when the coordinate is very large. Do you know an alternative? Would it help if I store the coordinate with a non-dimension name? i guess it all boils down to the way xarray recreates the Dataset from zarr store. The only way I can think of right know to make useful "chunked indices" are some form of hierachical indexing. Each chunk is represented by the first index in that chunk. Which would probably only work for sequential indices. I dont know if such indexing exists for pandas. Maybe a hierachical chunking could be useful for some very large datasets!? I dont know if that would create too much overhead but it would be a structured way to access long-term high-res data. In a way I think thats what I'm trying to implement. I would be happy about any pointers to existing solutions. Regarding the documentation: I could provide an example with a time coordinate, which would illustrate two issues I encountered. * region requires index space coordinates (I know: it's already explained in the docs... :) * the before mentioned "coordinates need to be predefined" issue. (Sorry if this bug report is not the right place to ask all these questions)

shoyer · 2021-05-21T02:22:15Z

The (potentially very large) coordinate still needs to fit in memory though... either when creating the dummy zarr store (which could be done differently) or when opening it. Is that correct?

This is correct

That wont work for my use case when the coordinate is very large. Do you know an alternative? Would it help if I store the coordinate with a non-dimension name?

Yes, this would work.

We do probably want to change this behavior in the future as part of the changes related to #1603, e.g., to support out of core indexing. See also https://github.com/pydata/xarray/blob/master/design_notes/flexible_indexes_notes.md

max-sixty · 2023-10-24T23:11:50Z

Aside from the big out-of-core indexing feature in #1603, is there anything left to do here or should we close?

dcherian added the topic-zarr Related to zarr storage library label Apr 9, 2022

max-sixty mentioned this issue Oct 3, 2023

Improve error messages #8264

Open

max-sixty added the plan to close May be closeable, needs more eyeballs label Oct 24, 2023

max-sixty closed this as completed Nov 5, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inconclusive error messages using to_zarr with regions #5290

Inconclusive error messages using to_zarr with regions #5290

niowniow commented May 11, 2021

INSTALLED VERSIONS

shoyer commented May 11, 2021 •

edited

Loading

niowniow commented May 12, 2021 via email •

edited

Loading

shoyer commented May 21, 2021

max-sixty commented Oct 24, 2023

Inconclusive error messages using to_zarr with regions #5290

Inconclusive error messages using to_zarr with regions #5290

Comments

niowniow commented May 11, 2021

INSTALLED VERSIONS

shoyer commented May 11, 2021 • edited Loading

niowniow commented May 12, 2021 via email • edited Loading

shoyer commented May 21, 2021

max-sixty commented Oct 24, 2023

shoyer commented May 11, 2021 •

edited

Loading

niowniow commented May 12, 2021 via email •

edited

Loading