Allow "unsafe" mode for zarr writing #5056

rabernat · 2021-03-19T21:57:47Z

Currently, Dataset.to_zarr will only write Zarr datasets in cases in which:

The Dataset arrays are in memory (no dask)
The arrays are chunked with dask with a one-to-many relationship between dask chunks and zarr chunks

If I try to violate the one-to-many condition, I get an error

import xarray as xr
ds = xr.DataArray([0, 1., 2], name='foo').chunk({'dim_0': 1}).to_dataset()
d = ds.to_zarr('test.zarr', encoding={'foo': {'chunks': (3,)}}, compute=False)

/srv/conda/envs/notebook/lib/python3.8/site-packages/xarray/backends/zarr.py in _determine_zarr_chunks(enc_chunks, var_chunks, ndim, name)
    148             for dchunk in dchunks[:-1]:
    149                 if dchunk % zchunk:
--> 150                     raise NotImplementedError(
    151                         f"Specified zarr chunks encoding['chunks']={enc_chunks_tuple!r} for "
    152                         f"variable named {name!r} would overlap multiple dask chunks {var_chunks!r}. "

NotImplementedError: Specified zarr chunks encoding['chunks']=(3,) for variable named 'foo' would overlap multiple dask chunks ((1, 1, 1),). This is not implemented in xarray yet. Consider either rechunking using `chunk()` or instead deleting or modifying `encoding['chunks']`.

In this case, the error is particularly frustrating because I'm not even writing any data yet. (Also related to #2300, #4046, #4380).

There are at least two scenarios in which we might want to have more flexibility.

1. The case above, when we want to lazily initialize a Zarr array based on a Dataset, without actually computing anything.
2. The more general case, where we actually write arrays with many-to-many dask-chunk <-> zarr-chunk relationships.

For 1, I propose we add a new option like safe_chunks=True to to_zarr. Setting safe_chunks=False would simply bypass this check.
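To make the proposal concrete, here is a minimal sketch of how such a flag could gate the check visible in the traceback above. This is not xarray's actual code: the function name `validate_chunks` is hypothetical, it handles a single dimension only, and it reproduces just the divisibility test shown in the traceback.

```python
def validate_chunks(dask_chunks, zarr_chunk, safe_chunks=True):
    """Hypothetical one-dimensional version of the chunk check.

    dask_chunks: tuple of dask chunk sizes along one dimension, e.g. (1, 1, 1)
    zarr_chunk:  requested zarr chunk size along that dimension, e.g. 3
    """
    if safe_chunks:
        # Every dask chunk except the last must be a whole multiple of the
        # zarr chunk size; otherwise some zarr chunk would be written to by
        # more than one dask chunk.
        for dchunk in dask_chunks[:-1]:
            if dchunk % zarr_chunk:
                raise NotImplementedError(
                    f"zarr chunk size {zarr_chunk!r} would overlap multiple "
                    f"dask chunks {dask_chunks!r}"
                )
    # With safe_chunks=False the check is skipped and the caller takes
    # responsibility for avoiding conflicting writes.
    return zarr_chunk


# The example from this issue passes only when the check is bypassed.
validate_chunks((1, 1, 1), 3, safe_chunks=False)
```

With the default `safe_chunks=True`, the same call raises NotImplementedError, which matches the behavior reported above.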

For 2, we could consider implementing locks. This probably has to be done at the Dask level. But it is actually not super hard to deterministically figure out which chunks need to share a lock.
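As a rough illustration of that last point, the following sketch works out, for one dimension, which zarr chunks are touched by more than one dask chunk and would therefore need a shared lock. The function name `shared_zarr_chunks` is hypothetical, and how the locks would actually be attached to Dask tasks is left open.

```python
from collections import defaultdict


def shared_zarr_chunks(dask_chunks, zarr_chunk):
    """Map each contested zarr chunk index to the dask chunks writing to it.

    dask_chunks: tuple of dask chunk sizes along one dimension
    zarr_chunk:  zarr chunk size along that dimension
    """
    writers = defaultdict(list)
    start = 0
    for dask_index, size in enumerate(dask_chunks):
        stop = start + size
        if size > 0:
            # Indices of the first and last zarr chunk covered by the
            # half-open element range [start, stop).
            first = start // zarr_chunk
            last = (stop - 1) // zarr_chunk
            for zarr_index in range(first, last + 1):
                writers[zarr_index].append(dask_index)
        start = stop
    # Only zarr chunks with more than one writer need a lock.
    return {z: d for z, d in writers.items() if len(d) > 1}


print(shared_zarr_chunks((1, 1, 1), 3))  # {0: [0, 1, 2]}
```

For the example in this issue, all three dask chunks write into zarr chunk 0, so they would share one lock. When the dask chunks align with the zarr chunks, the result is empty and no locking is needed.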

shoyer · 2021-03-24T19:29:13Z

These both sound fine to me.

So far, I've been happy working around (1) by constructing synthetic dask arrays with the desired final chunks. I suspect that's even pretty efficient on the dask side, as long as everything uses Dask's HighLevelGraph for representing the underlying tasks.

This was referenced Mar 21, 2021

Cache input metadata pangeo-forge/pangeo-forge-recipes#78

Merged

Zarr chunking fixes #5065

Merged

zarr and xarray chunking compatibility and to_zarr performance #2300

Closed

dcherian added the topic-zarr Related to zarr storage library label Apr 24, 2021

dcherian closed this as completed in #5065 Apr 26, 2021

rabernat mentioned this issue Apr 28, 2021

Zarr encoding attributes persist after slicing data, raising error on to_zarr #5219

Open

eric-czech mentioned this issue May 10, 2021

Zarr chunks would overlap multiple dask chunks error #5286

Closed
