zarr and xarray chunking compatibility and to_zarr performance (#2300)
Comments
I just pushed a new xarray release (0.10.8) earlier today. We had a fix for zarr chunking in there (#2228) -- does that solve your issue? |
Ah, that's great. I do see some improvement. Specifically, I can now set chunks using xarray, successfully write to zarr, and reopen it. However, when reopening it I find that the chunks have been inconsistently applied: some fields have the expected chunksize, whereas some small fields have the entire variable in one chunk. Furthermore, trying to write a second time with to_zarr fails. I also tried loading my entire dataset into memory, allowing the initial write to fall back to zarr's automatic chunking.

Curious: is there any downside in xarray to using datasets with inconsistent chunks? I take it that it is a supported configuration, because xarray allows it to happen and only raises that error when asked for a single chunks dict.

One other thing to add: it might be nice to have an option to allow zarr auto-chunking even when chunks are already defined on the dataset. |
No, there's no downside here. It's just not possible to define a single dict of chunks in this case. Can you look into the encoding of the affected variables? It would also help to come up with a self-contained example that reproduces this using dummy data. |
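A small inspection sketch for diagnosing this, assuming a dataset reopened with open_zarr (the store path is hypothetical):

import xarray as xr

ds = xr.open_zarr('test.zarr')  # hypothetical store standing in for the real dataset

# Compare each variable's dask chunking against the zarr chunk layout
# recorded in its encoding; a mix of chunked and single-chunk variables
# is what makes a single chunks dict impossible to define.
for name, var in ds.variables.items():
    print(name, var.chunks, var.encoding.get('chunks'))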
I took a closer look and noticed my one-dimensional fields of size 505359 were reporting a chunksize of 63170. Turns out that's enough to come up with a minimal repro:

>>> import numpy as np
>>> import xarray as xr
>>> xr.__version__
'0.10.8'
>>> ds=xr.Dataset({'foo': (['bar'], np.zeros((505359,)))})
>>> ds.to_zarr('test.zarr')
<xarray.backends.zarr.ZarrStore object at 0x7fd9680f7fd0>
>>> ds2=xr.open_zarr('test.zarr')
>>> ds2
<xarray.Dataset>
Dimensions: (bar: 505359)
Dimensions without coordinates: bar
Data variables:
foo (bar) float64 dask.array<shape=(505359,), chunksize=(63170,)>
>>> ds2.foo.encoding
{'chunks': (63170,), 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters':
None, '_FillValue': nan, 'dtype': dtype('float64')}
>>> ds2.to_zarr('test2.zarr')

The last call raises a NotImplementedError. |
Hi, I'm new to xarray & zarr. Why do I get a NotImplementedError? Do I have to use del dsread.data.encoding['chunks'] each time before using Dataset.to_zarr as a workaround? Probably I am missing something; I hope someone can point it out. I made a notebook here for reproducing the problem. Thanks for your help, regards, Tina |
I am getting the same error too. |
Hi all. I am looking into this issue, trying to figure out if it is still a thing. I just tried @chrisbarber's MRE above using xarray v0.15:

import numpy as np
import xarray as xr

ds = xr.Dataset({'foo': (['bar'], np.zeros((505359,)))})
ds.to_zarr('test.zarr', mode='w')
ds2 = xr.open_zarr('test.zarr')
ds2.to_zarr('test2.zarr', mode='w')

This now works without error, thanks to #2487. I can trigger the error in a third step:

ds3 = ds2.chunk({'bar': 10000})
ds3.to_zarr('test3.zarr', mode='w')

This raises the NotImplementedError. The problem is that, even though we rechunked the data, the encoding still carries the chunk size from the original store:
The problem is that, even though we rechunked the data, >>> print(ds3.foo.encoding)
{'chunks': (63170,), 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, '_FillValue': nan, 'dtype': dtype('float64')}

This was populated when the variable was read from the original store. As a workaround, you can delete the encoding (either just the chunks item, or the whole dict):

ds3.foo.encoding = {}
ds3.to_zarr('test3.zarr', mode='w')

This allows the operation to complete successfully. For all the users stuck on this problem (e.g. @abarciauskas-bgse), the same fix works in general, as sketched below:
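A minimal sketch of that workaround applied to every variable in a dataset (store paths hypothetical):

import xarray as xr

ds = xr.open_zarr('test.zarr').chunk({'bar': 10000})

# Drop the chunk layout remembered from the original store so that
# to_zarr derives the zarr chunks from the current dask chunks instead.
for var in ds.variables.values():
    var.encoding.pop('chunks', None)

ds.to_zarr('rechunked.zarr', mode='w')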
For xarray developers, the question is whether the chunks encoding should be dropped automatically when the data is rechunked. |
IMO this is the user-friendly thing to do. |
If there is a non-dimension coordinate, the error is also tickled.
|
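A hypothetical repro of that variant (untested sketch; names and sizes are made up): a non-dimension coordinate picks up a chunks encoding on the zarr round trip just like a data variable, so rechunking and re-writing hits the same error.

import numpy as np
import xarray as xr

# 'baz' is a non-dimension coordinate along dim 'bar'; it also picks up
# a chunks encoding when round-tripped through a zarr store.
ds = xr.Dataset(
    {'foo': (['bar'], np.zeros(100000))},
    coords={'baz': ('bar', np.arange(100000))},
)
ds.to_zarr('coord_test.zarr', mode='w')

ds2 = xr.open_zarr('coord_test.zarr').chunk({'bar': 1000})
ds2.to_zarr('coord_test2.zarr', mode='w')  # expected to hit the same NotImplementedError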
I arrived here due to a different use case / problem, which ultimately I solved, but I think there's value in documenting it here.

import contextlib
import shutil
import tempfile

import numpy as np
import xarray as xr

zarr_path = tempfile.mkdtemp()

def append_test(ds, chunks):
    # Repeatedly append single-frame slices to the same zarr store,
    # then reopen it and report the resulting chunking.
    shutil.rmtree(zarr_path)
    for i in range(21):
        d = ds.isel(frame=slice(i, i + 1))
        d = d.chunk(chunks)
        d.to_zarr(zarr_path, consolidated=True,
                  **(dict(mode='a', append_dim='frame') if i > 0 else {}))
    dsa = xr.open_zarr(str(zarr_path), consolidated=True)
    print(dsa.chunks, dsa.dims)

# sometime before 0.16.0
@contextlib.contextmanager
def change_determine_zarr_chunks(chunks):
    # Monkeypatch xarray's private chunk-resolution helper so the zarr
    # chunks come from our dict rather than from the dask chunking of
    # each appended slice.
    orig_determine_zarr_chunks = xr.backends.zarr._determine_zarr_chunks
    try:
        def new_determine_zarr_chunks(enc_chunks, var_chunks, ndim, name):
            da = ds[name]  # captures the module-level ds defined below
            zchunks = tuple(
                chunks[dim] if (dim in chunks and chunks[dim] is not None)
                else da.shape[i]
                for i, dim in enumerate(da.dims)
            )
            return zchunks
        xr.backends.zarr._determine_zarr_chunks = new_determine_zarr_chunks
        yield
    finally:
        xr.backends.zarr._determine_zarr_chunks = orig_determine_zarr_chunks

chunks = {'frame': 10, 'other': 50}
ds = xr.Dataset({'data': xr.DataArray(data=np.random.rand(100, 100),
                                      dims=('frame', 'other'))})

append_test(ds, chunks)  # baseline: zarr chunks follow the per-append dask chunks
with change_determine_zarr_chunks(chunks):
    append_test(ds, chunks)

# with 0.16.0
def append_test_encoding(ds, chunks):
    # Same append loop, but pass the desired zarr chunks explicitly via
    # the encoding argument on the first write; no monkeypatch needed.
    shutil.rmtree(zarr_path)
    encoding = {}
    for k, v in ds.variables.items():
        encoding[k] = {'chunks': tuple(chunks[dk] if dk in chunks else v.shape[i]
                                       for i, dk in enumerate(v.dims))}
    for i in range(21):
        d = ds.isel(frame=slice(i, i + 1))
        d = d.chunk(chunks)
        d.to_zarr(zarr_path, consolidated=True,
                  **(dict(mode='a', append_dim='frame') if i > 0
                     else dict(encoding=encoding)))
    dsa = xr.open_zarr(str(zarr_path), consolidated=True)
    print(dsa.chunks, dsa.dims)

append_test_encoding(ds, chunks)
|
Just ran into this issue myself and wanted to add a +1 to stripping the encoding when the data is rechunked. |
I think we are all in agreement. Just waiting for someone to make a PR. It's probably just a few lines of code changes. |
Alternatively, to_zarr could simply ignore the chunks encoding when it conflicts with the variable's dask chunks. |
I would not favor that. A user may choose to define their desired zarr chunks by putting this information in encoding. In this case, it's good to raise the error. (This is the case I had in mind when I wrote this code.) The problem here is that encoding is often being carried over from the original dataset and persisted across operations that change chunk size. |
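For contrast, a sketch of the legitimate use case described here, where the user deliberately sets the target zarr chunks via to_zarr's encoding argument (paths and sizes hypothetical):

import numpy as np
import xarray as xr

ds = xr.Dataset({'foo': (['bar'], np.zeros(505359))})

# Explicitly request the on-disk zarr chunking; if this ever conflicts
# with the variable's dask chunks, raising an error is more helpful
# than silently ignoring the request.
ds.to_zarr('explicit.zarr', mode='w', encoding={'foo': {'chunks': (10000,)}})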
In #5056, I have implemented the solution of deleting the stale chunks encoding. |
I have a situation where I build large zarr arrays based on chunks which correspond to how I am reading data off a filesystem, for best I/O performance. Then I set these as variables on an xarray dataset which I want to persist to zarr, but with different chunks more optimal for querying.
One problem I ran into is that manually selecting chunks of a dataset prior to to_zarr results in a NotImplementedError raised from xarray/backends/zarr.py (line 83 at commit 66be9c5).
It's difficult for me to understand exactly how to select chunks manually at the dataset level that would also satisfy this zarr "final chunk" constraint. I would have been satisfied, however, with letting zarr choose chunks for me, but I could not find a way to trigger this through the xarray API short of "unchunking" first, which would load entire variables into memory. I came up with the following hack to trigger zarr's automatic chunking despite having differently defined chunks on my xarray dataset:
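A sketch of one way to get this effect, assuming zarr v2's internal zarr.util.guess_chunks helper (not public API, and the original hack may have worked differently):

import numpy as np
import xarray as xr
from zarr.util import guess_chunks  # internal zarr v2 helper, not public API

ds = xr.Dataset({'foo': (['bar'], np.zeros(505359))})

# Ask zarr's own heuristic for a chunk shape per data variable and pass
# it through to_zarr's encoding argument; this assumes the variables are
# not dask-chunked in a conflicting way (rechunk or load them first).
encoding = {
    name: {'chunks': guess_chunks(var.shape, var.dtype.itemsize)}
    for name, var in ds.data_vars.items()
}
ds.to_zarr('auto.zarr', mode='w', encoding=encoding)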
The next problem to contend with is that da.store between zarr stores with differing chunks between source and destination is astronomically slow. The first thing to attempt would be to rechunk the dask arrays according to the destination zarr chunks, but xarray's consistent-chunks constraint blocks this strategy as far as I can tell. Once again I took the dirty-hack approach and injected a rechunking on a per-variable basis during the to_zarr operation, as follows:
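The hack described here apparently patched the write path; a user-level sketch in the same spirit (helper name hypothetical) rechunks each dask-backed variable to the zarr chunks recorded in its encoding before writing:

import xarray as xr

def rechunk_to_encoding(ds):
    # Align each dask-backed variable's chunks with the zarr chunks
    # recorded in its encoding, so the write proceeds chunk-for-chunk
    # instead of scattering small pieces across zarr chunks.
    for name in list(ds.variables):
        var = ds.variables[name]
        zchunks = var.encoding.get('chunks')
        if zchunks is not None and var.chunks is not None:
            ds[name] = ds[name].chunk(dict(zip(var.dims, zchunks)))
    return ds

# e.g. rechunk_to_encoding(xr.open_zarr('src.zarr')).to_zarr('dst.zarr', mode='w')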
I may have missed something in the API that would have made this easier, or another workaround which would be less hacky, but in any case I'm wondering if this scenario could be handled elegantly in xarray. I'm not sure if there is a plan going forward to make legal xarray chunks 100% compatible with zarr; if so, that would go a fair way toward alleviating the first problem. Alternatively, perhaps the xarray API could expose some ability to adjust chunks to zarr's liking, as well as the option of deferring entirely to zarr's heuristics for chunking.
As for the performance issue with differing chunks, I'm not sure whether my rechunking patch could be applied without causing side effects, or where the right place to solve this would be; perhaps it could be more naturally addressed within da.store.