When opening a Zarr store, the chunks='auto' kwarg seems to be ignored #276

etienneschalk · 2023-11-11T15:08:50Z

Hi,

I noticed a discrepancy between the behaviour of xarray's open_zarr and datatree's open_datatree with engine='zarr'.

I documented it in a pre-executed notebook available at https://github.com/etienneschalk/datatree-experimentation/blob/main/notebooks/bug-chunk-auto-not-considered.ipynb (the whole project can be cloned and executed locally if needed, it requires poetry)

To summarize:

Actual:

xarray's open_zarr
- No chunks kwarg: Stored chunks are used.
- With chunks='auto': Stored chunks are used.
datatree's open_datatree with engine='zarr'
- No chunks kwarg: No chunking performed.
- With chunks='auto': A chunk identical to the shape of the data is used. This means chunking is useless as there is only a single chunk representing the whole dataset

Expected:

I expected datatree to behave like xarray. Since Zarr is a format that natively handles chunks, I would have expected the stored chunks to be used when opening a Zarr store with no chunks kwarg or with chunks='auto'.

Thanks!


TomNicholas · 2023-11-13T17:45:20Z

Thanks for raising this @etienneschalk !

open_datatree internally calls xarray.open_dataset (not xarray.open_zarr) but there are currently some differences between xarray.open_dataset and xarray.open_zarr. Does the behaviour of open_datatree differ from xarray.open_dataset?

Regardless, the behaviour you describe does sound desirable, so we should fix that somewhere in the stack.

cc @jhamman

eschalkargans · 2023-11-22T16:41:51Z

Hello @TomNicholas ,

Does the behaviour of open_datatree differ from xarray.open_dataset?

After testing, the behaviour of open_datatree is indeed identical to xarray's open_dataset:

xarray's open_dataset
- No chunks kwarg: No chunking is performed. ~~Stored chunks are used.~~
- With chunks='auto': A chunk identical to the shape of the data is used. This means chunking is useless as there is only a single chunk representing the whole dataset ~~Stored chunks are used.~~

which is consistent with your statement:

open_datatree internally calls xarray.open_dataset

In that case, I would suggest the following:

- Keep the behaviour of open_datatree, as it uses and belongs to the same family as open_dataset (open_{xarray's data structure} syntax).
- ▶️ Add a new datatree.open_zarr() function with the same behaviour as xarray.open_zarr, maybe using it internally too, and update the documentation of open_datatree to nudge users towards open_zarr when they want to read Zarr.

Do you think this would be a good idea?

Thanks!

TomNicholas · 2023-11-27T17:02:07Z

Thank you for testing!

Again, this is an upstream xarray issue. Datatree should follow whatever xarray's behaviour is.

Add a new datatree.open_zarr() function, with the same behaviour as xarray.open_zarr, maybe using it internally too.

This might be a good idea, but xarray currently has both open_zarr and open_dataset, and there is an unresolved discussion about whether to get rid of one in favour of the other...

etienneschalk · 2023-12-02T10:51:07Z

Hi @TomNicholas

This might be a good idea, but xarray currently has both open_zarr and open_dataset, and there is an unresolved discussion about whether to get rid of one in favour of the other...

So, if I understand correctly, while this discussion is not settled, implementing an open_zarr in datatree might be a waste of effort, since open_zarr could end up being merged into open_dataset. However, if that happens in upstream xarray, the correct behaviour of open_zarr should be kept, not that of the existing open_dataset, which does not handle chunks properly.

Do you have a link to this discussion, by any chance? I would be interested to learn more about this.

Thanks, have a nice day!

keewis · 2023-12-12T14:33:16Z

The difference between open_zarr and open_dataset is that for open_zarr, "auto" (the default) translates to {} if a chunk manager is available (like dask or cubed) or None otherwise, and that value is then forwarded to open_dataset. For open_dataset, the default is None (no chunks), while {} means the same as for open_zarr (use the engine's preferred chunks, which for Zarr are generally the on-disk chunks) and "auto" is dask's auto-chunking (see dask.array.Array.rechunk for more details). So in summary, open_zarr is a wrapper of open_dataset, with a different default for chunks and a different meaning for "auto".

I believe the whole "auto" is actually {} for open_zarr is just confusing, so maybe we should aim to harmonize this in xarray (like, switch the default immediately and emit a deprecation warning if "auto" is passed).

Edit: this means that to get the on-disk chunking you can use open_datatree(..., chunks={})
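
The translation described above can be modelled roughly as follows. This is only a sketch of the described behaviour, not xarray's actual code: `resolve_open_zarr_chunks`, `describe_open_dataset_chunks` and the `has_chunk_manager` flag are hypothetical names made up for illustration.

```python
from typing import Union

ChunksArg = Union[str, dict, None]


def resolve_open_zarr_chunks(
    chunks: ChunksArg = "auto", has_chunk_manager: bool = True
) -> ChunksArg:
    """Model the value open_zarr is described to forward to open_dataset.

    Hypothetical helper: "auto" becomes {} when a chunk manager
    (e.g. dask or cubed) is available, and None otherwise.
    Any other value is forwarded unchanged.
    """
    if chunks == "auto":
        return {} if has_chunk_manager else None
    return chunks


def describe_open_dataset_chunks(chunks: ChunksArg = None) -> str:
    """Summarise what each `chunks` value is described to mean for open_dataset."""
    if chunks is None:
        return "no chunking"
    if chunks == {}:
        return "on-disk chunks"
    if chunks == "auto":
        return "dask auto-chunking"
    return "user-specified chunks"


# open_zarr's default ends up as chunks={} in open_dataset ...
forwarded = resolve_open_zarr_chunks()
print(describe_open_dataset_chunks(forwarded))
# ... whereas passing "auto" to open_dataset directly means something else.
print(describe_open_dataset_chunks("auto"))
```

Seen this way, the same string "auto" takes two different paths depending on the entry point, which is the source of the confusion.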

TomNicholas · 2023-12-12T14:48:37Z

Thanks @keewis .

I believe the whole "auto" is actually {} for open_zarr is just confusing, so maybe we should aim to harmonize this in xarray (like, switch the default immediately and emit a deprecation warning if "auto" is passed).

100%. This kind of thing really trips up users. Do we have an open issue for that in xarray or should we make one now?

keewis · 2023-12-12T14:52:33Z

I think the "deprecate open_zarr" issue should be fine to reuse for this: pydata/xarray#7495

etienneschalk · 2024-02-09T18:31:49Z

Thanks for the chunks={} tip! This is indeed the behaviour I expected.

This really matters when opening large chunked Zarr data with datatree, so that the original chunks are kept.

I updated my test notebook: https://github.com/etienneschalk/datatree-experimentation/blob/main/notebooks/bug-chunk-auto-not-considered.ipynb section "With chunks={} kwarg 🆗"

eni-awowale · 2024-09-16T20:58:38Z

I tested this locally with a different example and got the same results. Here is a reproducible example:

import xarray as xr
from xarray.backends import api  # assumed origin of the `api` name used below

# Test data
set1_data = xr.Dataset({"a": 0, "b": 1})
set2_data = xr.Dataset({"a": ("x", [2, 3]), "b": ("x", [0.1, 0.2])})
root_data = xr.Dataset({"a": ("y", [6, 7, 8]), "set0": ("x", [9, 10])})

# Write the root group first, then append each child group to the same store
# (same relative path as the one read back below)
root_data.to_zarr('./data_samples/simple_datatree_aligned.zarr')
set1_data.to_zarr('./data_samples/simple_datatree_aligned.zarr', group='set1', mode='a')
set2_data.to_zarr('./data_samples/simple_datatree_aligned.zarr', group='set2', mode='a')

In [34]: api.open_dataset('./data_samples/simple_datatree_aligned.zarr', engine='zarr', chunks={}).chunksizes
Out[34]: Frozen({'y': (3,), 'x': (2,)})

In [34]: api.open_dataset('./data_samples/simple_datatree_aligned.zarr', engine='zarr', chunks='auto').chunksizes
Out[34]: Frozen({'y': (3,), 'x': (2,)})

In [34]: api.open_dataset('./data_samples/simple_datatree_aligned.zarr', engine='zarr').chunksizes
Out[34]: Frozen({})

If this is the expected result can we close this issue?

TomNicholas · 2024-09-16T23:24:14Z

Thanks for looking into this rabbit hole @eni-awowale ! I think we're getting a bit off track here though.

- From now on we should only bother discussing issues that exist for the upstream xarray.DataTree implementation.
- We have not even implemented xr.DataTree support in xr.open_zarr yet - it will currently always return a Dataset (we could potentially make an upstream issue to track this).
- That means by definition the original issue raised here does not exist for upstream xarray (yet).
- Justus' comment explains the difference in behaviour between chunks={}/None/'auto' in open_zarr vs open_dataset anyway, and that is already known about and tracked upstream.
- Nevertheless we do separately want to ensure that xr.open_dataset(..., chunks=X) behaves the same way as xr.open_datatree(..., chunks=X). That should already be the case, but if it's not then that deserves a new issue upstream.
- I actually can't even run your example locally @eni-awowale without encountering an error in .to_zarr. But that seems orthogonal to everything else here.

I suggest we close this issue and as we notice problems we raise dedicated new issues on the upstream repo.

TomNicholas added the "IO" label (Representation of particular file formats as trees) on Nov 13, 2023

TomNicholas closed this as completed Sep 16, 2024
