Skip to content
This repository was archived by the owner on Oct 24, 2024. It is now read-only.

When opening a Zarr store, the chunks='auto'` kwarg seems to be ignored #276

Closed
etienneschalk opened this issue Nov 11, 2023 · 10 comments
Closed
Labels
IO Representation of particular file formats as trees

Comments

@etienneschalk
Copy link
Contributor

Hi,

I noticed a discrepency between the behaviour of xarray's open_zarr and datatree's open_datatree with engine='zarr'.

I documented it in a pre-executed notebook available at https://github.com/etienneschalk/datatree-experimentation/blob/main/notebooks/bug-chunk-auto-not-considered.ipynb (the whole project can be cloned and executed locally if needed, it requires poetry)

To summarize:

Actual:

  • xarray's open_zarr
    • No chunks kwarg: Stored chunks are used.
    • With chunks='auto': Stored chunks are used.
  • datatree's open_datatree with engine='zarr'
    • No chunks kwarg: No chunking performed.
    • With chunks='auto': A chunk identical to the shape of the data is used. This means chunking is useless as there is only a single chunk representing the whole dataset

Expected:

I expected a similar behaviour from datatree as the one from xarray. Since Zarr is format that natively handle chunks, I would have expected that when opening a Zarr store with no chunks kwarg or chunks='auto', the stored chunks were to be used.

Thanks!

@TomNicholas TomNicholas added the IO Representation of particular file formats as trees label Nov 13, 2023
@TomNicholas
Copy link
Member

Thanks for raising this @etienneschalk !

open_datatree internally calls xarray.open_dataset (not xarray.open_zarr) but there are currently some differences between xarray.open_dataset and xarray.open_zarr. Does the behaviour of open_datatree differ from xarray.open_dataset?

Regardless, the behaviour you describe does sound desirable, so we should fix that somewhere in the stack.

cc @jhamman

@eschalkargans
Copy link

Hello @TomNicholas ,

Does the behaviour of open_datatree differ from xarray.open_dataset?

After testing, the behaviour of open_datatree is indeed identical to xarray's open_dataset:

  • xarray's open_dataset
    • No chunks kwarg: No chunking is performed. Stored chunks are used.
    • With chunks='auto': A chunk identical to the shape of the data is used. This means chunking is useless as there is only a single chunk representing the whole dataset Stored chunks are used.

which is consistent with your statement:

open_datatree internally calls xarray.open_dataset

In that case, I would suggest to:

  • Keep the behaviour of open_datatree, as it uses and is of the same family as open_dataset (open_{xarray's data structure} syntax)
  • ▶️ Add a new datatree.open_zarr() function, with the same behaviour as xarray.open_zarr, maybe using it internally too. And updating the documentation of open_datatree to nudge users into using open_zarr instead if they want to use zarr

Do you think this would be a good idea?

Thanks!

@TomNicholas
Copy link
Member

Thank you for testing!

Again, this is an upstream xarray issue. Datatree should follow whatever xarray's behaviour is.

Add a new datatree.open_zarr() function, with the same behaviour as xarray.open_zarr, maybe using it internally too.

This might be a good idea, but xarray currently has both open_zarr and open_dataset, and there is an unresolved discussion about whether to get rid of one in favour of the other...

@etienneschalk
Copy link
Contributor Author

Hi @TomNicholas

This might be a good idea, but xarray currently has both open_zarr and open_dataset, and there is an unresolved discussion about whether to get rid of one in favour of the other...

So, this means, while this discussion is not settled, implementing an open_zarr in datatree might be a waste of effort, if I understand correctly, in the case where open_zarr would be integrated into the open_dataset. However, if this happens in upstream xarray, the correct behaviour of open_zarr should be kept, not the one of the existing open_dataset that does not handle chunks properly.

Do you have a link to this discussion, by any chance? I would be interested to learn more about this.

Thanks, have a nice day!

@keewis
Copy link
Contributor

keewis commented Dec 12, 2023

the difference between open_zarr and open_dataset is that for open_zarr "auto" (the default) translates to {} if a chunk manager is available (like dask or cubed) or None otherwise, which are then forwarded to open_dataset. For open_dataset, the default is None (no chunks), while {} is the same as for open_zarr and "auto" is dask's auto-chunking (see dask.array.Array.rechunk for more details). So in summary, open_zarr is a wrapper of open_dataset, with a different default for chunks and a different meaning for "auto".

I believe the whole "auto" is actually {} for open_zarr is just confusing, so maybe we should aim to harmonize this in xarray (like, switch the default immediately and emit a deprecation warning if "auto" is passed).

Edit: this means that to get the on-disk chunking you can use open_datatree(..., chunks={})

@TomNicholas
Copy link
Member

Thanks @keewis .

I believe the whole "auto" is actually {} for open_zarr is just confusing, so maybe we should aim to harmonize this in xarray (like, switch the default immediately and emit a deprecation warning if "auto" is passed).

100%. This kind of thing really trips up users. Do we have an open issue for that in xarray or should we make one now?

@keewis
Copy link
Contributor

keewis commented Dec 12, 2023

I think the "deprecate open_zarr" issue should be fine to reuse for this: pydata/xarray#7495

@etienneschalk
Copy link
Contributor Author

Thanks for the chunks={} tip! This is indeed the behaviour I expected.

This is really important when trying to open chunked large Zarr data with datatree to keep the original chunks.

I updated my test notebook: https://github.com/etienneschalk/datatree-experimentation/blob/main/notebooks/bug-chunk-auto-not-considered.ipynb section "With chunks={} kwarg 🆗"

@eni-awowale
Copy link

I tested this locally with a different example and got the same results. Here is reproducible example:

# Test data
set1_data = xr.Dataset({"a": 0, "b": 1})
set2_data = xr.Dataset({"a": ("x", [2, 3]), "b": ("x", [0.1, 0.2])})
root_data = xr.Dataset({"a": ("y", [6, 7, 8]), "set0": ("x", [9, 10])})

# Write to zarr
root_data.to_zarr('/data_samples/simple_datatree_aligned.zarr')
set1_data.to_zarr('/data_samples/simple_datatree_aligned.zarr', group='set1', mode='a')
set2_data.to_zarr('/data_samples/simple_datatree_aligned.zarr', group='set2', mode='a')
In [34]: api.open_dataset('./data_samples/simple_datatree_aligned.zarr', engine='zarr', chunks={}).chunksizes
Out[34]: Frozen({'y': (3,), 'x': (2,)})
In [34]: api.open_dataset('./data_samples/simple_datatree_aligned.zarr', engine='zarr', chunks='auto').chunksizes
Out[34]: Frozen({'y': (3,), 'x': (2,)})
In [34]: api.open_dataset('./data_samples/simple_datatree_aligned.zarr', engine='zarr').chunksizes
Out[34]: Frozen({})

If this is the expected result can we close this issue?

@TomNicholas
Copy link
Member

TomNicholas commented Sep 16, 2024

Thanks for looking into this rabbit hole @eni-awowale ! I think we're getting a bit off track here though.

  1. From now on we should only bother discussing issues that exist for the upstream xarray.DataTree implementation.
  2. We have not even implemented xr.DataTree support in xr.open_zarr yet - it will currently always return a Dataset (we could potentially make an upstream issue to track this).
  3. That means by definition the original issue raised here does not exist for upstream xarray (yet).
  4. Justus' comment explains the difference in behaviour between chunks={}/None/'auto' in open_zarr vs open_dataset anyway, and that is already known about and tracked upstream.
  5. Nevertheless we do separately want to ensure that xr.open_dataset(..., chunks=X) behaves the same way as xr.open_datatree(..., chunks=X). That should already be the case, but if its not then that deserves a new issue upstream.
  6. I actually can't even run your example locally @eni-awowale without encountering an error in .to_zarr. But that seems orthogonal to everything else here.

I suggest we close this issue and as we notice problems we raise dedicated new issues on the upstream repo.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
IO Representation of particular file formats as trees
Projects
None yet
Development

No branches or pull requests

5 participants