phony_dims must be specified when opening HDF5 file without dimension scales #10049

Closed
asteiker opened this issue Feb 14, 2025 · 5 comments · Fixed by #10058
Labels
topic-backends topic-DataTree Related to the implementation of a DataTree class

Comments

@asteiker

What is your issue?

Hello, I am in the xarray.DataTree() fan club(!) and have started updating existing tutorials to work with NASA ICESat-2 HDF5 data in xarray, transitioning from earlier xarray.open_dataset() guidance that only allowed a single group to be specified.

I was hoping ICESat-2 files would just open out of the box with datatree now, but I get an error unless I specify phony_dims:

dt = xr.open_datatree(file, phony_dims='sort')
(see the full notebook here, or you could download an example file here)

@eni-awowale provided some helpful guidance, and it sounds like this may be an interoperability issue between the HDF5 and netcdf-c libraries.

Since xarray can already detect phony_dims, we were wondering if this could be a reasonable addition to the h5netcdf backend engine. This could greatly streamline HDF5 users' workflow so that files open out of the box without needing to specify additional kwargs.

@asteiker added the needs triage label Feb 14, 2025

welcome bot commented Feb 14, 2025

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

@kmuehlbauer
Contributor

@asteiker Thanks for bringing this to attention.

When handling of phony_dims was implemented in h5netcdf to cover for unassigned dimension scales (as netcdf-c/netCDF4-python does), there were two choices.

  • sort - iterate over all groups and assign phony_dims when the file is opened. This is in line with netcdf-c/netCDF4-python and creates equivalent dimension names.
  • access - assign phony_dims only when a particular group is actually accessed.

The latter has some real performance improvements (depending on what and how data is read), so the decision was to let the user decide which implementation they want to use.
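
To make the two choices concrete at the h5netcdf level, here is a rough sketch that opens the same file once per mode. It assumes h5py and h5netcdf are installed; the file and its names (`orbit`, `gt1l/heights`) are hypothetical and created only for this example.

```python
import os
import tempfile

import h5netcdf
import h5py
import numpy as np

# Write a small HDF5 file whose datasets have no dimension scales.
# The layout is invented for this sketch.
path = os.path.join(tempfile.mkdtemp(), "no_scales.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("orbit", data=np.arange(3))
    f.create_dataset("gt1l/heights", data=np.linspace(0.0, 1.0, 5))

# Open the file once per mode and record the generated dimension names.
names = {}
for mode in ("sort", "access"):
    with h5netcdf.File(path, "r", phony_dims=mode) as f:
        names[mode] = f["gt1l"].variables["heights"].dimensions
    print(mode, names[mode])
```

Both modes produce `phony_dim_N` names. The numbering may differ between them, since only "sort" aims to match what netcdf-c would generate.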

I'm not sure whether this performance gain still holds when reading the whole DataTree (instead of only a single group). So we might think about setting phony_dims="sort" or phony_dims="access" by default in open_datatree for the h5netcdf backend, depending on feedback. I'd gladly review a corresponding PR.

@kmuehlbauer added the topic-backends label and removed the needs triage label Feb 14, 2025
@TomNicholas added the topic-DataTree label Feb 14, 2025
@asteiker
Author

Thank you, @kmuehlbauer. I didn't quite understand the implications of the sort vs access option. @flamingbear @andypbarrett I thought you might be interested in this as well. I do not have the background to submit a PR on my own, but I could certainly help organize an effort if others think this would be valuable to pursue.

@kmuehlbauer
Contributor

@asteiker No worries.

The main thing is that h5netcdf tries to be smart when opening a file. It normally opens only the root group and keeps sub-level groups lazy. When the user first accesses a sub-level group, the needed objects are created. For phony_dims there are two options. phony_dims="access" means it will generate the needed phony_dim_N (N-numbered) dimensions when the user program first accesses a particular group. phony_dims="sort" instead iterates over all groups when the file is opened and creates the needed phony_dim_N dimensions. N is incremented starting from 0 in both cases.
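
The difference between the two modes can be illustrated with a toy model in plain Python. This is not h5netcdf's implementation: `ToyFile`, its attributes, and the rule of one phony dimension per axis from a single shared counter are all assumptions made for the sketch. It only shows when groups get initialized in each mode.

```python
class ToyFile:
    """Toy model of eager ("sort") versus lazy ("access") phony dimensions."""

    def __init__(self, groups, mode):
        if mode not in ("sort", "access"):
            raise ValueError(f"unknown mode: {mode!r}")
        self._groups = groups  # group path -> axis lengths lacking a scale
        self._dims = {}        # group path -> {phony name: length}
        self._next_id = 0      # shared counter used for phony_dim_N
        self.visited = []      # order in which groups were initialized
        if mode == "sort":
            # Eager: walk every group once while "opening" the file.
            for path in groups:
                self.dims(path)

    def dims(self, path):
        """Return one group's phony dimensions, creating them on first use."""
        if path not in self._dims:
            self.visited.append(path)
            named = {}
            for length in self._groups[path]:
                named[f"phony_dim_{self._next_id}"] = length
                self._next_id += 1
            self._dims[path] = named
        return self._dims[path]


# Hypothetical layout: three groups with unlabeled axes.
layout = {"/": [10], "/gt1l": [100, 5], "/gt2l": [200]}

eager = ToyFile(layout, mode="sort")
lazy = ToyFile(layout, mode="access")

print(eager.visited)        # ['/', '/gt1l', '/gt2l']
print(lazy.visited)         # []
print(lazy.dims("/gt2l"))   # {'phony_dim_0': 200}
print(lazy.visited)         # ['/gt2l']
print(eager.dims("/gt2l"))  # {'phony_dim_3': 200}
```

In this toy, the lazy mode touches only the group that was requested, which is where the performance gain comes from. The same axis can end up with a different `N` depending on the mode and the access order.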

For open_datatree we might still want to use phony_dims="access" for better performance in cases where the group kwarg is provided. With phony_dims="sort", all groups are initialized when the file is opened (which hurts performance if you do not access the whole tree).

I'll submit a PR using the "access" approach, but we can discuss further.

@kmuehlbauer
Contributor

I've taken a stab at this in #10058.


3 participants