phony_dims must be specified when opening HDF5 file without dimension scales #10049

asteiker · 2025-02-14T00:33:50Z

What is your issue?

Hello, I am in the xarray.DataTree() fan club(!) and have started updating existing tutorials to work with NASA ICESat-2 HDF5 data in xarray, transitioning from earlier xarray.open_dataset() guidance that only allowed a single group to be specified.

I was hoping ICESat-2 files would just open out of the box with datatree now, but I get an error unless I specify phony_dims:

dt = xr.open_datatree(file, phony_dims='sort')
(see the full notebook here, or download an example file here)

@eni-awowale provided some helpful guidance, and it sounds like this may be an interoperability issue between the HDF5 and netcdf-c libraries.

Since xarray can already detect phony_dims, we were wondering if this could be a reasonable addition to the h5netcdf backend engine. This could greatly streamline HDF5 users' workflow so that files can open out of the box w/o needing to specify additional kwargs.


welcome · 2025-02-14T00:33:53Z

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

kmuehlbauer · 2025-02-14T09:10:10Z

@asteiker Thanks for bringing this to attention.

When handling of phony_dims was implemented in h5netcdf to cover for unassigned dimension scales (as netcdf-c/netCDF4-python does), there were two choices:

sort - iterate over all groups and assign phony_dims. This would be in line with netcdf-c/netCDF4-python, creating equivalent dimension names.
access - assign phony_dims, when actually accessing a certain group

The latter has some real performance improvements (depending on what and how data is read), so the decision was to let the user decide which implementation they want to use.
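The two modes can be compared directly in h5netcdf. This is a minimal sketch, assuming h5py and h5netcdf are installed; the file layout and the helper `phony_names` are hypothetical and exist only for this illustration.

```python
import os
import tempfile

import h5netcdf
import h5py
import numpy as np

# Hypothetical two-group file with plain datasets and no dimension scales.
path = os.path.join(tempfile.mkdtemp(), "two_groups.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("a/x", data=np.zeros(3))
    f.create_dataset("b/y", data=np.zeros(4))


def phony_names(mode):
    """Return the dimension names h5netcdf assigns under the given mode."""
    with h5netcdf.File(path, "r", phony_dims=mode) as nc:
        return {
            "a/x": tuple(nc.groups["a"].variables["x"].dimensions),
            "b/y": tuple(nc.groups["b"].variables["y"].dimensions),
        }


names_sort = phony_names("sort")
names_access = phony_names("access")
print("sort:  ", names_sort)
print("access:", names_access)
```

Both modes produce phony_dim_N names; the numbering under "access" is not guaranteed to match what netcdf-c/netCDF4-python would report, which is the trade-off for deferring the work.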

I'm not sure whether this performance gain still holds when acquiring the whole DataTree (instead of only a single group). So we might think about setting phony_dims="sort" or phony_dims="access" as the default for open_datatree in the h5netcdf backend, depending on feedback. I'd gladly review a corresponding PR.

asteiker · 2025-02-17T23:38:26Z

Thank you @kmuehlbauer. I didn't quite understand the implications of the sort vs. access options. @flamingbear @andypbarrett I thought you might be interested in this as well. I do not have the background to submit a PR on my own, but I could certainly help organize an effort if others think this would be valuable to pursue.

kmuehlbauer · 2025-02-18T07:01:11Z

@asteiker No worries.

The main thing is h5netcdf tries to be smart when acquiring a file. It normally just acquires the root-group and keeps sub-level groups lazy. When the user first accesses a sub-level group the needed objects are created. For phony_dims there are two options. phony_dims="access" means it will generate the needed phony_dim_N (N-numbered) dimensions when the user program first accesses a particular group. phony_dims="sort" instead iterates over all groups when the file is opened and creates the needed phony_dim_N dimensions. N is incremented starting from 0 in both cases.

For open_datatree we might still want to use phony_dims="access" for better performance in cases where kwarg group is provided. In case of phony_dims="sort" all groups will be initialized when opening the file (which affects performance, if you do not access the whole tree).
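The single-group case can be sketched as follows, assuming h5py, h5netcdf, and xarray are installed. The file and the group name "gt1l" are hypothetical; only the requested group is read here, which is the situation where "access" is expected to avoid initializing the rest of the file.

```python
import os
import tempfile

import h5py
import numpy as np
import xarray as xr

# Hypothetical file: two groups of plain datasets, no dimension scales.
path = os.path.join(tempfile.mkdtemp(), "single_group.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("gt1l/heights", data=np.arange(5.0))
    f.create_dataset("gt2l/heights", data=np.arange(7.0))

# Open one group only; phony dimensions are created for that group on access.
ds = xr.open_dataset(
    path, group="gt1l", engine="h5netcdf", phony_dims="access"
)
group_dims = tuple(ds["heights"].dims)
group_values = ds["heights"].values.tolist()
ds.close()
print(group_dims, group_values)
```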

I'll submit a PR using "access" approach, but we can discuss further.

kmuehlbauer · 2025-02-18T07:51:55Z

I've taken a stab at this in #10058.

asteiker added the needs triage label Feb 14, 2025

kmuehlbauer added the topic-backends label and removed the needs triage label Feb 14, 2025

TomNicholas added the topic-DataTree label Feb 14, 2025

kmuehlbauer mentioned this issue Feb 18, 2025

Default to phony_dims="access" in h5netcdf-backend #10058

Merged

3 tasks

kmuehlbauer closed this as completed in #10058 Feb 24, 2025

kmuehlbauer mentioned this issue May 30, 2025

Unconstrained forwarding of backend keyword arguments #10377

Open
