Trying out additional example files #9

Closed
scottyhq opened this issue Nov 19, 2020 · 6 comments

@scottyhq

This is really neat, and I'm excited to try things out with some additional HDF files!

I realize the goal is to flesh out the specification and that this is not a general conversion tool yet, but trying it against more HDF files out in the wild might bring issues to light.

Some initial questions/suggestions:

  1. How do we write out the .zchunkstore? It seems things are currently set up to just output logging info.
  2. Add chunk info to the logger output (maybe dtype and chunk size in MB too?), e.g. lggr.debug(f'_ARRAY_CHUNKS = {h5obj.chunks}')
  3. It could be useful to first check that the input file is valid HDF5. This is easy for a local file, but I'm not sure about remote files (see the sketch after this list):
        if not h5py.is_hdf5(f):
            raise ValueError('Not an hdf5 file')
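
For a remote object, h5py.is_hdf5 expects a local file path, so one option (a sketch only; looks_like_hdf5 and the example URL are hypothetical, not part of this repo) is to read the first 8 bytes through fsspec and compare them to the HDF5 superblock signature:

    # Hypothetical helper: check the HDF5 signature of a remote object before
    # attempting translation. Assumes the signature is at offset 0, which holds
    # for files without a user block.
    import fsspec

    HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"

    def looks_like_hdf5(url, **storage_options):
        with fsspec.open(url, "rb", **storage_options) as f:
            return f.read(8) == HDF5_SIGNATURE

    # e.g. looks_like_hdf5("s3://some-bucket/some-file.h5", anon=True)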

What isn't supported?
https://github.com/intake/fsspec-reference-maker/blob/bf41138add53b0201e583aa40840cd4fa5fb907b/fsspec_reference_maker/hdf.py#L103-L106

The first file I tried to generate a .zchunkstore for ran into the above; code and traceback below:

# assuming hdf2zarr is the repo's hdf module, e.g. imported as:
# from fsspec_reference_maker import hdf as hdf2zarr
def ATL06_remote():
    return hdf2zarr.run(
        's3://its-live-data.jpl.nasa.gov/icesat2/alt06/rel003/ATL06_20181230162257_00340206_003_01.h5',
        mode='rb', anon=False, requester_pays=True,
        default_fill_cache=False, default_cache_type='none'
    )
DEBUG:h5-to-zarr:translator:Group: /gt1l/land_ice_segments
DEBUG:h5-to-zarr:translator:Dataset: /gt1l/land_ice_segments/atl06_quality_summary
Traceback (most recent call last):
  File "h5py/h5o.pyx", line 302, in h5py.h5o.cb_obj_simple
  File "/Users/scott/miniconda3/envs/fsspec-ref/lib/python3.8/site-packages/h5py/_hl/group.py", line 591, in proxy
    return func(name, self[name])
  File "/Users/scott/GitHub/fsspec-reference-maker/fsspec_reference_maker/hdf.py", line 105, in translator
    raise RuntimeError(
RuntimeError: /gt1l/land_ice_segments/atl06_quality_summary uses unsupported HDF5 filters

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "./test.py", line 45, in <module>
    ATL06_remote()
  File "./test.py", line 15, in ATL06_remote
    return hdf2zarr.run(
  File "/Users/scott/GitHub/fsspec-reference-maker/fsspec_reference_maker/hdf.py", line 273, in run
    return h5chunks.translate()
  File "/Users/scott/GitHub/fsspec-reference-maker/fsspec_reference_maker/hdf.py", line 54, in translate
    self._h5f.visititems(self.translator)
  File "/Users/scott/miniconda3/envs/fsspec-ref/lib/python3.8/site-packages/h5py/_hl/group.py", line 592, in visititems
    return h5o.visit(self.id, proxy)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
SystemError: <built-in function visit> returned a result with an error set
@rsignell-usgs
Collaborator

@ajelenak is the person who knows most about this part!

@martindurant
Member

martindurant commented Nov 20, 2020

I believe that many of the possible compression types are indeed supported by zarr, but the script only tries gzip here.

I don't know what the szip and lzf compressions are, but they would presumably need to be implemented in numcodecs to be readable, if they are not already there under a different name.
scaleoffset is certainly implemented, but I don't know what fletcher32 is (a checksum?).

It would be worth your while finding out which of these cases applies.
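
One way to check (a minimal sketch, assuming the file can be opened with h5py; report_filters is just an illustrative name) is to walk the file and print each dataset's filter-related properties, then see whether numcodecs already has a matching codec:

    import h5py

    def report_filters(name, obj):
        # Print the filter pipeline settings that h5py exposes per dataset.
        if isinstance(obj, h5py.Dataset):
            print(name,
                  "compression=", obj.compression,   # 'gzip', 'szip', 'lzf' or None
                  "opts=", obj.compression_opts,
                  "shuffle=", obj.shuffle,
                  "scaleoffset=", obj.scaleoffset,
                  "fletcher32=", obj.fletcher32)     # checksum filter

    with h5py.File("ATL06_20181230162257_00340206_003_01.h5", "r") as f:
        f.visititems(report_filters)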

@martindurant
Member

I can confirm that compiling and pointing to the SZIP decompressor is very doable; it could be made into a conda package if really needed. The API is not documented, but looks guessable.

@ajelenak
Collaborator

ajelenak commented Nov 20, 2020

but I don't know what fletcher32 is (a checksum?).

Yes, Fletcher32 is a checksum HDF5 filter. It is used to catch read errors from HDF5 dataset chunks: when the filter is enabled, a checksum is calculated on every chunk write operation, stored with the chunk, and verified when the chunk is read back.
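
For reference, h5py exposes Fletcher32 as a dataset-creation option, so it is easy to check or reproduce (a minimal local sketch, unrelated to this repo's code):

    import h5py
    import numpy as np

    # Write a chunked dataset with the Fletcher32 checksum filter enabled; a
    # checksum is stored with each chunk and verified when the chunk is read.
    with h5py.File("checksum-demo.h5", "w") as f:
        f.create_dataset("x", data=np.arange(100), chunks=(10,), fletcher32=True)

    with h5py.File("checksum-demo.h5", "r") as f:
        print(f["x"].fletcher32)  # True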

@scottyhq
Author

scottyhq commented May 6, 2021

Just wanted to point people here towards h5coro (http://icesat2sliderule.org/h5coro), an optimized read-only approach for working with the ICESat-2 HDF5 data described in this issue. It would be interesting to compare it against fsspec-reference-maker.
