HDF5 to Zarr #87

Closed
alimanfoo opened this issue Oct 21, 2016 · 11 comments · Fixed by #217

@alimanfoo (Member)

It would be useful to have a convenience function that copies datasets across from an HDF5 file or group, recursing down through the group hierarchy. Basically, a recursive copy of an HDF5 group into a Zarr group.

@jakirkham (Member)

Also agree this would be cool. Personally I would be interested in conversion back from Zarr to HDF5 as well.

However, how do you handle some of the more complicated features when going to Zarr? For example, HDF5 has internal links, external links, region references, named datatypes, enumerations, etc. FWICT these don't seem to be supported by Zarr, though I suppose they could be. What would you propose to do with them?

@jakirkham (Member) commented Nov 30, 2016

Since I wanted to play with Zarr some more and needed a way to get an HDF5 file into Zarr, I wrote a primitive function to do this. Happy to PR it somewhere as you see fit. It only handles Datasets and Groups.

import os

import h5py
import zarr


def hdf5_to_zarr(hdf5_file, zarr_group=None):
    # Python 2/3 compatibility for the string type check below.
    try:
        unicode
    except NameError:
        unicode = str

    # Accept either a filename or an already open h5py.File.
    opened = False
    if isinstance(hdf5_file, (bytes, unicode)):
        hdf5_filename = hdf5_file
        hdf5_file = h5py.File(hdf5_file, "r")
        opened = True
    else:
        hdf5_filename = hdf5_file.filename

    # Default to a ".zarr" store alongside the HDF5 file.
    if zarr_group is None:
        zarr_name = os.path.splitext(hdf5_filename)[0] + os.extsep + "zarr"
        zarr_group = zarr.open_group(zarr_name, mode="w")

    def copy(name, obj):
        if isinstance(obj, h5py.Group):
            zarr_obj = zarr_group.create_group(name)
        elif isinstance(obj, h5py.Dataset):
            zarr_obj = zarr_group.create_dataset(name, data=obj, chunks=obj.chunks)
        else:
            assert False, "Unsupported HDF5 type."

        zarr_obj.attrs.update(obj.attrs)

    hdf5_file.visititems(copy)

    if opened:
        hdf5_file.close()

    return zarr_group
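
For instance (data.h5 being a hypothetical file name), usage might look like:

grp = hdf5_to_zarr("data.h5")  # writes data.zarr alongside the HDF5 file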

@jakirkham (Member)

To do the same thing when converting back to HDF5, it would be pretty handy to be able to use the same visitor pattern used here.

xref: https://github.com/alimanfoo/zarr/issues/92
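
A minimal sketch of the reverse direction, assuming Zarr gains a Group.visititems with h5py-style (name, obj) callbacks (the subject of the issue linked above); the zarr_to_hdf5 name and signature are hypothetical:

import h5py
import zarr


def zarr_to_hdf5(zarr_group, hdf5_file):
    # Mirror a Zarr hierarchy into an open h5py.File, copying groups,
    # arrays, and attributes. Assumes zarr_group.visititems exists and
    # behaves like h5py's, visiting groups before their members.
    def copy(name, obj):
        if isinstance(obj, zarr.hierarchy.Group):
            h5_obj = hdf5_file.create_group(name)
        else:
            h5_obj = hdf5_file.create_dataset(name, data=obj[:], chunks=obj.chunks)
        h5_obj.attrs.update(obj.attrs)

    zarr_group.visititems(copy)
    return hdf5_file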

@alimanfoo (Member, Author) commented Nov 30, 2016 via email

@jakirkham (Member)

Yeah, switching between these various format options does get a bit hairy. I would be curious, for my own edification, to understand a bit more about the motivations for these sorts of changes.

@alimanfoo (Member, Author) commented Nov 30, 2016 via email

@jakirkham (Member)

This might be a bit perverse, but one thing we could consider, which I believe you and others have mentioned before, is that we could create a MutableMapping-based HDF5 store. Then we could leverage store conversion methods (https://github.com/alimanfoo/zarr/issues/137) to move the data to HDF5. I'm not sure whether that would yield the desired outcome, but it should leave things pretty general.

@alimanfoo (Member, Author)

When I mentioned previously using HDF5 as a Zarr store, that was the idea that you could (ab)use an HDF5 file as a simple key-value store. E.g.:

In [61]: from collections import MutableMapping

In [62]: import zarr

In [63]: class HDF5Store(MutableMapping):
    ...:     def __init__(self, h5g):
    ...:         self.h5g = h5g
    ...:     def __getitem__(self, key):
    ...:         ds = self.h5g[key]
    ...:         if ds.shape:
    ...:             return ds[:]
    ...:         else:
    ...:             return ds[()]
    ...:     def __setitem__(self, key, value):
    ...:         try:
    ...:             del self.h5g[key]
    ...:         except KeyError:
    ...:             pass
    ...:         self.h5g[key] = np.asarray(value)
    ...:     def __delitem__(self, key):
    ...:         del self.h5g[key]
    ...:     def __iter__(self):
    ...:         return iter(self.h5g)
    ...:     def __len__(self):
    ...:         return len(self.h5g)

In [64]: import numpy as np

In [65]: import h5py

In [66]: h = h5py.File('spike.h5', mode='a')

In [67]: store = HDF5Store(h)

In [68]: z = zarr.zeros(100, chunks=10, store=store, overwrite=True)

In [69]: list(h)
Out[69]: ['.zarray', '.zattrs']

In [70]: z[:] = 42

In [71]: list(h)
Out[71]: ['.zarray', '.zattrs', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

In [72]: h['0']
Out[72]: <HDF5 dataset "0": shape (), type "|S96">

Above, each chunk of the Zarr array gets stored as a separate dataset in the HDF5 file.

This is a different level from what this issue is asking for, I think, where you want to mirror the groups and arrays of a Zarr hierarchy into groups and datasets in an HDF5 file (or vice versa).

@jakirkham (Member)

I see. That is very different indeed.

@jreadey commented Aug 17, 2017

Hey, I just came across this project, very cool!

There are a lot of similarities between the Zarr storage layout and what I created for the S3-backed HDF Service. For some background see this presentation from SciPy: http://s3.amazonaws.com/hdfgroup/docs/hdf_data_services_scipy2017.pdf.

And here's documentation on the schema layout used by the service: http://s3.amazonaws.com/hdfgroup/docs/obj_store_schema.pdf.

From what I've seen of the Zarr format, I suspect the differences are mainly in a more "HDF5-centric" way of describing the data, e.g. types like region references that aren't part of the NumPy world.

Also, the HDF5 object schema assigns a UUID to each object and then prepends a 5-character hash to the object key. This helps in cases where many clients try to read the same S3 object (S3 hits a limit at about 100 req/sec for requests where the first few characters of the key match).
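
For illustration only (a hypothetical sketch, not the service's actual algorithm; see the schema doc linked above for the real layout), such a hashed key might be computed like this:

import hashlib

def hashed_key(object_key):
    # Prepend a short hash so lexicographically similar keys are spread
    # across the S3 keyspace (hypothetical sketch; md5 chosen arbitrarily).
    prefix = hashlib.md5(object_key.encode()).hexdigest()[:5]
    return prefix + "-" + object_key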

As part of the h5pyd project (https://github.com/HDFGroup/h5pyd) I've created apps for listing content, downloading from S3 to local HDF5 files, uploading to S3, etc. The apps use the service API though, rather than reading/writing S3 directly.

It would be great to find a way to collaborate between these projects. There will certainly be situations where clients would prefer to be able to read/write to object storage directly rather than going through a service interface.

@alimanfoo (Member, Author)

Hi John, thanks a lot for getting in touch. I wasn't aware of the HDF service work, very cool also!

FWIW, one thing that @mrocklin pushed for in Zarr was to encapsulate object storage behind the MutableMapping interface. This has played out very nicely so far: most of Zarr's logic can sit above this interface, and all issues to do with communicating with object storage (whatever type of storage it happens to be) can be hidden behind it, making it pretty straightforward to add support for other types of object storage.

I wasn't aware of the S3 rate limitation issue, that's good to know. Zarr uses a very simple string for chunk IDs, built from the indices of the chunk within the chunk grid (e.g., '0.0' for the top-left chunk of a 2D array), so this could definitely be a problem if many clients were trying to read from the same region.
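
For example, using a plain dict as the store makes the key scheme visible:

import zarr

# Illustrating chunk key naming with an in-memory dict store.
store = dict()
z = zarr.zeros((20, 20), chunks=(10, 10), store=store)
z[:] = 42
sorted(store)
# ['.zarray', '0.0', '0.1', '1.0', '1.1'] (possibly plus '.zattrs',
# depending on the zarr version) -- keys are row.column chunk indices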

Very happy to explore opportunities for collaboration at any level.

@alimanfoo added this to the v2.2 milestone Nov 19, 2017
@alimanfoo added the enhancement label Nov 21, 2017
@alimanfoo mentioned this issue Dec 9, 2017
@alimanfoo self-assigned this Dec 9, 2017
@alimanfoo added the in progress label Dec 9, 2017