HDF5 to Zarr #87
Also agree this would be cool. Personally would be interested in conversion back from Zarr to HDF5 as well. However, how do you handle some of the more complicated features when going to Zarr. For example, HDF5 has internal links, external links, region references, named datatypes, enumerations, etc. FWICT it doesn't seem like these things are supported by Zarr. Though I suppose they could be. What would you propose to do with these things? |
As I was wanting to play with Zarr some more and needed a way to get an HDF5 file into Zarr, I wrote a primitive function to do this. Happy to PR it somewhere as you see fit. It only handles Datasets and Groups.

import os

import h5py
import zarr


def hdf5_to_zarr(hdf5_file, zarr_group=None):
    # Python 2/3 compatibility for the string type check below.
    try:
        unicode
    except NameError:
        unicode = str

    # Accept either a filename or an already open h5py.File.
    opened = False
    if isinstance(hdf5_file, (bytes, unicode)):
        hdf5_filename = hdf5_file
        hdf5_file = h5py.File(hdf5_file, "r")
        opened = True
    else:
        hdf5_filename = hdf5_file.filename

    # By default, create a Zarr group alongside the HDF5 file.
    if zarr_group is None:
        zarr_name = os.path.splitext(hdf5_filename)[0] + os.extsep + "zarr"
        zarr_group = zarr.open_group(zarr_name, mode="w")

    # Visitor that mirrors each HDF5 group/dataset into the Zarr group,
    # copying attributes along the way.
    def copy(name, obj):
        if isinstance(obj, h5py.Group):
            zarr_obj = zarr_group.create_group(name)
        elif isinstance(obj, h5py.Dataset):
            zarr_obj = zarr_group.create_dataset(name, data=obj, chunks=obj.chunks)
        else:
            assert False, "Unsupported HDF5 type."
        zarr_obj.attrs.update(obj.attrs)

    hdf5_file.visititems(copy)

    # Only close the file if we opened it here.
    if opened:
        hdf5_file.close()

    return zarr_group
|
To do the same thing when converting back to HDF5, it would be pretty handy to be able to use the same visitor pattern used here.
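In the meantime, a recursive sketch along these lines could do the reverse (illustrative only; zarr_to_hdf5 is a made-up name, and this assumes zarr 2.x, where iterating a group yields its member names):

import h5py
import zarr


def zarr_to_hdf5(zarr_group, hdf5_group):
    # Recursively mirror a Zarr group into an open h5py File or Group.
    for name in zarr_group:
        obj = zarr_group[name]
        if isinstance(obj, zarr.Group):
            h5_obj = hdf5_group.create_group(name)
            zarr_to_hdf5(obj, h5_obj)
        else:
            h5_obj = hdf5_group.create_dataset(name, data=obj[...],
                                               chunks=obj.chunks)
        for k, v in obj.attrs.items():
            h5_obj.attrs[k] = v


# e.g. zarr_to_hdf5(zarr.open_group("example.zarr"), h5py.File("example.h5", "w"))
|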
I was just thinking to implement something simple that handled arrays (datasets) and groups and skipped any of the other things you can find in HDF5, at least for now.

The other slight complication is that, when copying data from HDF5 into Zarr, you might not always want to use exactly the same compression options, chunk layout, etc. E.g., I have a use case where some datasets are better with Blosc+LZ4 and others are better with Blosc+Zstd, and some are better with C layout while others are better with F chunk layout. It would probably be reasonable as a first implementation to ignore all this and just use a single configuration for all arrays that get copied over, but thought I'd mention it for the record. The only other idea I had was to allow the user to specify some mapping of paths to compression configurations, with possibly allowing wildcards in the paths, and these then get matched as you walk down the hierarchy copying stuff over, but that might be a little complex to start with.
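Something like this rough sketch is the kind of thing I mean (purely illustrative, not a proposed API; the compressors mapping, the fnmatch-based matching and the path patterns are all made up, assuming a zarr 2.x setup where Blosc lives in numcodecs):

from fnmatch import fnmatch

import h5py
import zarr
from numcodecs import Blosc

# Hypothetical user-supplied mapping of path patterns (wildcards allowed)
# to compressor configurations; first matching pattern wins.
compressors = {
    "calldata/*": Blosc(cname="zstd", clevel=5, shuffle=Blosc.BITSHUFFLE),
    "variants/*": Blosc(cname="lz4", clevel=5, shuffle=Blosc.SHUFFLE),
}
default_compressor = Blosc(cname="lz4", clevel=5)


def pick_compressor(name):
    for pattern, compressor in compressors.items():
        if fnmatch(name, pattern):
            return compressor
    return default_compressor


def hdf5_to_zarr_with_compressors(hdf5_file, zarr_group):
    # Same visitor pattern as the function above, but choosing a compressor
    # per dataset path as the hierarchy is walked.
    def copy(name, obj):
        if isinstance(obj, h5py.Group):
            zarr_group.create_group(name)
        elif isinstance(obj, h5py.Dataset):
            zarr_group.create_dataset(name, data=obj, chunks=obj.chunks,
                                      compressor=pick_compressor(name))
    hdf5_file.visititems(copy)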
|
Yeah, switching these various format options does get a bit hairy. Would be curious for my own edification to understand a bit more about the motivations for these sorts of changes. |
FWIW when I have an array with more than 1 dimension, I usually try storing at least a test region of the data using C and F layout and see which gives better compression ratio. Which gives better compression will depend on the autocorrelation structure in the data. E.g., for a 2D array, are the data more correlated across rows or down columns?

Regarding different compression options, I usually try Blosc+LZ4 and Blosc+Zstd and look at compression ratio and also benchmark read and write speed. Also worth trying different compression levels and shuffle options, which can make a big difference. You might have seen this already but here's a benchmark I did for genotype data (which is the main thing I work with): http://alimanfoo.github.io/2016/09/21/genotype-compression-benchmark.html. Note this is only benchmarking data stored in memory; if you are putting data on disk then the optimal compression configuration will probably change.
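As a rough illustration of the kind of comparison I mean (a sketch only, assuming zarr 2.x with numcodecs; the array below is just a stand-in for a representative region of your own data):

import numpy as np
import zarr
from numcodecs import Blosc

# Stand-in data; substitute a representative slab of the real dataset.
data = np.random.randint(0, 4, size=(10000, 1000)).astype("i1")

for order in ("C", "F"):
    for cname in ("lz4", "zstd"):
        compressor = Blosc(cname=cname, clevel=5, shuffle=Blosc.SHUFFLE)
        z = zarr.array(data, chunks=(1000, 100), order=order, compressor=compressor)
        print(order, cname, "ratio = %.1f" % (z.nbytes / z.nbytes_stored))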
|
This might be a bit perverse, but one thing we could consider, which I believe you and others have mentioned before, is we could create a Zarr store backed by an HDF5 file. |
When I mentioned previously using HDF5 as a Zarr store, that was the idea that you could (ab)use an HDF5 file as a simple key-value store. E.g.:
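Something along these lines (just a rough sketch of the idea, not an existing class; the HDF5Store name and details are illustrative, assuming zarr 2.x and h5py):

from collections.abc import MutableMapping

import h5py
import numpy as np
import zarr


class HDF5Store(MutableMapping):
    # (Ab)use an HDF5 file as a simple key-value store: every key (chunk or
    # metadata document) becomes a small uint8 dataset in the HDF5 file.

    def __init__(self, path):
        self.f = h5py.File(path, "a")

    def __getitem__(self, key):
        return self.f[key][()].tobytes()

    def __setitem__(self, key, value):
        if key in self.f:
            del self.f[key]
        self.f[key] = np.frombuffer(value, dtype="u1")

    def __delitem__(self, key):
        del self.f[key]

    def __iter__(self):
        names = []
        self.f.visititems(lambda name, obj: names.append(name)
                          if isinstance(obj, h5py.Dataset) else None)
        return iter(names)

    def __len__(self):
        return sum(1 for _ in self)


store = HDF5Store("example_store.h5")
z = zarr.open_array(store=store, mode="w", shape=(1000, 1000),
                    chunks=(100, 100), dtype="f8")
z[:] = 42.0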
Above, each chunk of the Zarr array gets stored in a separate dataset in the HDF5 file. This is a different level from what this issue is asking for, I think, where you want to mirror across the groups and arrays of a Zarr hierarchy into groups and datasets in an HDF5 file (or vice versa). |
I see. That is very different indeed. |
Hey, I just came across this project, very cool! There are a lot of similarities between the Zarr storage layout and what I created for the S3-backed HDF Service. For some background see this presentation from SciPy: http://s3.amazonaws.com/hdfgroup/docs/hdf_data_services_scipy2017.pdf. And here's documentation on the schema layout used by the service: http://s3.amazonaws.com/hdfgroup/docs/obj_store_schema.pdf.

From what I've seen of the Zarr format I suspect the differences are mainly in a more "HDF5-centric" way of describing the data, e.g. types like Region References that aren't part of the Numpy world. Also the HDF5 object schema assigns UUIDs to each object and then prepends a 5-character hash in front of the object key. This helps out in certain cases where you have many clients trying to read the same S3 object (S3 hits a limit at about 100 req/sec for requests where the first few characters of the key match).

As part of the h5pyd project (https://github.com/HDFGroup/h5pyd) I've created apps for listing content, downloading from S3 to local HDF5 files, uploading to S3, etc. The apps use the service API though, rather than reading/writing to S3 directly.

It would be great to find a way to collaborate between these projects. There will certainly be situations where clients would prefer to be able to read/write to object storage directly rather than going through a service interface. |
Hi John, thanks a lot for getting in touch, I wasn't aware of the HDF service work, very cool also! FWIW, one thing that @mrocklin pushed for in Zarr was to encapsulate object storage behind the MutableMapping interface. This has played out very nicely so far, as most of Zarr's logic can sit above this interface, and all issues to do with communicating with object storage (whatever type of storage it happens to be) can be hidden behind it, making it pretty straightforward to add support for other types of object storage.

I wasn't aware of the S3 rate limitation issue, that's good to know. Zarr uses a very simple string for chunk IDs built from indices of the chunk within the chunk grid (e.g., '0.0' for the top-left chunk in a 2D array), so this could definitely be a problem if many clients were trying to read from the same region. Very happy to explore opportunities for collaboration at any level.
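As a tiny illustration of the store abstraction and the chunk key naming (a sketch only, assuming zarr 2.x; a '.zattrs' key would also appear once attributes are set):

import zarr

store = {}  # any MutableMapping can serve as a Zarr store
z = zarr.open_array(store=store, mode="w", shape=(4, 4), chunks=(2, 2), dtype="i4")
z[:] = 1
print(sorted(store))  # ['.zarray', '0.0', '0.1', '1.0', '1.1']
|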
It would be useful to have a convenience function that copies datasets across from an HDF5 file or group, recursing down through the group hierarchy. Basically a recursive copy of HDF5 group into a Zarr group.