HDF5 to Zarr #87
Also agree this would be cool. Personally would be interested in conversion back from Zarr to HDF5 as well. However, how do you handle some of the more complicated features when going to Zarr. For example, HDF5 has internal links, external links, region references, named datatypes, enumerations, etc. FWICT it doesn't seem like these things are supported by Zarr. Though I suppose they could be. What would you propose to do with these things? |
As I was wanting to play with Zarr some more and needed a way to get an HDF5 file into Zarr, I wrote a primitive function to do this. Happy to PR it somewhere as you see fit. It only handles Datasets and Groups.

import os

import h5py
import zarr


def hdf5_to_zarr(hdf5_file, zarr_group=None):
    # Python 2/3 compatibility for the string type check below.
    try:
        unicode
    except NameError:
        unicode = str

    # Accept either a filename or an already open h5py.File.
    opened = False
    if isinstance(hdf5_file, (bytes, unicode)):
        hdf5_filename = hdf5_file
        hdf5_file = h5py.File(hdf5_file, "r")
        opened = True
    else:
        hdf5_filename = hdf5_file.filename

    # By default, create a Zarr group alongside the HDF5 file.
    if zarr_group is None:
        zarr_name = os.path.splitext(hdf5_filename)[0] + os.extsep + "zarr"
        zarr_group = zarr.open_group(zarr_name, mode="w")

    # Visitor that mirrors each HDF5 group/dataset into the Zarr group,
    # copying attributes along the way.
    def copy(name, obj):
        if isinstance(obj, h5py.Group):
            zarr_obj = zarr_group.create_group(name)
        elif isinstance(obj, h5py.Dataset):
            zarr_obj = zarr_group.create_dataset(name, data=obj, chunks=obj.chunks)
        else:
            assert False, "Unsupported HDF5 type."
        zarr_obj.attrs.update(obj.attrs)

    hdf5_file.visititems(copy)

    # Only close the file if we opened it here.
    if opened:
        hdf5_file.close()

    return zarr_group
|
To do the same thing when converting back to HDF5, it would be pretty handy to be able to use the same visitor pattern used here.
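In the meantime, a recursive sketch along these lines could do the reverse (illustrative only; zarr_to_hdf5 is a made-up name, and this assumes zarr 2.x, where iterating a group yields its member names):

import h5py
import zarr


def zarr_to_hdf5(zarr_group, hdf5_group):
    # Recursively mirror a Zarr group into an open h5py File or Group.
    for name in zarr_group:
        obj = zarr_group[name]
        if isinstance(obj, zarr.Group):
            h5_obj = hdf5_group.create_group(name)
            zarr_to_hdf5(obj, h5_obj)
        else:
            h5_obj = hdf5_group.create_dataset(name, data=obj[...],
                                               chunks=obj.chunks)
        for k, v in obj.attrs.items():
            h5_obj.attrs[k] = v


# e.g. zarr_to_hdf5(zarr.open_group("example.zarr"), h5py.File("example.h5", "w"))
|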
I was just thinking to implement something simple that handled arrays (datasets) and groups and skipped any of the other things you can find in HDF5, at least for now.

The other slight complication is that, when copying data from HDF5 into Zarr, you might not always want to use exactly the same compression options, chunk layout, etc. E.g., I have a use case where some datasets are better with Blosc+LZ4 and others are better with Blosc+Zstd, and some are better with C layout while others are better with F chunk layout. It would probably be reasonable as a first implementation to ignore all this and just use a single configuration for all arrays that get copied over, but thought I'd mention it for the record. The only other idea I had was to allow the user to specify some mapping of paths to compression configurations, with possibly allowing wildcards in the paths, and these then get matched as you walk down the hierarchy copying stuff over, but that might be a little complex to start with.
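Something like this rough sketch is the kind of thing I mean (purely illustrative, not a proposed API; the compressors mapping, the fnmatch-based matching and the path patterns are all made up, assuming a zarr 2.x setup where Blosc lives in numcodecs):

from fnmatch import fnmatch

import h5py
import zarr
from numcodecs import Blosc

# Hypothetical user-supplied mapping of path patterns (wildcards allowed)
# to compressor configurations; first matching pattern wins.
compressors = {
    "calldata/*": Blosc(cname="zstd", clevel=5, shuffle=Blosc.BITSHUFFLE),
    "variants/*": Blosc(cname="lz4", clevel=5, shuffle=Blosc.SHUFFLE),
}
default_compressor = Blosc(cname="lz4", clevel=5)


def pick_compressor(name):
    for pattern, compressor in compressors.items():
        if fnmatch(name, pattern):
            return compressor
    return default_compressor


def hdf5_to_zarr_with_compressors(hdf5_file, zarr_group):
    # Same visitor pattern as the function above, but choosing a compressor
    # per dataset path as the hierarchy is walked.
    def copy(name, obj):
        if isinstance(obj, h5py.Group):
            zarr_group.create_group(name)
        elif isinstance(obj, h5py.Dataset):
            zarr_group.create_dataset(name, data=obj, chunks=obj.chunks,
                                      compressor=pick_compressor(name))
    hdf5_file.visititems(copy)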
|
Yeah, switching these various format options does get a bit hairy. Would be curious for my own edification to understand a bit more about the motivations for these sorts of changes. |
FWIW when I have an array with more than 1 dimension, I usually try storing at least a test region of the data using C and F layout and see which gives better compression ratio. Which gives better compression will depend on the autocorrelation structure in the data. E.g., for a 2D array, are the data more correlated across rows or down columns?

Regarding different compression options, I usually try Blosc+LZ4 and Blosc+Zstd and look at compression ratio and also benchmark read and write speed. Also worth trying different compression levels and shuffle options, which can make a big difference. You might have seen this already but here's a benchmark I did for genotype data (which is the main thing I work with): http://alimanfoo.github.io/2016/09/21/genotype-compression-benchmark.html. Note this is only benchmarking data stored in memory; if you are putting data on disk then the optimal compression configuration will probably change.
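As a rough illustration of the kind of comparison I mean (a sketch only, assuming zarr 2.x with numcodecs; the array below is just a stand-in for a representative region of your own data):

import numpy as np
import zarr
from numcodecs import Blosc

# Stand-in data; substitute a representative slab of the real dataset.
data = np.random.randint(0, 4, size=(10000, 1000)).astype("i1")

for order in ("C", "F"):
    for cname in ("lz4", "zstd"):
        compressor = Blosc(cname=cname, clevel=5, shuffle=Blosc.SHUFFLE)
        z = zarr.array(data, chunks=(1000, 100), order=order, compressor=compressor)
        print(order, cname, "ratio = %.1f" % (z.nbytes / z.nbytes_stored))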
|
This might be a bit perverse, but one thing we could consider, which I believe you and others have mentioned before, is we could create a Zarr store backed by an HDF5 file. |
When I mentioned previously using HDF5 as a Zarr store, that was the idea that you could (ab)use an HDF5 file as a simple key-value store. E.g.:
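Something along these lines (just a rough sketch of the idea, not an existing class; the HDF5Store name and details are illustrative, assuming zarr 2.x and h5py):

from collections.abc import MutableMapping

import h5py
import numpy as np
import zarr


class HDF5Store(MutableMapping):
    # (Ab)use an HDF5 file as a simple key-value store: every key (chunk or
    # metadata document) becomes a small uint8 dataset in the HDF5 file.

    def __init__(self, path):
        self.f = h5py.File(path, "a")

    def __getitem__(self, key):
        return self.f[key][()].tobytes()

    def __setitem__(self, key, value):
        if key in self.f:
            del self.f[key]
        self.f[key] = np.frombuffer(value, dtype="u1")

    def __delitem__(self, key):
        del self.f[key]

    def __iter__(self):
        names = []
        self.f.visititems(lambda name, obj: names.append(name)
                          if isinstance(obj, h5py.Dataset) else None)
        return iter(names)

    def __len__(self):
        return sum(1 for _ in self)


store = HDF5Store("example_store.h5")
z = zarr.open_array(store=store, mode="w", shape=(1000, 1000),
                    chunks=(100, 100), dtype="f8")
z[:] = 42.0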
Above, each chunk of the Zarr array gets stored in a separate dataset in the HDF5 file. This is a different level from what this issue is asking for, I think, where you want to mirror across the groups and arrays of a Zarr hierarchy into groups and datasets in an HDF5 file (or vice versa). |
I see. That is very different indeed. |
Hey, I just came across this project, very cool! There are a lot of similarities between the Zarr storage layout and what I created for the S3-backed HDF Service. For some background see this presentation from SciPy: http://s3.amazonaws.com/hdfgroup/docs/hdf_data_services_scipy2017.pdf. And here's documentation on the schema layout used by the service: http://s3.amazonaws.com/hdfgroup/docs/obj_store_schema.pdf.

From what I've seen of the Zarr format I suspect the differences are mainly in a more "HDF5-centric" way of describing the data, e.g. types like Region References that aren't part of the Numpy world. Also the HDF5 object schema assigns UUIDs to each object and then prepends a 5-character hash in front of the object key. This helps out in certain cases where you have many clients trying to read the same S3 object (S3 hits a limit at about 100 req/sec for requests where the first few characters of the key match).

As part of the h5pyd project (https://github.com/HDFGroup/h5pyd) I've created apps for listing content, downloading from S3 to local HDF5 files, uploading to S3, etc. The apps use the service API though, rather than reading/writing to S3 directly.

It would be great to find a way to collaborate between these projects. There will certainly be situations where clients would prefer to be able to read/write to object storage directly rather than going through a service interface. |
Hi John, thanks a lot for getting in touch, I wasn't aware of the HDF service work, very cool also! FWIW, one thing that @mrocklin pushed for in Zarr was to encapsulate object storage behind the MutableMapping interface. This has played out very nicely so far, as most of Zarr's logic can sit above this interface, and all issues to do with communicating with object storage (whatever type of storage it happens to be) can be hidden behind it, making it pretty straightforward to add support for other types of object storage.

I wasn't aware of the S3 rate limitation issue, that's good to know. Zarr uses a very simple string for chunk IDs built from indices of the chunk within the chunk grid (e.g., '0.0' for the top-left chunk in a 2D array), so this could definitely be a problem if many clients were trying to read from the same region. Very happy to explore opportunities for collaboration at any level.
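As a tiny illustration of the store abstraction and the chunk key naming (a sketch only, assuming zarr 2.x; a '.zattrs' key would also appear once attributes are set):

import zarr

store = {}  # any MutableMapping can serve as a Zarr store
z = zarr.open_array(store=store, mode="w", shape=(4, 4), chunks=(2, 2), dtype="i4")
z[:] = 1
print(sorted(store))  # ['.zarray', '0.0', '0.1', '1.0', '1.1']
|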
It would be useful to have a convenience function that copies datasets across from an HDF5 file or group, recursing down through the group hierarchy. Basically a recursive copy of HDF5 group into a Zarr group.