Hierarchical storage #37
…stored() on store classes
Initial docs are here: http://zarr.readthedocs.io/en/hierarchy/tutorial.html#groups |
@shoyer, @mrocklin, this PR has some initial proof-of-concept work on hierarchical organisation of Zarr arrays via groups. I'm not in any hurry, but if you get a moment I'd greatly appreciate any comments. Mostly undocumented yet except for a small tutorial section. I've tried to keep a layer of abstraction between the underlying storage and the grouping implementation, so that different storage systems could be used. storage.py has been modified to provide underlying storage support via the MemoryStore and DirectoryStore classes. These both implement a new HierarchicalStore interface which defines the API for hierarchical storage. I think the same interface could also be implemented for Zip files, S3, etc. hierarchy.py is a new module implementing the Group class plus some convenience functions. |
    def __getitem__(self, key):
        names = [s for s in key.split('/') if s]
        if not names:
            raise ValueError(key)
KeyError instead?
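To illustrate why KeyError may be the more natural choice here, the following is a minimal sketch, not the code from this PR. `ToyGroup` is a hypothetical stand-in for the Group class, built on `collections.abc.Mapping`. The Mapping mixin methods (`in`, `.get()`) catch KeyError from `__getitem__`, so an empty path behaves like any other missing member; a ValueError would propagate out of those calls instead.

```python
from collections.abc import Mapping


class ToyGroup(Mapping):
    """Hypothetical stand-in for a group; members are keyed by normalised path."""

    def __init__(self, members):
        self._members = dict(members)

    def __getitem__(self, key):
        # Drop empty path segments, so '/foo//bar/' is treated as 'foo/bar'.
        names = [s for s in key.split('/') if s]
        if not names:
            # KeyError keeps the Mapping mixins (in, get) working as expected.
            raise KeyError(key)
        return self._members['/'.join(names)]

    def __iter__(self):
        return iter(self._members)

    def __len__(self):
        return len(self._members)


g = ToyGroup({'foo/bar': 42})
found = g['/foo//bar/']           # normalised lookup
has_empty = '' in g               # Mapping.__contains__ swallows KeyError
fallback = g.get('/', 'missing')  # Mapping.get swallows KeyError
```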
As with last time, I'm curious how far we could go with just providing a |
Thanks @mrocklin. I just pushed an initial implementation of a ZipStore supporting hierarchical storage, just to figure out how that could work. It's a bit awkward as you can't really create a directory within a Zip file, but it works. I also cut down the HierarchicalStorage API to the bare minimum. If you can think of some way that this could all be done via the MutableMapping interface I'd be happy to go down that route, I just couldn't think of an obvious way to do it. |
@mrocklin I think I can see how to do this all with stores with just the MutableMapping interface, will be much cleaner, nice idea. |
Hooray! |
To make hierarchies possible, I think I need the following capabilities from a store implementation.

I propose to use keys containing forward slashes ("/") to read from and write to resources at different levels of the storage hierarchy. E.g., the store should support operations like The underlying implementation of the store does not need to actually use any kind of hierarchy, it can treat these keys as just strings if it wants to, although some implementations will make use of the hierarchy. E.g., a store using a dict or a zip file or S3 as the underlying container might just treat 'foo/bar' as a string key, but a store using directories on the file system would map the key 'foo/bar' to a file named 'bar' inside a folder named 'foo'.

The only other thing I think I need is the ability to delete everything under some part of the storage hierarchy. I need this to be able to delete members from a group. I also need this when initialising a new array that is supposed to overwrite an old array in the same storage location. To get this functionality I propose that the store should handle keys with a trailing slash in a special way. E.g., a call to Treating a key with a trailing slash as a prefix is a similar idea to how S3 implements "folders" as essentially prefixes on resource names. It might also be helpful if a store had special handling for prefix keys in @mrocklin any comments? |
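As a rough illustration of the two ideas in this comment, here is a sketch under stated assumptions. The helper names `key_to_path` and `delete_prefix` are hypothetical and do not come from the PR. The first shows how a directory-backed store could map a slash-separated key to a file path; the second shows trailing-slash prefix deletion on a flat store (a plain dict) that treats keys as opaque strings.

```python
import os


def key_to_path(root, key):
    """Map a '/'-separated key to a file path under root (directory-backed store)."""
    return os.path.join(root, *key.split('/'))


def delete_prefix(store, prefix):
    """Delete every key under a trailing-slash prefix; return how many were removed."""
    if not prefix.endswith('/'):
        raise ValueError('prefix must end with "/"')
    # Collect first, then delete, to avoid mutating the mapping while iterating.
    doomed = [k for k in store if k.startswith(prefix)]
    for k in doomed:
        del store[k]
    return len(doomed)


# A flat store: the keys only look hierarchical.
store = {'foo/bar': b'1', 'foo/baz/qux': b'2', 'foobar': b'3'}
removed = delete_prefix(store, 'foo/')
```

Requiring the trailing slash means 'foo/' does not match the unrelated key 'foobar', which a bare `startswith('foo')` would also remove.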
Once you have a way to list "folder contents", you could also just delete things by calling
Rather than overloading So the other option is to explicitly store hierarchical lookups in JSON/dicts. But also that's not such a clean solution for stores already backed by a directory like structure. |
I like the idea of using MutableMapping + a few extra methods if they're around, and if not then reverting back to slower operations that match the MutableMapping API. Alternatively maybe we can flesh out group metadata in |
Thanks @shoyer, @mrocklin, I think I'm going to try an implementation that uses MutableMapping plus a couple of extra "hierarchy-aware" methods (ls, rm) if available but falling back to just MutableMapping and iterating through all keys if those extra methods are not available. I previously toyed with the idea of putting group metadata in a JSON resource like |
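The fallback approach described here could look roughly like the sketch below. It is an assumption about the shape of the idea, not the PR's implementation: the module-level `ls` and `rm` helpers and the `FastStore` class are hypothetical. Each helper uses the store's own hierarchy-aware method when one exists, and otherwise scans all keys through the plain MutableMapping interface.

```python
def _as_prefix(path):
    """Normalise a path to '' or a string ending in '/'."""
    path = path.strip('/')
    return path + '/' if path else ''


def ls(store, path=''):
    """List the immediate children under path."""
    if hasattr(store, 'ls'):
        return store.ls(path)           # fast path: store knows its hierarchy
    prefix = _as_prefix(path)
    children = set()
    for key in store:                   # slow path: scan every key
        if key.startswith(prefix):
            children.add(key[len(prefix):].split('/', 1)[0])
    return sorted(children)


def rm(store, path):
    """Remove everything under path."""
    if hasattr(store, 'rm'):
        store.rm(path)
        return
    prefix = _as_prefix(path)
    for key in [k for k in store if k.startswith(prefix)]:
        del store[key]


class FastStore(dict):
    """Hypothetical store that provides its own ls and records its use."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.calls = []

    def ls(self, path=''):
        self.calls.append(path)
        return ['from-fast-path']


plain = {'.zgroup': b'', 'foo/.zgroup': b'', 'foo/bar/.zarray': b''}
top = ls(plain)
inside = ls(plain, 'foo')
rm(plain, 'foo')

fast = FastStore()
fast_result = ls(fast, 'foo')
```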
I like this idea. It could be handled by having two Mappings that could, in some cases, be the same.

    meta = dict()
    data = dict()
    x = group(meta=meta, data=data)

or

    data = dict()
    x = group(meta=data, data=data)

As long as you keep key spaces different it might be possible to build this cleanly. |
(although of course I don't know enough to really comment, since I'm not the one building this thing :)) |
Thanks @mrocklin, @shoyer, @FrancescAlted for all the input. Here's what I've done. The The Also you can now provide separate stores for metadata and chunks. E.g.:

    >>> import zarr
    >>> store = dict()
    >>> chunk_store = dict()
    >>> g = zarr.group(store, chunk_store=chunk_store)
    >>> a = g.create_dataset('foo/bar', shape=(20, 20), chunks=(10, 10))
    >>> a[:] = 42
    >>> sorted(store.keys())
    ['.zattrs', '.zgroup', 'foo/.zattrs', 'foo/.zgroup', 'foo/bar/.zarray', 'foo/bar/.zattrs']
    >>> sorted(chunk_store.keys())
    ['foo/bar/0.0', 'foo/bar/0.1', 'foo/bar/1.0', 'foo/bar/1.1']

If @shoyer yes there could be problems with JSON. I've added a test for using np.nan as a fill_value and that seems to work without any special handling, but other fill values could cause problems. I propose to stick with JSON for version 2 and see how things work out, and consider alternatives for version 3.

@FrancescAlted I completely agree about naming. Especially I don't like "chunks" as a parameter name either, it's not intuitive at all. I'm trying to strike a balance between familiarity for users (like me) using zarr as a replacement for h5py, and making things clearer where possible. I've left the "chunks" parameter as-is for consistency with h5py, and also because it's used in a similar way in dask. The Very cool that work is going ahead to put PyTables on top of h5py. Would be good to keep in touch as that develops, to see if there would be anything that could be done in Zarr to facilitate usage as a storage backend. |
PyTables on top of Zarr would be very cool. In my world PyTables on top of On Fri, Aug 26, 2016 at 11:15 AM, Alistair Miles [email protected]
NaN works with Python's JSON module, but there are many libraries that will
On Friday, August 26, 2016, Stephan Hoyer [email protected] wrote:
Thanks Stephan, I've changed this in implementation and spec.
Alistair Miles |
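The NaN concern can be seen with the standard library alone. By default Python's json module emits the bare token NaN, which is not valid JSON, and it refuses to do so when asked to be strict. The sketch below also shows one possible workaround, encoding non-finite fill values as strings. The function names and the exact string spellings are assumptions of this sketch; the thread only states that the encoding was changed in the implementation and spec.

```python
import json
import math

# Default behaviour: a bare NaN token, which strict JSON parsers reject.
lenient = json.dumps({'fill_value': float('nan')})

# Strict behaviour: Python itself refuses to serialise NaN.
try:
    json.dumps({'fill_value': float('nan')}, allow_nan=False)
    strict_ok = True
except ValueError:
    strict_ok = False


def encode_fill_value(value):
    """Replace non-finite floats with strings (spellings assumed for this sketch)."""
    if isinstance(value, float):
        if math.isnan(value):
            return 'NaN'
        if math.isinf(value):
            return 'Infinity' if value > 0 else '-Infinity'
    return value


def decode_fill_value(value):
    """Inverse of encode_fill_value; other values pass through unchanged."""
    if isinstance(value, str) and value in ('NaN', 'Infinity', '-Infinity'):
        return float(value)
    return value


encoded = json.dumps({'fill_value': encode_fill_value(float('nan'))},
                     allow_nan=False)
decoded = decode_fill_value(json.loads(encoded)['fill_value'])
neg_inf = decode_fill_value(encode_fill_value(float('-inf')))
```

One caveat of this kind of scheme is that a string-typed array whose fill value is literally 'NaN' becomes ambiguous, so a real implementation would need to consult the dtype when decoding.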
Probably it's worth adding |
Simple data types are encoded within the array metadata resource as a string,
following the `NumPy array protocol type string (typestr) format
<http://docs.scipy.org/doc/numpy/reference/arrays.interface.html>`_. The format |
For completeness, I suggest explicitly writing out the list of valid type codes and a brief description. I think that would be b, i, u, f, c, S and U? I'm not 100% sure it's worth including b or U because both are highly inefficient, though I guess compression should alleviate that somewhat.
Good idea, I've expanded the section a bit to include this.
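For reference, the seven type codes suggested above can be written out as a small table, together with a parser for typestr values such as "<f8". This is a standard-library sketch for illustration, not the text that was added to the spec. The descriptions follow the NumPy array interface documentation, and `TYPE_CODES` and `parse_typestr` are hypothetical names.

```python
import re

# Type codes proposed in the review comment, with brief descriptions.
TYPE_CODES = {
    'b': 'boolean',
    'i': 'signed integer',
    'u': 'unsigned integer',
    'f': 'floating point',
    'c': 'complex floating point',
    'S': 'fixed-length byte string',
    'U': 'fixed-length Unicode string',
}

# Byte order ('<' little, '>' big, '|' not applicable), type code, size.
_TYPESTR = re.compile(r'^([<>|])([biufcSU])(\d+)$')


def parse_typestr(typestr):
    """Split a typestr into (byteorder, code, size).

    For most codes the size is a byte count; for 'U', NumPy dtype strings
    count characters rather than bytes.
    """
    match = _TYPESTR.match(typestr)
    if match is None:
        raise ValueError('unsupported typestr: %r' % (typestr,))
    byteorder, code, size = match.groups()
    return byteorder, code, int(size)


float64 = parse_typestr('<f8')
bytes10 = parse_typestr('|S10')
description = TYPE_CODES[float64[1]]
```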
On Friday, 26 August 2016, Stephan Hoyer [email protected] wrote:
In docs/spec/v2.rst
https://github.com/alimanfoo/zarr/pull/37#discussion_r76455077:
"dtype": "<f8",
"fill_value": null,
"order": "C",
"shape": [
10000,
10000
],
"zarr_format": 2
- }
+Data type encoding
+~~~~~~~~~~~~~~~~~~
+
+Simple data types are encoded within the array metadata resource as a string,
+following the `NumPy array protocol type string (typestr) format
+<http://docs.scipy.org/doc/numpy/reference/arrays.interface.html>`_. The format

For completeness, I suggest explicitly writing out the list of valid type
codes and a brief description. I think that would be b, i, u, f, c, S and
U? I'm not 100% sure it's worth including b or U because both are highly
inefficient, though I guess compression should alleviate that somewhat.
Alistair Miles
Head of Epidemiological Informatics
Centre for Genomics and Global Health http://cggh.org
The Wellcome Trust Centre for Human Genetics
Added encoding for infinities. On Friday, 26 August 2016, Stephan Hoyer [email protected] wrote:
Alistair Miles |
Merging tomorrow if no further comments. |
This PR has an initial proof of concept implementing hierarchical grouping of arrays (#26).
TODO: