Hierarchical storage #37

Merged (56 commits merged into master on Aug 30, 2016)

Conversation

@alimanfoo (Member) commented Jul 25, 2016

This PR has an initial proof of concept implementing hierarchical grouping of arrays (#26).

TODO:

  • Unit tests
  • ZipStore
  • Docstrings
  • API docs
  • Tutorial

@alimanfoo changed the title from "WIP hierarchical storage" to "Hierarchical storage WIP" on Jul 25, 2016
@alimanfoo (Member Author)

@shoyer, @mrocklin, this PR has some initial proof-of-concept work on hierarchical organisation of Zarr arrays via groups. I'm not in any hurry, but if you get a moment I'd greatly appreciate any comments.

Mostly undocumented as yet, except for a small tutorial section.

I've tried to keep a layer of abstraction between the underlying storage and the grouping implementation, so that different storage systems could be used.

storage.py has been modified to provide underlying storage support via the MemoryStore and DirectoryStore classes. These both implement a new HierarchicalStore interface which defines the API for hierarchical storage. I think the same interface could also be implemented for Zip files, S3, etc.

hierarchy.py is a new module implementing the Group class plus some convenience functions.

def __getitem__(self, key):
    names = [s for s in key.split('/') if s]
    if not names:
        raise ValueError(key)
A Contributor commented on this code:

KeyError instead?

@mrocklin (Contributor)

As with last time, I'm curious how far we could go with just providing a MutableMapping and nothing else. If we can reduce the idea of hierarchical storage to just a convention on top of dict-like objects, then I think that extensibility and code maintenance could both be pretty cool.

@alimanfoo (Member Author)

Thanks @mrocklin. I just pushed an initial implementation of a ZipStore supporting hierarchical storage, just to figure out how that could work. It's a bit awkward as you can't really create a directory within a Zip file, but it works.

I also cut down the HierarchicalStorage API to the bare minimum. If you can think of some way that this could all be done via the MutableMapping interface, I'd be happy to go down that route; I just couldn't think of an obvious way to do it.

@alimanfoo (Member Author)

@mrocklin I think I can see how to do all of this with stores exposing just the MutableMapping interface; it will be much cleaner. Nice idea.

@mrocklin (Contributor)

Hooray!

@alimanfoo (Member Author)

To make hierarchies possible, I think I need the following capabilities from a store implementation.

I propose to use keys containing forward slashes ("/") to read from and write to resources at different levels of the storage hierarchy. E.g., the store should support operations like store['foo/bar'] to read and store['foo/bar'] = b'baz' to write. It should also be possible to assign a value to the key 'foo/bar' without having to assign anything to the key 'foo' previously.

The underlying implementation of the store does not need to actually use any kind of hierarchy, it can treat these keys as just strings if it wants to, although some implementations will make use of the hierarchy. E.g., a store using a dict or a zip file or S3 as the underlying container might just treat 'foo/bar' as a string key, but a store using directories on the file system would map the key 'foo/bar' to a file named 'bar' inside a folder named 'foo'.

The only other thing I think I need is the ability to delete everything under some part of the storage hierarchy. I need this to be able to delete members from a group. I also need this when initialising a new array that is supposed to overwrite an old array in the same storage location. To get this functionality I propose that the store should handle keys with a trailing slash in a special way. E.g., a call to del store['foo/'] should delete all items from the store whose key begins with the prefix 'foo/'.

Treating a key with a trailing slash as a prefix is a similar idea to how S3 implements "folders" as essentially prefixes on resource names.

It might also be helpful if a store had special handling for prefix keys in __getitem__ so that it returned a listing of "folder contents". E.g., if a store contained values for the keys 'foo/bar' and 'foo/baz/quux', a call to store['foo/'] would return the list ['bar', 'baz']. I don't think I strictly need this, but it might be handy. I'm trying to figure out the best way for a group to be able to list its members, and one option is to inspect the contents of the store at a given level in the storage hierarchy. Another option is to store a list of members in a JSON metadata resource, which might be more efficient in some cases, but has the downside that the metadata could get out of sync with the store, which I'd like to avoid as a possibility.
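To make this concrete, here's a minimal sketch (not code from this PR; the class and method names are just illustrative) of a plain dict-backed store following these conventions: slash-separated keys, prefix deletion via a trailing slash, and listing of "folder contents":

from collections.abc import MutableMapping

class DictStore(MutableMapping):

    def __init__(self):
        self._d = dict()

    def __getitem__(self, key):
        return self._d[key]

    def __setitem__(self, key, value):
        # no need to create 'foo' before assigning to 'foo/bar'
        self._d[key] = value

    def __delitem__(self, key):
        if key.endswith('/'):
            # treat the key as a prefix and delete everything below it
            for k in [k for k in self._d if k.startswith(key)]:
                del self._d[k]
        else:
            del self._d[key]

    def __iter__(self):
        return iter(self._d)

    def __len__(self):
        return len(self._d)

    def listdir(self, prefix=''):
        # list the immediate children under a prefix, e.g. with keys
        # 'foo/bar' and 'foo/baz/quux', listdir('foo/') -> ['bar', 'baz']
        children = {k[len(prefix):].split('/')[0]
                    for k in self._d if k.startswith(prefix)}
        return sorted(children)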

@mrocklin any comments?

@shoyer (Contributor) commented Jul 27, 2016

The only other thing I think I need is the ability to delete everything under some part of the storage hierarchy.

Once you have a way to list "folder contents", you could also just delete things by calling del in a loop (though it might be very slow).

It might also be helpful if a store had special handling for prefix keys in __getitem__ so that it returned a listing of "folder contents".

Rather than overloading __getitem__, I would suggest defining another method for this, e.g., store.ls('foo/'). But at this point, the mapping interface isn't really enough, so you should call this something else.

So the other option is to explicitly store hierarchical lookups in JSON/dicts. But that's not such a clean solution for stores already backed by a directory-like structure.

@mrocklin (Contributor)

I like the idea of using MutableMapping plus a few extra methods if they're around, and, if not, then reverting back to slower operations that match the MutableMapping API.

Alternatively maybe we can flush out group metadata in foo/bar/.meta to give us some information to accelerate certain operations?

@alimanfoo (Member Author)

Thanks @shoyer, @mrocklin, I think I'm going to try an implementation that uses MutableMapping plus a couple of extra "hierarchy-aware" methods (ls, rm) if available but falling back to just MutableMapping and iterating through all keys if those extra methods are not available.
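For illustration, a rough sketch of that fallback logic (the helper functions here are hypothetical, not part of this PR; only the ls/rm method names follow the convention above):

def rm_prefix(store, prefix):
    # use a hierarchy-aware delete if the store provides one
    if hasattr(store, 'rm'):
        store.rm(prefix)
    else:
        # slower generic fallback via the MutableMapping interface
        for key in [k for k in store if k.startswith(prefix)]:
            del store[key]

def ls_prefix(store, prefix):
    # use a hierarchy-aware listing if the store provides one
    if hasattr(store, 'ls'):
        return store.ls(prefix)
    # slower generic fallback: derive immediate children from the full key listing
    children = {key[len(prefix):].split('/')[0]
                for key in store if key.startswith(prefix)}
    return sorted(children)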

I previously toyed with the idea of putting group metadata in a JSON resource like foo/bar/.meta, which would include a list of the names of the group's members. This would mean there is no need for the store to list "folder contents". However, I'm deciding against this idea because (1) the metadata can get out of sync with the storage if someone modifies the storage directly, and (2) if using a zip file and adding multiple members to the same group, this foo/bar/.meta resource would get "added" to the archive multiple times. Adding multiple files with the same name to a zip file is possible in Python but generates a UserWarning, which is not very pretty. Also, there appears to be an open issue about this which suggests the behaviour might change in future.

@mrocklin (Contributor)

More generally, what about hybrid storage systems for metadata/data? I suppose this could be implemented pretty easily with a custom Storage class, but it's pretty easy to imagine wanting to store group and array metadata separately (in memory or an actual database) from chunks (which for most applications can be stored cheaply on disk). This is the more general case of @FrancescAlted's node cache. It might be worth at least prototyping out an example to make sure it works smoothly.

I like this idea. It could be handled by having two Mappings that could, in some cases, be the same.

meta = dict()
data = dict()

x = group(meta=meta, data=data)

or

data = dict()

x = group(meta=data, data=data)

As long as you keep key spaces different it might be possible to build this cleanly.

@mrocklin (Contributor)

(although of course I don't know enough to really comment, since I'm not the one building this thing :))

@alimanfoo (Member Author)

Thanks @mrocklin, @shoyer, @FrancescAlted for all the input. Here's what I've done.

The Group class now extends collections.Mapping. Implementations of keys(), values() and items() have been removed, as these are provided by the superclass. Also, the __setitem__ definition has been removed, which means Python now automatically raises a TypeError if someone tries to do item assignment.
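A tiny standalone demonstration (not zarr code) of the behaviour described above, using collections.abc.Mapping (the modern spelling of collections.Mapping): keys(), values() and items() come for free, and item assignment raises TypeError because __setitem__ is undefined:

from collections.abc import Mapping

class Demo(Mapping):

    def __init__(self, d):
        self._d = d

    def __getitem__(self, key):
        return self._d[key]

    def __iter__(self):
        return iter(self._d)

    def __len__(self):
        return len(self._d)

d = Demo({'foo': 1})
print(list(d.items()))  # [('foo', 1)], provided by the Mapping superclass
try:
    d['bar'] = 2
except TypeError as e:
    print(e)  # 'Demo' object does not support item assignment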

The Group.create_dataset() method still accepts additional keyword arguments to ease the transition for users coming from h5py, but a user warning is generated and the arguments are ignored. This seemed more straightforward.

Also you can now provide separate stores for metadata and chunks. E.g.:

>>> import zarr
>>> store = dict()
>>> chunk_store = dict()
>>> g = zarr.group(store, chunk_store=chunk_store)
>>> a = g.create_dataset('foo/bar', shape=(20, 20), chunks=(10, 10))
>>> a[:] = 42
>>> sorted(store.keys())
['.zattrs', '.zgroup', 'foo/.zattrs', 'foo/.zgroup', 'foo/bar/.zarray', 'foo/bar/.zattrs']
>>> sorted(chunk_store.keys())
['foo/bar/0.0', 'foo/bar/0.1', 'foo/bar/1.0', 'foo/bar/1.1']

If chunk_store is not provided, chunks and metadata will be stored together in store as before.

@shoyer yes there could be problems with JSON. I've added a test for using np.nan as a fill_value and that seems to work without any special handling, but other fill values could cause problems. I propose to stick with JSON for version 2 and see how things work out, and consider alternatives for version 3.

@FrancescAlted I completely agree about naming. In particular, I don't like "chunks" as a parameter name either; it's not intuitive at all. I'm trying to strike a balance between familiarity for users (like me) using zarr as a replacement for h5py, and making things clearer where possible. I've left the "chunks" parameter as-is for consistency with h5py, and also because it's used in a similar way in dask. The Group class has methods "create_dataset" and "require_dataset" for consistency with h5py, but also has all the array creation methods like "empty", "zeros", "ones", etc., which should be more convenient in most cases.

Very cool that work is going ahead to put PyTables on top of h5py. Would be good to keep in touch as that develops, to see if there would be anything that could be done in Zarr to facilitate usage as a storage backend.

@mrocklin (Contributor)

PyTables on top of Zarr would be very cool. In my world, PyTables on top of Zarr on top of S3 would be an interesting migration path for a large body of users towards partitioned, non-filesystem storage systems.


@shoyer (Contributor) commented Aug 26, 2016

NaN works with Python's JSON module, but there are many libraries that will fail to parse a JSON file containing it. We ran into exactly this issue with an array store we wrote at my former employer. I definitely recommend using the string "NaN" instead, and I think it's worth encoding that in the spec.

@alimanfoo (Member Author)

Thanks Stephan, I've changed this in the implementation and the spec.


@shoyer (Contributor) commented Aug 26, 2016

Probably it's worth adding "Infinity" and "-Infinity" alongside NaN, since those are also valid floats that aren't valid JSON.


[Excerpt from docs/spec/v2.rst under review]

Simple data types are encoded within the array metadata resource as a string,
following the `NumPy array protocol type string (typestr) format
<http://docs.scipy.org/doc/numpy/reference/arrays.interface.html>`_. The format
@shoyer (Contributor) commented:

For completeness, I suggest explicitly writing out the list of valid type codes and a brief description. I think that would be b, i, u, f, c, S and U? I'm not 100% sure it's worth including b or U because both are highly inefficient, though I guess compression should alleviate that somewhat.
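For reference, the typestr codes being discussed look like this (a quick NumPy illustration, not part of this PR; output shown for a little-endian machine):

import numpy as np

print(np.dtype('f8').str)   # '<f8'  : little-endian 8-byte float
print(np.dtype('i4').str)   # '<i4'  : little-endian 4-byte signed integer
print(np.dtype('u2').str)   # '<u2'  : little-endian 2-byte unsigned integer
print(np.dtype('S10').str)  # '|S10' : 10-byte ASCII string
print(np.dtype('U5').str)   # '<U5'  : 5-character Unicode string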

@alimanfoo (Member Author) replied:

Good idea, I've expanded the section a bit to include this.


@alimanfoo (Member Author)

Added encoding for infinities.
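For illustration, a minimal sketch of that kind of encoding/decoding (the helper names here are hypothetical, not zarr's actual API):

import json
import math

def encode_fill_value(v):
    # encode non-JSON floats as the strings agreed in the discussion above
    if isinstance(v, float):
        if math.isnan(v):
            return 'NaN'
        if math.isinf(v):
            return 'Infinity' if v > 0 else '-Infinity'
    return v

def decode_fill_value(v):
    mapping = {'NaN': float('nan'), 'Infinity': float('inf'), '-Infinity': float('-inf')}
    return mapping.get(v, v) if isinstance(v, str) else v

meta = {'fill_value': encode_fill_value(float('-inf'))}
print(json.dumps(meta))  # {"fill_value": "-Infinity"}, parseable by strict JSON parsers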


@alimanfoo (Member Author)

Merging tomorrow if no further comments.

@alimanfoo merged commit cd10a57 into master on Aug 30, 2016
@alimanfoo deleted the hierarchy branch on August 30, 2016 at 14:41
@alimanfoo modified the milestone: v2.0 on Aug 31, 2016