Suppressing ZipFile duplication warning #129
Proposed fix in PR ( https://github.com/alimanfoo/zarr/pull/130 ). |
FWIW I think this needs some consideration. If duplicate files are being written into a zip file, and this is happening often, then it is likely that something rather sub-optimal is happening. In the pathological case, a user could be storing many multiples of the actual data for an array without realising, then wonder why the zip file is so large.

Writing directly to a zip file is really only efficient if the array or arrays being stored in the zip store are written only once, and write operations can be perfectly aligned with chunk boundaries, in which case no duplicate chunk files will ever get created. This can be achieved if an array is created with

If the use case requires that data are written and then overwritten, and/or that write operations cannot be aligned with chunk boundaries, then a better approach is probably to initially store the data using |
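A minimal sketch of the single-pass, chunk-aligned pattern described above, assuming the zarr v2 `ZipStore` and `create` APIs that were current at the time (filename, shape, and chunking are made up for illustration):

```python
import numpy as np
import zarr

data = np.arange(100_000, dtype="i4").reshape(1000, 100)

# Open a zip-backed store for writing, create the array once, and write the
# whole thing in a single chunk-aligned assignment, so each chunk is stored
# exactly once and no duplicate entries end up in the zip.
store = zarr.ZipStore("data.zip", mode="w")
z = zarr.create(shape=data.shape, chunks=(100, 100), dtype=data.dtype, store=store)
z[:] = data
store.close()  # flush the zip central directory
```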
It's a fair point honestly. Though I really do like the idea of operating on a single file that includes all of the array data. Zip is nice as it is easy to inspect. Though I don't feel wedded to it given the issues I'm already experiencing by playing with it. Is there some other reasonable storage type that we could add to Zarr that wouldn't have these limitations? |
I don't know of anything better. Tar is worse apparently as it doesn't support random access. It would be possible (if a little twisted) to use an HDF5 file, however you'd lose the ability to do multi-threaded reads (which seem to work on a zip store surprisingly). cc @mrocklin. |
When I looked into this a long while ago I found that yes, there are other single-file compression formats out there that support random access, but none seemed common place. Generally speaking writing variable sized byte blocks into a single file is a hard problem. Another alternative would be an embedded key-value database. Zict has a MutableMapping for LMDB. https://github.com/dask/zict/blob/master/zict/lmdb.py This would be a single directory rather than a single file, but balances large writes and many small writes well. |
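Something along these lines might work — an untested sketch, assuming zict's `LMDB` mapping (string keys to bytes values) can be handed straight to zarr as a store; the directory name is made up:

```python
import numpy as np
import zarr
from zict import LMDB

# zict.LMDB is a MutableMapping backed by a single LMDB environment (a small
# directory on disk), so it can stand in wherever zarr expects a store.
store = LMDB("example.lmdb")
z = zarr.create(shape=(1000, 100), chunks=(100, 100), dtype="i4", store=store)
z[:] = np.arange(100_000, dtype="i4").reshape(1000, 100)
```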
Maybe shelve is an option? It supports the MutableMapping interface so you could probably just use a shelf directly as the store. |
Ha, @mrocklin you get much kudos for advocating the MutableMapping interface...
|
Hooray standard interfaces! |
Looks like shelve supports multi-threaded reads...
|
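Roughly what using shelve as a store might look like — an untested sketch, assuming a `shelve.Shelf` (a MutableMapping of string keys to pickled values) is acceptable to zarr; the filename is made up:

```python
import shelve

import numpy as np
import zarr

# A Shelf behaves like a dict persisted to disk, so it can be handed to zarr
# directly as a store; chunk bytes and metadata get pickled into the shelf.
with shelve.open("example_store") as store:
    z = zarr.create(shape=(1000, 100), chunks=(100, 100), dtype="i4", store=store)
    z[:] = np.arange(100_000, dtype="i4").reshape(1000, 100)
```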
A BerkeleyDB hash table would probably be another option. |
The fact that Zarr is using a MutableMapping seems like a very useful thing. Not that I have looked into this at all, but I wonder if there are any Key-Value Stores that would work well here. |
See note above about LMDB, for which there is a MutableMapping in zict
|
Yes, any key-value store should be an option. |
Thanks for the feedback. I'll give this some more thought. |
Kyoto cabinet could be another option, looks like the Python bindings provide a MutableMapping interface. A nice feature of some of these key-value databases is support for transactions.
|
I added a little bit of Python code to zip up the directories after they are written to in such a way as to ensure Zarr can still load them. This is a good enough near term solution for my needs. Would be willing to contribute the utility function or perhaps add another store if there is interest. |
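Presumably something along these lines — a hypothetical sketch of such a utility (the function name and paths are made up), writing each file with an archive name relative to the store root so zarr's keys are preserved:

```python
import os
import zipfile

def directory_to_zip(dir_path, zip_path):
    """Pack a zarr DirectoryStore's directory into a zip file that zarr can read."""
    with zipfile.ZipFile(zip_path, mode="w", compression=zipfile.ZIP_STORED) as zf:
        for root, _, files in os.walk(dir_path):
            for name in files:
                full = os.path.join(root, name)
                # Archive names must be relative to the store root and use
                # forward slashes so they match zarr's chunk/metadata keys.
                arcname = os.path.relpath(full, start=dir_path).replace(os.sep, "/")
                zf.write(full, arcname=arcname)

# Hypothetical usage:
# directory_to_zip("data.zarr", "data.zip")
# z = zarr.open(zarr.ZipStore("data.zip", mode="r"))  # assuming zarr v2's ZipStore
```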
Maybe this could be a method on the DirectoryStore class? Called to_zipfile() or archive() or something like that?
|
Opened issue ( https://github.com/alimanfoo/zarr/issues/137 ) to keep track of this idea. |
Forgot to mention that create_group, create_dataset, and open_group will add an empty .zattrs entry to start with. Thus if the attributes need to be set or modified afterwards, this will create duplicate .zattrs entries in a Zip file. Have raised issue ( https://github.com/alimanfoo/zarr/issues/121 ) to allow attrs to be specified in these creation functions. |
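A small sketch of the behaviour being described, assuming the zarr version current at the time (which wrote an empty .zattrs on group creation); the filename is made up:

```python
import zarr

store = zarr.ZipStore("attrs_example.zip", mode="w")
root = zarr.group(store=store)                 # writes an empty .zattrs entry
root.attrs["description"] = "set afterwards"   # rewrites .zattrs -> duplicate zip entry
store.close()
```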
Thanks, sounds good.
|
Python's ZipFile allows writing duplicate files and does this by default. When it writes a duplicate file, it raises a UserWarning. This occurs for each file and is a bit noisy. As there doesn't seem to be a standard way of solving this, I would recommend that we simply suppress this warning. Combining this with a resolution to issue ( https://github.com/alimanfoo/zarr/issues/128 ) would ensure that deduplication happens anyway, so the warning is no longer relevant.

Edit: Added link to Python bug after the fact.
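For reference, suppressing the warning around write calls could look something like this — a minimal sketch, assuming the warning's message text begins with "Duplicate name" as in CPython's zipfile module:

```python
import warnings
import zipfile

with warnings.catch_warnings():
    # Silence only the duplicate-name UserWarning emitted by ZipFile.
    warnings.filterwarnings("ignore", message="Duplicate name:", category=UserWarning)
    with zipfile.ZipFile("example.zip", mode="w") as zf:
        zf.writestr("foo/.zattrs", "{}")
        zf.writestr("foo/.zattrs", '{"units": "m"}')  # would normally warn
```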