Suppressing ZipFile duplication warning #129
Proposed fix in PR ( https://github.com/alimanfoo/zarr/pull/130 ). |
FWIW I think this needs some consideration. If duplicate files are being written into a zip file, and this is happening often, then it is likely that something rather sub-optimal is happening. In the pathological case, a user could be storing many multiples of the actual data for an array without realising, then wonder why the zip file is so large.

Writing directly to a zip file is really only efficient if the array or arrays being stored in the zip store are written only once, and write operations can be perfectly aligned with chunk boundaries, in which case no duplicate chunk files will ever get created. This can be achieved if an array is created with

If the use case requires that data are written and then overwritten, and/or that write operations cannot be aligned with chunk boundaries, then a better approach is probably to initially store the data using |
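A minimal sketch of the single-pass, chunk-aligned pattern described above, assuming the zarr v2 `ZipStore` and `create` APIs that were current at the time (filename, shape, and chunking are made up for illustration):

```python
import numpy as np
import zarr

data = np.arange(100_000, dtype="i4").reshape(1000, 100)

# Open a zip-backed store for writing, create the array once, and write the
# whole thing in a single chunk-aligned assignment, so each chunk is stored
# exactly once and no duplicate entries end up in the zip.
store = zarr.ZipStore("data.zip", mode="w")
z = zarr.create(shape=data.shape, chunks=(100, 100), dtype=data.dtype, store=store)
z[:] = data
store.close()  # flush the zip central directory
```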
It's a fair point honestly. Though I really do like the idea of operating on a single file that includes all of the array data. Zip is nice as it is easy to inspect. Though I don't feel wedded to it given the issues I'm already experiencing by playing with it. Is there some other reasonable storage type that we could add to Zarr that wouldn't have these limitations? |
I don't know of anything better. Tar is worse apparently as it doesn't support random access. It would be possible (if a little twisted) to use an HDF5 file, however you'd lose the ability to do multi-threaded reads (which seem to work on a zip store surprisingly). cc @mrocklin. |
When I looked into this a long while ago I found that yes, there are other single-file compression formats out there that support random access, but none seemed common place. Generally speaking writing variable sized byte blocks into a single file is a hard problem. Another alternative would be an embedded key-value database. Zict has a MutableMapping for LMDB. https://github.com/dask/zict/blob/master/zict/lmdb.py This would be a single directory rather than a single file, but balances large writes and many small writes well. |
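Something along these lines might work — an untested sketch, assuming zict's `LMDB` mapping (string keys to bytes values) can be handed straight to zarr as a store; the directory name is made up:

```python
import numpy as np
import zarr
from zict import LMDB

# zict.LMDB is a MutableMapping backed by a single LMDB environment (a small
# directory on disk), so it can stand in wherever zarr expects a store.
store = LMDB("example.lmdb")
z = zarr.create(shape=(1000, 100), chunks=(100, 100), dtype="i4", store=store)
z[:] = np.arange(100_000, dtype="i4").reshape(1000, 100)
```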
Maybe shelve is an option? It supports the MutableMapping interface so you could probably just use a shelf directly as the store. |
Ha, @mrocklin you get much kudos for advocating the MutableMapping interface...
|
Hooray standard interfaces! |
Looks like shelve supports multi-threaded reads...
|
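Roughly what using shelve as a store might look like — an untested sketch, assuming a `shelve.Shelf` (a MutableMapping of string keys to pickled values) is acceptable to zarr; the filename is made up:

```python
import shelve

import numpy as np
import zarr

# A Shelf behaves like a dict persisted to disk, so it can be handed to zarr
# directly as a store; chunk bytes and metadata get pickled into the shelf.
with shelve.open("example_store") as store:
    z = zarr.create(shape=(1000, 100), chunks=(100, 100), dtype="i4", store=store)
    z[:] = np.arange(100_000, dtype="i4").reshape(1000, 100)
```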
A BerkeleyDB hash table would probably be another option. |
The fact that Zarr is using a MutableMapping seems like a very useful thing. Not that I have looked into this at all, but I wonder if there are any Key-Value Stores that would work well here. |
See note above about LMDB, for which there is a MutableMapping in zict
|
Yes, any key-value store should be an option. |
Thanks for the feedback. I'll give this some more thought. |
Kyoto cabinet could be another option, looks like the Python bindings provide a MutableMapping interface. A nice feature of some of these key-value databases is support for transactions.
|
I added a little bit of Python code to zip up the directories after they are written to in such a way as to ensure Zarr can still load them. This is a good enough near term solution for my needs. Would be willing to contribute the utility function or perhaps add another store if there is interest. |
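Presumably something along these lines — a hypothetical sketch of such a utility (the function name and paths are made up), writing each file with an archive name relative to the store root so zarr's keys are preserved:

```python
import os
import zipfile

def directory_to_zip(dir_path, zip_path):
    """Pack a zarr DirectoryStore's directory into a zip file that zarr can read."""
    with zipfile.ZipFile(zip_path, mode="w", compression=zipfile.ZIP_STORED) as zf:
        for root, _, files in os.walk(dir_path):
            for name in files:
                full = os.path.join(root, name)
                # Archive names must be relative to the store root and use
                # forward slashes so they match zarr's chunk/metadata keys.
                arcname = os.path.relpath(full, start=dir_path).replace(os.sep, "/")
                zf.write(full, arcname=arcname)

# Hypothetical usage:
# directory_to_zip("data.zarr", "data.zip")
# z = zarr.open(zarr.ZipStore("data.zip", mode="r"))  # assuming zarr v2's ZipStore
```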
Maybe this could be a method on the DirectoryStore class? Called to_zipfile() or archive() or something like that?
|
Opened issue ( https://github.com/alimanfoo/zarr/issues/137 ) to keep track of this idea. |
Forgot to mention that create_group, create_dataset, and open_group will add an empty .zattrs entry to start with. Thus if the attributes need to be set or modified afterwards, this will create duplicate .zattrs entries in a Zip file. Have raised issue ( https://github.com/alimanfoo/zarr/issues/121 ) to allow attrs to be specified in these creation functions. |
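A small sketch of the behaviour being described, assuming the zarr version current at the time (which wrote an empty .zattrs on group creation); the filename is made up:

```python
import zarr

store = zarr.ZipStore("attrs_example.zip", mode="w")
root = zarr.group(store=store)                 # writes an empty .zattrs entry
root.attrs["description"] = "set afterwards"   # rewrites .zattrs -> duplicate zip entry
store.close()
```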
Thanks, sounds good.
|
Python's ZipFile allows writing duplicate files and does this by default. When it writes a duplicate file, it raises a UserWarning. This occurs for each file and is a bit noisy. As there doesn't seem to be a standard way of solving this, I would recommend that we simply suppress this warning. Combining this with a resolution to issue ( https://github.com/alimanfoo/zarr/issues/128 ) would ensure that deduplication happens anyway, so the warning is no longer relevant.

Edit: Added link to Python bug after the fact.
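For reference, suppressing the warning around write calls could look something like this — a minimal sketch, assuming the warning's message text begins with "Duplicate name" as in CPython's zipfile module:

```python
import warnings
import zipfile

with warnings.catch_warnings():
    # Silence only the duplicate-name UserWarning emitted by ZipFile.
    warnings.filterwarnings("ignore", message="Duplicate name:", category=UserWarning)
    with zipfile.ZipFile("example.zip", mode="w") as zf:
        zf.writestr("foo/.zattrs", "{}")
        zf.writestr("foo/.zattrs", '{"units": "m"}')  # would normally warn
```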