Add SQLiteStore #368

jakirkham · 2018-12-21T04:13:47Z

Fixes #365

Adds an SQLite-backed `MutableMapping` store. It treats the underlying SQLite database as a key-value store and exposes it to users as one. This should provide a very portable and reliable storage format. It should also be a useful template for users who want to implement a store for their own SQL database.
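To make the idea concrete, here is a minimal sketch of a `MutableMapping` over a single SQLite table. The class name `TinySQLiteStore`, the table name `kv`, and its column names are all hypothetical and chosen for illustration; this is not the `SQLiteStore` code from this PR.

```python
import sqlite3
from collections.abc import MutableMapping


class TinySQLiteStore(MutableMapping):
    """Hypothetical key-value store backed by one SQLite table."""

    def __init__(self, path=":memory:"):
        # isolation_level=None puts the sqlite3 module in autocommit mode.
        self.db = sqlite3.connect(path, isolation_level=None)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v BLOB)"
        )

    def __getitem__(self, key):
        row = self.db.execute(
            "SELECT v FROM kv WHERE k = ?", (key,)
        ).fetchone()
        if row is None:
            raise KeyError(key)
        return row[0]

    def __setitem__(self, key, value):
        # REPLACE overwrites an existing row with the same primary key.
        self.db.execute("REPLACE INTO kv VALUES (?, ?)", (key, value))

    def __delitem__(self, key):
        # Reuse the membership test instead of repeating the lookup logic.
        if key not in self:
            raise KeyError(key)
        self.db.execute("DELETE FROM kv WHERE k = ?", (key,))

    def __contains__(self, key):
        row = self.db.execute(
            "SELECT 1 FROM kv WHERE k = ?", (key,)
        ).fetchone()
        return row is not None

    def __iter__(self):
        for (k,) in self.db.execute("SELECT k FROM kv"):
            yield k

    def __len__(self):
        return self.db.execute("SELECT COUNT(*) FROM kv").fetchone()[0]

    def close(self):
        self.db.close()
```

Because the class only relies on standard SQL statements, the same shape could plausibly be adapted to another SQL database by swapping the connection object.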

TODO:

Add unit tests and/or doctests in docstrings
Add docstrings and API docs for any new/modified user-facing classes and functions
New/modified features documented in docs/tutorial.rst
Changes documented in docs/release.rst
Docs build locally (e.g., run tox -e docs)
AppVeyor and Travis CI passes
Test coverage is 100% (Coveralls passes)

Implements a key-value store using SQLite. As `sqlite3` is a builtin module in Python and SQLite is a common database to use from various languages, this should have high utility and be very portable. Not to mention that many databases provide an SQL interface on top regardless of the internal representation. So this can be a great template for users wishing to work with Zarr in their preferred database.

Try using the `SQLiteStore` everywhere one would use another store and make sure that it behaves correctly. This includes simple key-value store usage, creating hierarchies, and storing arrays.

Provide a few examples of how one might use `SQLiteStore` to store arrays or groups. These examples are taken with minor modifications from the `LMDBStore` examples.

Includes a simple example borrowed from `LMDBStore`'s tutorial example, which shows how to create and use an `SQLiteStore`.

Otherwise we may end up opening a different database's files and trying to use them with SQLite, only to run into errors. This caused the doctests to fail previously. Changing the extension as we have done should avoid these conflicts.

Instead of opening, committing, and closing the SQLite database for every operation, limit these to user requested operations. Namely commit only when the user calls `flush`. Also close only when the user calls `close`. This should make operations with SQLite much more performant than when we automatically committed and closed after every user operation.

As users need to explicitly close the `SQLiteStore` to commit changes and serialize them to the SQLite database, make sure to point this out in the docs.

Appears some of these commands work without capitalization. However as the docs show commands as capitalized, ensure that we are doing the same thing as well. That way this won't run into issues with different SQL implementations or older versions of SQLite that are less forgiving. Plus this should match closer to what users familiar with SQL expect.

Make use of `in` instead of repeating the same logic in `__delitem__`. As we now keep the database open between operations, this is much simpler than duplicating the key check logic. Also makes it a bit easier to understand what is going on.

This was needed when the `import` of `sqlite3` was only here to ensure that it existed (even though it wasn't used). Now we make use of `sqlite3` where it is being imported. So there is no need to tell flake8 to not worry about the unused import as there isn't one.

Make sure that everything intended to be added to the `SQLiteStore` database has been written to disk before attempting to pickle it. That way we can be sure other processes constructing their own `SQLiteStore` have access to the same data and not some earlier representation.

No need to normalize the path when there isn't one (e.g. `:memory:`).
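A small sketch of that special case, assuming a hypothetical helper name `normalize_db_path` and using `os.path.abspath` as a stand-in for whatever normalization the store applies:

```python
import os


def normalize_db_path(path):
    """Return an absolute path, leaving SQLite's in-memory marker untouched."""
    if path == ":memory:":
        # Not a filesystem path, so there is nothing to normalize.
        return path
    return os.path.abspath(path)
```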

alimanfoo

Thanks, all looks good. Couple of tiny comments.

zarr/storage.py

docs/release.rst

alimanfoo · 2018-12-21T09:50:56Z

zarr/storage.py

+        kwargs.setdefault('timeout', 5.0)
+        kwargs.setdefault('detect_types', 0)
+        kwargs.setdefault('isolation_level', None)  # autocommit
+        kwargs.setdefault('check_same_thread', False)  # disallow writing from other threads


From the docs:

> By default, check_same_thread is True and only the creating thread may use the connection. If set False, the returned connection may be shared across multiple threads. When using multiple threads with the same connection writing operations should be serialized by the user to avoid data corruption.

Maybe we should set this to True. The docs say any multithreaded use needs to be fully synchronized by the user. There is no mechanism in zarr for doing this - the synchronizer classes only prevent writing the same chunk at the same time, but they allow different chunks to be written concurrently, which the sqlite docs suggest could still cause corruption.
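For reference, the effect of the flag can be seen with a short, self-contained experiment. The helper name `cross_thread_error` and the table layout are made up for this sketch:

```python
import sqlite3
import threading


def cross_thread_error(check_same_thread):
    """Try to write from a second thread; return the error raised, if any."""
    conn = sqlite3.connect(":memory:", check_same_thread=check_same_thread)
    conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v BLOB)")
    caught = []

    def write():
        try:
            conn.execute("INSERT INTO kv VALUES (?, ?)", ("a", b"1"))
        except sqlite3.ProgrammingError as exc:
            # Raised when the connection is used outside its creating thread.
            caught.append(exc)

    worker = threading.Thread(target=write)
    worker.start()
    worker.join()
    conn.close()
    return caught[0] if caught else None
```

With `check_same_thread=True` the helper returns a `sqlite3.ProgrammingError`; with `False` the write goes through, and keeping writes serialized becomes the caller's job.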

Yep, that was probably an oversight on my part. Have corrected this to True.

I think if we just make sure to flush after every write operation, we should be safe allowing this though.

After giving this some thought, it seems reasonable for us to always flush after mutating the database. For example other processes could be accessing the same database and we want to ensure they all see the same thing. This isn't too different from how we handle other stores.

Have pushed a few commits to tidy things up and implement this flushing with each mutation behavior. Also have added an update function to allow more efficient submission of multiple changes (so we only commit once).
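A rough sketch of the "commit once for many changes" idea, assuming an autocommit connection (`isolation_level=None`) and a hypothetical `kv` table. The function name `update_once` is illustrative and is not the `update` method added in this PR:

```python
import sqlite3


def update_once(db, items):
    """Write every item inside one explicit transaction (a single commit)."""
    # In autocommit mode the sqlite3 module does not open transactions
    # implicitly, so the boundaries are spelled out here.
    db.execute("BEGIN")
    try:
        db.executemany("REPLACE INTO kv VALUES (?, ?)", list(items.items()))
    except Exception:
        db.execute("ROLLBACK")
        raise
    db.execute("COMMIT")
```

Usage would look like `update_once(db, {"a": b"1", "b": b"2"})`, which writes both rows and commits once rather than once per key.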

Switched back to setting check_same_thread to False by default as we now always serialize data if a change is made.

Fix a typo. Co-Authored-By: jakirkham <[email protected]>

Include author and original issue in changelog entry. Co-Authored-By: jakirkham <[email protected]>

The default value for `check_same_thread` was previously set to `False` when in reality we want this check enabled. So set `check_same_thread` to `True`.

As users could change the setting of things like `check_same_thread` or they may try to access the same database from multiple threads or processes, make sure to flush any changes that would mutate the database.

As we now always commit after an operation that mutates the data, there is no need to commit before pickling the `SQLiteStore` object. After all the data should already be up-to-date in the database.

As everything should already be flushed to the database whenever the state is mutated, there is no need to perform this before closing.

Co-Authored-By: jakirkham <[email protected]>

jakirkham · 2019-01-03T02:35:28Z

Thanks for taking a look @alimanfoo. Have tried to fix the few nits and answer your questions. Please let me know if you have more thoughts. :)

Adds a simple check to ensure SQLite is new enough to enable thread-safe sharing of connections before setting `check_same_thread=False`. If SQLite is not new enough, keep `check_same_thread=True`.
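In Python's `sqlite3` module, `check_same_thread=False` is the setting that lets a connection be used from threads other than the one that created it. A version gate of this kind might look like the sketch below; the helper name and the `(3, 3, 1)` threshold are assumptions for illustration and are not stated in this conversation.

```python
import sqlite3


def default_check_same_thread(min_version=(3, 3, 1)):
    """Return the check_same_thread value to use for this SQLite build.

    The min_version default is an assumed threshold, not one quoted from the PR.
    """
    # Keep the same-thread check on (True) only for SQLite builds that are
    # older than the assumed threshold.
    return sqlite3.sqlite_version_info < min_version
```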

zarr/storage.py

As there are some concerns about keeping operations on the SQLite database sequential for thread-safety, acquire an internal lock when a DML operation occurs. This should ensure that only one modification can occur at a time regardless of whether the connection uses the serialized threading mode or not.
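The locking approach can be sketched roughly as follows. The module-level `db`, `lock`, and helper names are illustrative only and do not mirror the attributes used in `zarr/storage.py`:

```python
import sqlite3
import threading

# One shared autocommit connection that may be used from several threads.
db = sqlite3.connect(":memory:", isolation_level=None, check_same_thread=False)
db.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v BLOB)")
lock = threading.Lock()


def locked_set(key, value):
    # Only one thread at a time may run a statement that modifies data.
    with lock:
        db.execute("REPLACE INTO kv VALUES (?, ?)", (key, value))


def worker(n):
    for i in range(50):
        locked_set(f"{n}/{i}", b"x")


threads = [threading.Thread(target=worker, args=(n,)) for n in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

count = db.execute("SELECT COUNT(*) FROM kv").fetchone()[0]
```

Each of the four threads writes 50 distinct keys, so `count` ends up at 200 once all threads have joined.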

Uses all the same tests we use for on-disk `SQLiteStore`s, except that it special-cases the pickling test to ensure the `SQLiteStore` cannot be pickled if it is in-memory.
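One way such a restriction could be expressed is sketched below. The class `PicklableStore` is hypothetical, and the exception type used here (`pickle.PicklingError`) is an assumption rather than something confirmed by this thread:

```python
import pickle
import sqlite3


class PicklableStore:
    """Hypothetical store that pickles by path and refuses in-memory databases."""

    def __init__(self, path):
        self.path = path
        self.db = sqlite3.connect(path, isolation_level=None)

    def __getstate__(self):
        if self.path == ":memory:":
            # Another process could not reopen this database from a path.
            raise pickle.PicklingError("cannot pickle an in-memory SQLite store")
        # Only the path is pickled; the connection itself is not picklable.
        return {"path": self.path}

    def __setstate__(self, state):
        # Reconnect to the same file when unpickled.
        self.__init__(state["path"])
```

An in-memory database lives only inside the connection that created it, which is why there is nothing meaningful to hand to another process.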

Simply use the `Connection`'s default arguments implicitly instead of explicitly setting them in the constructor.

Make sure to inherit directly from `unittest.TestCase` as well.

alimanfoo

Thanks @jakirkham. I think it would be worth allowing isolation_level to be overridden by the user via kwargs into __init__. Otherwise looks good to go.

jakirkham · 2019-01-12T21:26:42Z

Not sure whether that is a good idea. If we do that, we will need to add explicit transactions in the code and add additional code to check whether transactions should or should not be used (spoiler: they can't be used with autocommit mode, but need to be used otherwise). Though I could be missing something.

ref: http://charlesleifer.com/blog/going-fast-with-sqlite-and-python/
ref: https://stackoverflow.com/q/15856976

alimanfoo · 2019-01-14T10:08:36Z

> Not sure whether that is a good idea. If we do that, we will need to add explicit transactions in the code and add additional code to check whether transactions should or should not be used (spoiler: they can't be used with autocommit mode, but need to be used otherwise).

I was really just thinking to allow someone who really knows what they're doing and wants to control transactions themselves to be able to do so. Exposing isolation_level as a parameter allows them to do so. If they set isolation_level to anything other than None (autocommit), then transaction management is entirely up to them. They can do that by accessing the .db and .cursor members of the store.

It is really just about leaving the door open for people to experiment with other values of isolation_level. I don't think there would be any need to change any other code within the store, the point is that the user/application might want to make their own decisions about where to place transaction boundaries. Although we could add some documentation like, "N.B., if you set an isolation_level other than None then you are responsible for beginning and committing transactions."
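The difference between the two modes can be observed directly through `Connection.in_transaction` in the standard library. This is a small illustration with a made-up `kv` table, not code from the store:

```python
import sqlite3

# isolation_level=None: autocommit, no transaction is opened implicitly.
auto = sqlite3.connect(":memory:", isolation_level=None)
auto.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v BLOB)")
auto.execute("INSERT INTO kv VALUES (?, ?)", ("a", b"1"))
auto_open = auto.in_transaction

# Any other isolation_level: sqlite3 opens a transaction before the INSERT,
# and it stays open until the caller commits or rolls back.
manual = sqlite3.connect(":memory:", isolation_level="DEFERRED")
manual.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v BLOB)")
manual.execute("INSERT INTO kv VALUES (?, ?)", ("a", b"1"))
manual_open = manual.in_transaction
manual.commit()
manual_open_after_commit = manual.in_transaction
```

Here `auto_open` is `False`, `manual_open` is `True`, and `manual_open_after_commit` is `False`, which is the "you are responsible for committing" behavior described above.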

I don't feel strongly about this btw, if you think this is something to consider for later then happy to proceed as-is.

jakirkham · 2019-01-15T15:23:41Z

Well if their goal is to circumvent what we are doing with transactions, they can override self.db's isolation_level property. Personally would rather discuss with a user of this functionality to figure out what they are after. That avoids introducing features with rough edges.

alimanfoo · 2019-01-15T15:55:53Z

Fair enough, no objections.


jakirkham · 2019-01-15T18:21:24Z

Thanks @alimanfoo.

@jhamman, it looks like you have built some stuff on top of this PR. Are you happy with the contents here (for merging) or are there some things we still need to discuss/address?

jhamman · 2019-01-16T19:30:47Z

So I don't really know SQL at all, so I'm not going to be much use on the review here. You could say I'm as happy as I'm going to get here :) I'll give a quick browse just to be sure though...

I originally branched off this work thinking there may be some useful bits but none of those were used.

jhamman

just one little question

jhamman · 2019-01-16T19:35:34Z

zarr/storage.py

+            check_same_thread = False
+
+        # keep a lock for serializing mutable operations
+        self.lock = Lock()


Is there any value in allowing this lock (or the lock type) to be specified by the user? For example, we've found this sort of thing to be useful at times when using dask-distributed for I/O operations.
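If this were pursued, a user-supplied lock could be accepted roughly as sketched here. The `lock` keyword argument and the class `LockableStore` are hypothetical; the PR as reviewed creates its own `Lock()` and takes no such parameter:

```python
import sqlite3
import threading


class LockableStore:
    """Hypothetical store whose write lock can be supplied by the caller."""

    def __init__(self, path=":memory:", lock=None):
        self.db = sqlite3.connect(
            path, isolation_level=None, check_same_thread=False
        )
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v BLOB)"
        )
        # Fall back to a plain threading.Lock when the caller gives none.
        self.lock = threading.Lock() if lock is None else lock

    def set(self, key, value):
        with self.lock:
            self.db.execute("REPLACE INTO kv VALUES (?, ?)", (key, value))

    def get(self, key):
        row = self.db.execute(
            "SELECT v FROM kv WHERE k = ?", (key,)
        ).fetchone()
        if row is None:
            raise KeyError(key)
        return row[0]
```

Any object that works as a context manager would fit, for example `LockableStore(lock=threading.RLock())`.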

@alimanfoo and I were a little concerned by some of the wording in the docs about sqlite3 and thread safety as discussed here. Having this lock may be (overly) cautious. It's not clear what the right answer is here. Would be happy to hear other thoughts on this if you have any.

I think this is fine for now. We can certainly revisit this down the road if additional functionality is needed.

Thanks @jhamman. SGTM

jakirkham · 2019-01-19T21:53:18Z

Would someone like to do the honors here? ;)

jhamman · 2019-01-21T22:49:26Z

@jakirkham - I'd be happy to see you merge this. It sounds like @alimanfoo has signed off on this going in.

alimanfoo · 2019-01-21T22:57:06Z

Yep @jakirkham I think you should get the satisfaction of clicking the merge button here 😄

jakirkham · 2019-01-22T12:54:06Z

Thanks all 😄

jakirkham changed the title ~~WIP: Add sqlite store~~ WIP: Add SQLiteStore Dec 21, 2018

jakirkham added 8 commits December 21, 2018 00:50

Test SQLiteStore

e3e2c2e

Try using the `SQLiteStore` everywhere one would use another store and make sure that it behaves correctly. This includes simple key-value store usage, creating hierarchies, and storing arrays.

Export SQLiteStore to the top-level namespace

d60aaab

Include some SQLiteStore examples

ace251c

Provide a few examples of how one might use `SQLiteStore` to store arrays or groups. These examples are taken with minor modifications from the `LMDBStore` examples.

Demonstrate the SQLiteStore in the tutorial

ecf18f7

Includes a simple example borrowed from `LMDBStore`'s tutorial example, which shows how to create and use an `SQLiteStore`.

Provide API documentation for SQLiteStore

92a4d71

Make a release note for SQLiteStore

efa9ccd

Use unique extension for SQLiteStore files

6f68451

Otherwise we may end up opening a different database's files and trying to use them with SQLite, only to run into errors. This caused the doctests to fail previously. Changing the extension as we have done should avoid these conflicts.

jakirkham changed the title ~~WIP: Add SQLiteStore~~ Add SQLiteStore Dec 21, 2018

jakirkham added this to the v2.3 milestone Dec 21, 2018

jakirkham added 3 commits December 21, 2018 01:50

Update docs to show how to close SQLiteStore

20ef384

As users need to explicitly close the `SQLiteStore` to commit changes and serialize them to the SQLite database, make sure to point this out in the docs.

jakirkham mentioned this pull request Dec 21, 2018

Add SQLite #365

Closed

jakirkham added 5 commits December 21, 2018 02:10

Simplify SQLiteStore's __delitem__ using in

5ddc193

Make use of `in` instead of repeating the same logic in `__delitem__`. As we now keep the database open between operations, this is much simpler than duplicating the key check logic. Also makes it a bit easier to understand what is going on.

Simplify close and use flush

b339c09

Special case in-memory SQLite database

1cac5eb

No need to normalize the path when there isn't one (e.g. `:memory:`).

jakirkham requested a review from alimanfoo December 21, 2018 09:12

alimanfoo approved these changes Dec 21, 2018

zarr/storage.py (outdated, resolved)

docs/release.rst (outdated, resolved)

docs/release.rst (outdated, resolved)

alimanfoo reviewed Dec 21, 2018

jakirkham and others added 7 commits December 21, 2018 09:48

Drop unneeded empty return statement

b8e2d23

Update docs/release.rst

4db7e14

Fix a typo. Co-Authored-By: jakirkham <[email protected]>

Update docs/release.rst

31a9af3

Include author and original issue in changelog entry. Co-Authored-By: jakirkham <[email protected]>

Correct default value for check_same_thread

9f5d02b

The default value for `check_same_thread` was previously set to `False` when in reality we want this check enabled. So set `check_same_thread` to `True`.

Flush after making any mutation to the database

ac6827e

As users could change the setting of things like `check_same_thread` or they may try to access the same database from multiple threads or processes, make sure to flush any changes that would mutate the database.

Skip flushing data when pickling SQLiteStore

8b35eb8

As we now always commit after an operation that mutates the data, there is no need to commit before pickling the `SQLiteStore` object. After all the data should already be up-to-date in the database.

Skip using flush in close

f8d3f03

As everything should already be flushed to the database whenever the state is mutated, there is no need to perform this before closing.

alimanfoo and others added 2 commits January 2, 2019 20:38

Update docs/release.rst

d55ac16

Co-Authored-By: jakirkham <[email protected]>

TestSQLiteStore -> TestGroupWithSQLiteStore

996fd77

jakirkham added 4 commits January 3, 2019 13:47

Drop else in for/else for clarity

d268144

Ensure SQLite is new enough to enable threading

207565d

Adds a simple check to ensure SQLite is new enough to enable thread-safe sharing of connections before setting `check_same_thread=False`. If SQLite is not new enough, keep `check_same_thread=True`.

Add spacing around =

7e86d3e

Merge 'zarr-developers/master' into 'jakirkham/add_sqlite_store'

043eec4

alimanfoo reviewed Jan 3, 2019

zarr/storage.py (outdated, resolved)

alimanfoo reviewed Jan 3, 2019

zarr/storage.py (resolved)

jakirkham added 5 commits January 3, 2019 18:19

Raise when pickling an in-memory SQLite database

c65f78f

Test in-memory SQLiteStore's separately

505ac5f

Uses all the same tests we use for on-disk `SQLiteStore`s, except that it special-cases the pickling test to ensure the `SQLiteStore` cannot be pickled if it is in-memory.

Drop explicit setting of sqlite3 defaults

0bad6c5

Simply use the `Connection`'s default arguments implicitly instead of explicitly setting them in the constructor.

Adjust inheritance of TestSQLiteStoreInMemory

0dc34bb

Make sure to inherit directly from `unittest.TestCase` as well.

alimanfoo reviewed Jan 4, 2019

Merge 'zarr-developers/master' into 'jakirkham/add_sqlite_store'

f5b8913

jhamman approved these changes Jan 16, 2019

jakirkham merged commit 43f7fae into zarr-developers:master Jan 22, 2019

jakirkham deleted the add_sqlite_store branch January 22, 2019 12:52

jhamman mentioned this pull request Feb 4, 2019

running out of memory trying to write SQL pydata/xarray#1874

Closed
