Add SQLiteStore #368
Implements a key-value store using SQLite. As `sqlite3` is a built-in module in Python and SQLite is a common database to use in various languages, this should have high utility and be very portable. Not to mention many databases provide an SQL interface on top regardless of the internal representation. So this can be a great template for users wishing to work with Zarr in their preferred database.
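For readers who want a feel for the idea, here is a minimal sketch of a `MutableMapping` layered over a single SQLite table. It is not the code from this PR: the class name `TinySQLiteMapping`, the table name `kv`, and the column names are all made up for illustration, and it assumes autocommit mode via `isolation_level=None`.

```python
import sqlite3
from collections.abc import MutableMapping


class TinySQLiteMapping(MutableMapping):
    """Hypothetical minimal key-value mapping over SQLite (not zarr's code)."""

    def __init__(self, path=":memory:"):
        # isolation_level=None puts the sqlite3 module in autocommit mode
        self.db = sqlite3.connect(path, isolation_level=None)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v BLOB)"
        )

    def __getitem__(self, key):
        row = self.db.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
        if row is None:
            raise KeyError(key)
        return row[0]

    def __setitem__(self, key, value):
        # REPLACE inserts the row or overwrites an existing one
        self.db.execute("REPLACE INTO kv VALUES (?, ?)", (key, value))

    def __contains__(self, key):
        cur = self.db.execute("SELECT COUNT(*) FROM kv WHERE k = ?", (key,))
        return cur.fetchone()[0] > 0

    def __delitem__(self, key):
        # reuse the membership check instead of duplicating the lookup
        if key not in self:
            raise KeyError(key)
        self.db.execute("DELETE FROM kv WHERE k = ?", (key,))

    def __iter__(self):
        for (k,) in self.db.execute("SELECT k FROM kv"):
            yield k

    def __len__(self):
        return self.db.execute("SELECT COUNT(*) FROM kv").fetchone()[0]


store = TinySQLiteMapping()
store["foo/bar"] = b"chunk bytes"
```

Because the class only needs the five abstract `MutableMapping` methods, helpers such as `keys()`, `items()`, and `pop()` come for free from the base class.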
Try using the `SQLiteStore` everywhere one would use another store and make sure that it behaves correctly. This includes simple key-value store usage, creating hierarchies, and storing arrays.
Provide a few examples of how one might use `SQLiteStore` to store arrays or groups. These examples are taken with minor modifications from the `LMDBStore` examples.
Includes a simple example borrowed from `LMDBStore`'s tutorial example, which shows how to create and use an `SQLiteStore`.
Otherwise we may end up opening a different database's files and trying to use them with SQLite, only to run into errors. This caused the doctests to fail previously. Changing the extension as we have done should avoid these conflicts.
Instead of opening, committing, and closing the SQLite database for every operation, limit these to user requested operations. Namely commit only when the user calls `flush`. Also close only when the user calls `close`. This should make operations with SQLite much more performant than when we automatically committed and closed after every user operation.
As users need to explicitly close the `SQLiteStore` to commit changes and serialize them to the SQLite database, make sure to point this out in the docs.
Appears some of these commands work without capitalization. However as the docs show commands as capitalized, ensure that we are doing the same thing as well. That way this won't run into issues with different SQL implementations or older versions of SQLite that are less forgiving. Plus this should match closer to what users familiar with SQL expect.
Make use of `in` instead of repeating the same logic in `__delitem__`. As we now keep the database open between operations, this is much simpler than duplicating the key check logic. Also makes it a bit easier to understand what is going on.
This was needed when the `import` of `sqlite3` was only here to ensure that it existed (even though it wasn't used). Now we make use of `sqlite3` where it is being imported. So there is no need to tell flake8 to not worry about the unused import as there isn't one.
Make sure that everything intended to be added to the `SQLiteStore` database has been written to disk before attempting to pickle it. That way we can be sure other processes constructing their own `SQLiteStore` have access to the same data and not some earlier representation.
No need to normalize the path when there isn't one (e.g. `:memory:`).
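A rough sketch of that special case, using a hypothetical helper `normalize_db_path` (the name and the use of `os.path.abspath` are assumptions, not the PR's code). Passing `:memory:` through untouched matters because turning it into an absolute path would make SQLite create an on-disk file literally named `:memory:`.

```python
import os


def normalize_db_path(path):
    """Hypothetical helper: absolutize real paths, pass ':memory:' through."""
    if path == ":memory:":
        # special SQLite name for an in-memory database; there is no file
        return path
    return os.path.abspath(path)


in_memory = normalize_db_path(":memory:")
on_disk = normalize_db_path("data/example.sqldb")
```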
Thanks, all looks good. Couple of tiny comments.
zarr/storage.py (outdated)
kwargs.setdefault('timeout', 5.0)
kwargs.setdefault('detect_types', 0)
kwargs.setdefault('isolation_level', None)  # autocommit
kwargs.setdefault('check_same_thread', False)  # disallow writing from other threads
From the docs:
By default, check_same_thread is True and only the creating thread may use the connection. If set False, the returned connection may be shared across multiple threads. When using multiple threads with the same connection writing operations should be serialized by the user to avoid data corruption.
Maybe we should set this to True. The docs say any multithreaded use needs to be fully synchronized by the user. There is no mechanism in zarr for doing this - the synchronizer classes only prevent writing the same chunk at the same time, but they allow different chunks to be written concurrently, which the sqlite docs suggest could still cause corruption.
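To make the default behavior concrete, here is a small sketch (not part of the PR) showing what happens when a connection created with the default `check_same_thread=True` is touched from a second thread. The function name `use_from_other_thread` is just for illustration.

```python
import sqlite3
import threading

conn = sqlite3.connect(":memory:")  # check_same_thread defaults to True
errors = []


def use_from_other_thread():
    try:
        conn.execute("SELECT 1")
    except sqlite3.ProgrammingError as exc:
        # sqlite3 refuses cross-thread use of this connection
        errors.append(exc)


t = threading.Thread(target=use_from_other_thread)
t.start()
t.join()
conn.close()
```

With `check_same_thread=False` the same call would be allowed, and keeping writes serialized becomes the caller's job.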
Yep, that was probably an oversight on my part. Have corrected this to `True`. I think if we just make sure to `flush` after every write operation, we should be safe allowing this though.
After giving this some thought, it seems reasonable for us to always `flush` after mutating the database. For example other processes could be accessing the same database and we want to ensure they all see the same thing. This isn't too different from how we handle other stores.
Have pushed a few commits to tidy things up and implement this `flush`ing with each mutation behavior. Also have added an `update` function to allow more efficient submission of multiple changes (so we only commit once).
Switched back to setting `check_same_thread` to `False` by default as we now always serialize data if a change is made.
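As a sketch of the difference between committing on every mutation and batching several changes into one commit, assuming an autocommit connection and a hypothetical `kv` table (the function names `set_item` and `update` here are illustrative, not the PR's implementation):

```python
import sqlite3

db = sqlite3.connect(":memory:", isolation_level=None)  # autocommit mode
db.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v BLOB)")


def set_item(key, value):
    # a single statement; in autocommit mode it is committed immediately
    db.execute("REPLACE INTO kv VALUES (?, ?)", (key, value))


def update(items):
    # hypothetical batch write: one explicit transaction, one commit
    db.execute("BEGIN")
    try:
        db.executemany("REPLACE INTO kv VALUES (?, ?)", items.items())
        db.execute("COMMIT")
    except Exception:
        db.execute("ROLLBACK")
        raise


set_item("a", b"1")
update({"b": b"2", "c": b"3"})
count = db.execute("SELECT COUNT(*) FROM kv").fetchone()[0]
```

The batch path pays the commit cost once for all rows, which is the efficiency gain described above.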
Fix a typo. Co-Authored-By: jakirkham
Include author and original issue in changelog entry. Co-Authored-By: jakirkham
The default value for `check_same_thread` was previously set to `False` when in reality we want this check enabled. So set `check_same_thread` to `True`.
As users could change the setting of things like `check_same_thread` or they may try to access the same database from multiple threads or processes, make sure to flush any changes that would mutate the database.
As we now always commit after an operation that mutates the data, there is no need to commit before pickling the `SQLiteStore` object. After all the data should already be up-to-date in the database.
As everything should already be flushed to the database whenever the state is mutated, there is no need to perform this before closing.
Co-Authored-By: jakirkham
Thanks for taking a look @alimanfoo. Have tried to fix the few nits and answer your questions. Please let me know if you have more thoughts. :)
Adds a simple check to ensure SQLite is new enough to enable thread-safe sharing of connections before setting `check_same_thread=False`. If SQLite is not new enough, set `check_same_thread=True`.
As there are some concerns about keeping operations on the SQLite database sequential for thread-safety, acquire an internal lock when a DML operation occurs. This should ensure that only one modification can occur at a time regardless of whether the connection uses the serialized threading mode or not.
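A minimal sketch of that locking pattern, assuming a connection shared across threads (`check_same_thread=False`), an autocommit connection, and a hypothetical `kv` table. The helper `write` is illustrative rather than the PR's code.

```python
import sqlite3
import threading

db = sqlite3.connect(":memory:", isolation_level=None, check_same_thread=False)
db.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v BLOB)")
lock = threading.Lock()  # serializes every modifying statement


def write(key, value):
    with lock:
        # only one thread at a time may run DML on the shared connection
        db.execute("REPLACE INTO kv VALUES (?, ?)", (key, value))


threads = [
    threading.Thread(target=write, args=(f"chunk/{i}", bytes([i])))
    for i in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

count = db.execute("SELECT COUNT(*) FROM kv").fetchone()[0]
```

The lock makes the writes sequential regardless of how the underlying SQLite library was compiled, at the cost of some parallelism.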
Uses all the same tests we use for an on-disk `SQLiteStore`, except that it special-cases the pickling test to ensure the `SQLiteStore` cannot be pickled if it is in-memory.
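One way such pickling behavior could look, as a sketch only: the class `PicklableStore`, its `kv` table, and the choice of `pickle.PicklingError` are assumptions for illustration. The idea is to pickle just the path and reconnect on unpickle, which cannot work for `:memory:` because that database lives only inside the original connection.

```python
import os
import pickle
import sqlite3
import tempfile


class PicklableStore:
    """Hypothetical sketch: pickle only the path, reconnect on unpickle."""

    def __init__(self, path):
        self.path = path
        self.db = sqlite3.connect(path, isolation_level=None)  # autocommit
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v BLOB)"
        )

    def __getstate__(self):
        if self.path == ":memory:":
            # an in-memory database cannot be reopened by another process
            raise pickle.PicklingError("cannot pickle an in-memory database")
        return {"path": self.path}

    def __setstate__(self, state):
        self.__init__(state["path"])


tmpdir = tempfile.mkdtemp()
on_disk = PicklableStore(os.path.join(tmpdir, "example.sqldb"))
on_disk.db.execute("REPLACE INTO kv VALUES (?, ?)", ("a", b"1"))

# the clone opens its own connection to the same file
clone = pickle.loads(pickle.dumps(on_disk))
value = clone.db.execute("SELECT v FROM kv WHERE k = 'a'").fetchone()[0]

try:
    pickle.dumps(PicklableStore(":memory:"))
    in_memory_pickled = True
except pickle.PicklingError:
    in_memory_pickled = False
```

Since every write is committed right away in this sketch, the unpickled clone sees the same data without any extra flush before pickling.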
Simply use the `Connection`'s default arguments implicitly instead of explicitly setting them in the constructor.
Make sure to inherit directly from `unittest.TestCase` as well.
Thanks @jakirkham. I think it would be worth allowing `isolation_level` to be overridden by the user via kwargs into `__init__`. Otherwise looks good to go.
Not sure whether that is a good idea. If we do that, we will need to add explicit transactions in the code and add additional code to check whether transactions should or should not be used (spoiler: they can't be used with autocommit mode, but need to be used otherwise). Though I could be missing something. ref: http://charlesleifer.com/blog/going-fast-with-sqlite-and-python/
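To illustrate the two modes being contrasted, here is a small standalone sketch (not from the PR) comparing `isolation_level=None` with the `sqlite3` module's default. It uses `Connection.in_transaction` to show whether a write left a transaction open; the table `t` is made up.

```python
import sqlite3

# autocommit: each statement takes effect immediately
auto = sqlite3.connect(":memory:", isolation_level=None)
auto.execute("CREATE TABLE t (x)")
auto.execute("INSERT INTO t VALUES (1)")
auto_in_txn = auto.in_transaction

# default isolation_level: the module opens a transaction before the INSERT
manual = sqlite3.connect(":memory:")
manual.execute("CREATE TABLE t (x)")
manual.execute("INSERT INTO t VALUES (1)")
manual_in_txn_before = manual.in_transaction
manual.commit()  # the write is only finalized here
manual_in_txn_after = manual.in_transaction
```

Code written for one mode has to call `commit()` (or not) accordingly, which is why supporting both in the same store would need extra branching.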
I was really just thinking to allow someone who really knows what they're doing and wants to control transactions themselves to be able to do so. It is really just about leaving the door open for people to experiment with other values of `isolation_level`. I don't feel strongly about this btw; if you think this is something to consider for later then happy to proceed as-is.
Well if their goal is to circumvent what we are doing with transactions, they can override `self.db`'s `isolation_level` (https://docs.python.org/3/library/sqlite3.html#sqlite3.Connection.isolation_level). Personally would rather discuss with a user of this functionality to figure out what they are after. That avoids introducing features with rough edges.
Fair enough, no objections.
Thanks @alimanfoo. @jhamman, it looks like you have built some stuff on top of this PR. Are you happy with the contents here (for merging) or are there some things we still need to discuss/address?
So I don't really know SQL at all, so I'm not going to be much use on the review here. You could say I'm as happy as I'm going to get here :) I'll give a quick browse just to be sure though... I originally branched off this work thinking there may be some useful bits but none of those were used.
just one little question
check_same_thread = False

# keep a lock for serializing mutable operations
self.lock = Lock()
Is there any value in allowing this lock (or the lock type) to be specified by the user? For example, we've found this sort of thing to be useful at times when using dask-distributed for I/O operations.
@alimanfoo and I were a little concerned by some of the wording in the docs about `sqlite3` and thread safety as discussed here. Having this lock may be (overly) cautious. It's not clear what the right answer is here. Would be happy to hear other thoughts on this if you have any.
I think this is fine for now. We can certainly revisit this down the road if additional functionality is needed.
Thanks @jhamman. SGTM
Would someone like to do the honors here? ;)
@jakirkham - I'd be happy to see you merge this. It sounds like @alimanfoo has signed off on this going in.
Yep @jakirkham I think you should get the satisfaction of clicking the merge button here 😄
Thanks all 😄
Fixes #365
Adds an SQLite-backed `MutableMapping` store. Performs operations on the underlying SQLite database that treat it like, and expose it to users as, a key-value store. Should provide a very portable and reliable storage format. Also should be a useful template for users trying to implement a store for their own database that supports SQL.