File object is not thread-safe #130
Hmmm. This is disconcerting... Thanks for pointing this out, @will133. I'll have to look at this more carefully. I think what we'll want to end up implementing is to keep the current method but add a thread-safe bypass to the API for when needed:
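A rough sketch of how such a call might look, using the threadsafe keyword name that comes up later in this thread (the keyword is a proposal, not an existing PyTables option):

```python
import tables

# Proposed (not yet existing) keyword: when True, bypass the internal
# file-handle cache so each call returns an independent File object.
f = tables.openFile('data.h5', mode='r', threadsafe=True)
try:
    pass  # work with f in this thread only
finally:
    f.close()
```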
The default for this should be False to keep the current behavior. If you end up with a pull request which implements this, I would be happy to review it.
I ran into the same issue some time ago when I was opening HDF5 files from a CherryPy web server.
I think solution 1 is easier and less error-prone. I am not sure how much performance benefit one actually gets from caching the file handle (maybe test it with a file containing a huge number of groups/nodes).
I agree with @timeu. Solution 1 (don't cache the file handle) is the one that we should use as a bypass for thread safety.
FYI... I started an attempt at fixing this. I'm going with solution 1 above - add a keyword argument that bypasses the file handle cache.
Hi @joshayers, please consider the implications of bypassing the internal file cache. IMO there are two design alternatives:
Also, unless you are 100% sure that you are able to fix all possible threading issues, it would be better, IMO, to use a different name for the keyword argument.
In my code I usually wrap the handle in a with statement, so closing is usually not a huge issue. I think the safest thing to do is to not cache, and to store the file handles in a set instead of a dictionary mapping fileName => handle. Thus when the program exits it'll just close all the file handles in the set. I think any attempt to reuse the file handle is pretty dangerous and probably hard to get right.
Hey all, I'm happy to implement a solution for this, as long as the project committers agree on an approach. The possible solutions listed in this thread are:
1. Do not cache file handles at all; users are responsible for closing the files they open.
2. Keep the current caching/auto-close behavior, but add a keyword argument to openFile() that bypasses the cache for thread safety.
In general, I'd prefer the first solution. I think it's the grown-up thing to do - if you open it, you close it. But the second is probably the best if there are legions of people out there not closing their own files. Anyway, I'd really like to see a solution to this land. Rolling up our data into PyTables files is a bottleneck in our code right now, and I'd hate to have to re-think our whole concurrency model because one library doesn't allow you to safely open multiple files in their own threads. Thanks!
Hi @clutchski, thanks for taking this on. I think by virtue of actually working on this problem, you get to make some of the decisions. I haven't thought about the full repercussions of just removing the _open_files cache, though I am generally in favor of this (grown-up thing to do, etc.). I would encourage you to look into this! Alternatively, as I said at the top of this issue, given what might break, I think adding a keyword argument to openFile() to bypass the cache is very reasonable: f = tables.openFile(..., threadsafe=True). I'd like to see this issue resolved for you.
Thanks for looking into this problem! I don't know about other use cases (like how often the cache is hit), but here is what I'd prefer: by default, do not do any caching, but put each file handle in a set (hashed by reference instead of by the file name string). Thus, the exit function can close those file handles separately. If you want to do caching and reuse the handle, I think it should be the user's responsibility to do so. It's just pretty confusing now since it doesn't quite map one-to-one to the underlying HDF5 handle and it doesn't really do what the user expects it to do.
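A minimal sketch of the bookkeeping described above, assuming a module-level set of handles that an atexit hook drains; the names here (_open_handles, register, unregister) are illustrative, not actual PyTables internals:

```python
import atexit

# Track handles by identity in a set; nothing is keyed by file name,
# so two opens of the same path remain independent objects.
_open_handles = set()

def register(handle):
    _open_handles.add(handle)

def unregister(handle):
    # Called when the user closes a file explicitly.
    _open_handles.discard(handle)

def _close_remaining():
    # Exit hook: close whatever the user left open.
    for handle in list(_open_handles):
        try:
            handle.close()
        except Exception:
            pass
    _open_handles.clear()

atexit.register(_close_remaining)
```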
Cool. I'll take a stab and keep you posted.
IMHO PyTables should not bother to do any caching. Edit: an exception can perhaps be made for external links (in that case the dictionary can be put in File). I'm religiously against maintaining global state :)
Does anyone have any code for this that they would like to see go into v3.0? Now is your chance, even if it is only partially done.
So the option that I see as the easiest to implement and the most stable is just to remove the caching of file handles and simply keep a set of the open files so they can be closed at exit.
@avalentino Should we still keep the current caching mechanism around though? And then make the user ask for caching explicitly? (Also, when a user closes a file, its id should be removed from this set, right?) Also, is this something that you want to implement? I'd be happy to tackle #61 if you want to take care of this ;)
No, just drop the caching part (reuse of file handles) and keep the part that keeps track of open files so that we can close them at exit. And yes, files that are closed explicitly by the user should be removed from the set. In any case this is a very delicate matter IMO, and it can be addressed without API changes.
OK. I am going to move this to v3.1.
Will the fix for this issue in 3.1 allow me to go back to accessing the same data file from multiple threads at once? I am running PyTables 2.4 on Apache mod_wsgi to serve data files that are opened in read-only mode. Here are some different execution models, running 10 clients at about 20 requests/second:
Method A: threads=15, processes=1, files opened read-only, files under a shared flock - fails (see exception below).
The exception that gets triggered the most when using Method A is shown below; however, many exceptions get thrown, mostly related to nodes and the various caches.
File "", line 417, in
Hi @thadeusb, I think we are basically looking for someone willing to write a fix and submit a PR. The other devs have had a lot on their plate recently with the release. If you want to implement one of the proposed fixes, that would be great!
This document gives more info about the multi-threading support and thread safety of HDF5. In short, it is not thread-safe unless it is compiled with the thread-safe option (--enable-threadsafe).
For the record, that applies to pthreads, not Python threads. Once again, I think that the major use case for allowing Python threads is read-only files only. Only one thread should ever have a write handle.
See also thread [1] on the users' mailing list; it includes a nice test program that replicates the issue in a webapp. [1] https://groups.google.com/forum/?hl=it#!topic/pytables-users/8yAGVNgIxyw
OK, I'm closing this issue, which should be fixed by #314. Please note that PyTables still can't be considered thread-safe, but now users can implement their own locking machinery without interference from the PyTables internal file cache.
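A minimal sketch of such user-level locking machinery, assuming read-only access and the PyTables 2.x openFile/getNode API; the helper name and the single global lock are illustrative choices, not an officially endorsed recipe:

```python
import threading
import tables

_pytables_lock = threading.Lock()

def read_node(path, where):
    # Serialize all PyTables/HDF5 calls behind one lock: HDF5 itself is not
    # thread-safe unless built with thread safety enabled, so a single
    # global lock is the conservative choice.
    with _pytables_lock:
        h5file = tables.openFile(path, mode='r')
        try:
            return h5file.getNode(where).read()
        finally:
            h5file.close()
```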
It seems like the File object you get back from tables.openFile is not thread-safe. An example:
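A minimal sketch of the kind of usage that triggers this, assuming an existing file test.h5 with a readable node /data (both names are placeholders):

```python
import threading
import tables

def worker():
    # Each thread expects an independent handle, but openFile returns the
    # cached File object, so a close() in one thread can invalidate the
    # handle another thread is still using.
    h5file = tables.openFile('test.h5', mode='r')
    try:
        h5file.getNode('/data').read()
    finally:
        h5file.close()

threads = [threading.Thread(target=worker) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```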
I got many exceptions that look like:
It seems like PyTables is trying to cache and reuse the file handle when openFile is called; the same "File" object is returned. This is definitely not intuitive, as I do not know what the expected behavior of sharing the same HDF5 handle in a threaded environment should be.
It also seems like the _open_file dict is keyed by file name. Thus, you can end up removing that name while another thread is closing the same file handle. I think the better way is to not cache this File object at all; otherwise, all calls to the underlying File object need to be synchronized.
I'm running PyTables 2.3.1 with Python 2.7.2.