
File object is not thread-safe #130


Closed
will133 opened this issue Feb 16, 2012 · 23 comments

@will133

will133 commented Feb 16, 2012

It seems like the File object you get back from tables.openFile is not thread-safe. An example:

import threading
import tables
import os

class OpenFileThread(threading.Thread):
    def run(self):
        f = tables.openFile('test.h5', mode='r')
        arr = f.root.testArray[8:12, 18:22]
        f.close()

if __name__  == '__main__':
    if not os.path.exists('test.h5'):
        testFile = tables.openFile('test.h5', mode='w')
        group = testFile.createCArray(testFile.root, 'testArray', tables.Int64Atom(), (200, 300))
        testFile.close()

    threads = []
    for i in xrange(10):
        t = OpenFileThread()
        t.start()
        threads.append(t)

    for t in threads:
        t.join()

I got many exceptions that look like:

Exception in thread Thread-7:
Traceback (most recent call last):
...
  File ".../site-packages/tables/file.py", line 2162, in close
    del _open_files[filename]
KeyError: 'test.h5'

It seems like PyTables is trying to cache and reuse the file handle when openFile is called: the same "File" object is returned. This is definitely not intuitive, and I do not know what the expected behavior of sharing the same HDF5 handle in a threaded environment is.

It also seems like the _open_files dict is keyed by file name, so one thread can end up removing that name while another thread is closing the same file handle. I think the better approach is to not cache the File object at all; otherwise, all calls to the underlying File object need to be synchronized.

I'm running PyTables 2.3.1 with Python 2.7.2.

@scopatz
Member

scopatz commented Feb 16, 2012

Hmmm, this is disconcerting... Thanks for pointing it out, @will133.

I'll have to look at this more carefully. I think what we'll want to implement is to keep the current mechanism but add a thread-safe bypass to the API for when it is needed:

f = tables.openFile(..., threadsafe=True)

The default should be False to keep the current behavior. If you end up with a pull request that implements this, I would be happy to review it.

@timeu
Member

timeu commented Feb 17, 2012

I ran into the same issue some time ago when I was opening HDF5 files from a CherryPy web server.
I tried to introduce a reference counter and close the file when it reached 0, but it somehow wasn't really working, so I ended up not closing the files at all.
I think there are two solutions:

  • Don't cache the file handle at all.
  • Use a reference counter + lock to count the number of times the file was opened, and close it when the counter reaches 0 (see http://effbot.org/zone/thread-synchronization.htm).
    In the open function, acquire the lock on the counter, increment it by one, then release the lock.
    In the close function, acquire the lock on the counter; if it is 0, close the file, otherwise decrement the counter by one; then release the lock.

I think solution 1 is easier and less error-prone. I am not sure how much performance benefit one actually gets from caching the file handle (maybe test it with a file with huge amounts of groups/nodes).
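
For reference, a minimal, self-contained sketch of the reference-counting idea from the second bullet could look like the following (the acquire/release names and the read-only mode are illustrative assumptions, not PyTables API; it only makes the open/close bookkeeping safe, and concurrent access to the shared handle would still need its own synchronization):

import threading
import tables

_lock = threading.Lock()
_refcounts = {}   # filename -> (shared File handle, number of users)

def acquire(filename):
    """Return a shared read-only handle, opening it on first use."""
    with _lock:
        if filename in _refcounts:
            handle, count = _refcounts[filename]
            _refcounts[filename] = (handle, count + 1)
        else:
            handle = tables.openFile(filename, mode='r')
            _refcounts[filename] = (handle, 1)
        return handle

def release(filename):
    """Drop one reference; really close only when the last user is done."""
    with _lock:
        handle, count = _refcounts[filename]
        if count == 1:
            del _refcounts[filename]
            handle.close()
        else:
            _refcounts[filename] = (handle, count - 1)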

@scopatz
Member

scopatz commented Feb 18, 2012

I agree with @timeu. Solution 1 (don't cache the file handle) is the one that we should use as a bypass for thread safety.

@joshayers
Member

FYI... I started an attempt at fixing this. I'm going with solution 1 above: add a threadsafe keyword argument and bypass the _open_files cache in that case. It still needs a lot of tests, and I'll probably be asking for advice in the near future.

@avalentino
Member

Hi @joshayers, please consider that bypassing the _open_files registry can cause data loss at program exit if some file is left open.

IMO there are two design alternatives:

  • to find another way to nicely close files that are left open (e.g. play with __del__ - dangerous)
  • to stress, at the documentation level, that it is the responsibility of the user to properly close all files before program exit

Also, unless you are 100% sure that you are able to fix all possible threading issues, it would be better, IMO, to use a different name for the keyword argument, maybe no_registry, use_registry=False or something similar.

@will133
Author

will133 commented Aug 3, 2012

In my code I usually wrap the handle with the with statement, so closing is usually not a huge issue. I think the safest thing to do is to not cache at all and to store the file handles in a set instead of a dictionary mapping fileName => handle. Then, when the program exits, it can just close all the file handles in the set. I think any attempt to reuse the file handle is pretty dangerous and probably hard to get right.
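
For illustration, a minimal version of that pattern (contextlib.closing only assumes the handle has a close() method, so it works whether or not the File object itself supports the with statement; the file and array names are taken from the example above):

from contextlib import closing
import tables

with closing(tables.openFile('test.h5', mode='r')) as f:
    data = f.root.testArray[8:12, 18:22]
# the handle is closed here even if the read raised an exception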

@clutchski

Hey all,

I'm happy to implement a solution for this, as long as the project committers agree on an approach. The possible solutions listed in this thread are:

  • Remove the _open_files cache
  • Add synchronization around the _open_files cache
  • Add an option which allows you to opt out of the open files cache
  • Add an alternate openFile method which opts out of the open files cache
  • Something else?

In general, I'd prefer the first solution. I think it's the grown-up thing to do - if you open it, you close it. But the second is probably the best if there are legions of people out there not closing their own files.

Anyway, I'd really like to see a solution to this land. Rolling up our data into pytables files is a bottleneck in our code right now and I'd hate to have to re-think our whole concurrency model because one library doesn't allow you to safely open multiple files in their own threads.

Thanks!

@scopatz
Member

scopatz commented Feb 28, 2013

Hi @clutchski, thanks for taking this on. I think by virtue of actually working on this problem, you get to make some of the decisions. I haven't thought about the full repercussions of just removing the _open_files cache, though I am generally in favor of this (grownup thing to do etc). I would encourage you to look into this!

Alternatively, as I said at the top of this issue, given what might break, I think adding a keyword argument to openFile() to bypass the cache is very reasonable:

f = tables.openFile(..., threadsafe=True)

I'd like to see this issue resolved for you.

@will133
Author

will133 commented Feb 28, 2013

Thanks for looking into this problem! I don't know about other use cases (like how often the cache is hit), but here is what I'd prefer:

By default, do not do any caching, but put each file handle in a set (hashed by reference instead of by the file name string). That way the exit function can close those file handles separately. If you want to do caching and reuse the handle, I think it should be the user's responsibility to do so. It's just pretty confusing now, since it doesn't quite map one-to-one to the underlying HDF5 handle and it doesn't really do what the user expects it to do.
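
A rough, PyTables-independent sketch of that idea, with made-up names rather than a proposed API: handles are tracked by identity, never reused, and whatever is still open gets closed at exit.

import atexit
import threading

_open_handles = set()          # tracked by object identity, not by filename
_registry_lock = threading.Lock()

def register(handle):
    with _registry_lock:
        _open_handles.add(handle)

def unregister(handle):
    with _registry_lock:
        _open_handles.discard(handle)   # an explicit close removes the entry

def _close_remaining():
    with _registry_lock:
        leftovers = list(_open_handles)
        _open_handles.clear()
    for handle in leftovers:            # close whatever the program left open
        handle.close()

atexit.register(_close_remaining)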

@clutchski

Cool. I'll take a stab and keep you posted.

@andreabedini
Contributor

IMHO PyTables should not bother to do any caching. The tables.file.File object should work exactly like Python's builtin file object: handles should be closed as the File object gets destroyed (as happens for the builtin file). This would simplify the code and increase maintainability. Caching can easily be implemented by the user if required. In my use I found the PyTables caching system buggy and frustrating. Just my 2c.

edit: an exception can perhaps be made for external links (in that case the dictionary can be kept in File). I'm religiously against maintaining global state :)

@scopatz
Member

scopatz commented Apr 29, 2013

Does anyone have any code they would like to see for this that should go into v3.0? Now is your chance, even if it is only partially done.

@scopatz
Member

scopatz commented May 1, 2013

So the option that I am seeing as being the easiest to implement and the most stable is just to remove the _open_files cache entirely. Is this basically how other people feel as well?

@avalentino
Member

@scopatz personally I like the solution proposed by @will133:

By default, do not do caching at all but put each file handle in a set (hash by reference instead of the file name string). Thus, the exit function can close those file handles separately.

@scopatz
Member

scopatz commented May 1, 2013

@avalentino Should we still keep the current caching mechanism around though? And then make the user ask for caching explicitly? (Also, when a user closes a file, its id should be removed from this set, right?) Also, is this something that you want to implement? I'd be happy to tackle #61 if you want to take care of this ;)

@avalentino
Member

@avalentino Should we still keep the current caching mechanism around though? And then make the user ask for caching explicitly? (Also, when a user closes a file, its id should be removed from this set, right?)

No, just drop the caching part (reuse of file handles) and keep the part that keeps track of open files so that we can close them at exit. And yes, files that are closed explicitly by the user should be removed from the set.

In any case this is a very delicate matter IMO and it can be addressed without API changes.
Probably it is better to postpone this task to the next feature release.

@scopatz
Member

scopatz commented May 2, 2013

Ok. I am going to move this to v3.1.

@thadeusb
Contributor

thadeusb commented Jun 6, 2013

Will the fix for this issue in 3.1 allow me to go back to accessing the same data file from multiple threads at once?

I am running PyTables 2.4 under Apache mod_wsgi to serve data files that are opened in read-only mode.

Here are some different execution models

Running 10 clients at about 20 requests/second

Method A: threads=15 processes=1, files opened read only, files under a shared flock, fails (see exception below)
Method B: threads=15 processes=1, files opened append, files under a exclusive flock, success (but slow due to serializing the data file access to one thread at a time)
Method C: threads=1 processes=15, files opened read only, files under a shared flock, success (what I am using currently)
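
For context, the flock part of these setups is roughly the following (POSIX-only, using the standard fcntl module; the sidecar lock-file name and the flocked helper are just local conventions, not anything PyTables provides):

import fcntl
from contextlib import contextmanager

@contextmanager
def flocked(path, exclusive=False):
    # shared lock for readers (Methods A and C), exclusive lock for the writer (Method B)
    with open(path + '.lock', 'a') as lockfile:
        fcntl.flock(lockfile, fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH)
        try:
            yield
        finally:
            fcntl.flock(lockfile, fcntl.LOCK_UN)

# usage, Method C style: each worker reads under a shared lock
# with flocked('data.h5'):
#     f = tables.openFile('data.h5', mode='r')
#     ...
#     f.close()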

The exception that gets triggered most often when using Method A is shown below; many other exceptions get thrown as well, mostly related to nodes and the various caches.

  File "", line 417, in
    nodes = [x.name for x in self.db.listNodes(grp)]
  File "/usr/local/lib/python2.7/dist-packages/tables/file.py", line 1469, in listNodes
    return group._f_listNodes(classname)
  File "/usr/local/lib/python2.7/dist-packages/tables/group.py", line 650, in _f_listNodes
    return list(self._f_iterNodes(classname))
  File "/usr/local/lib/python2.7/dist-packages/tables/group.py", line 670, in _f_iterNodes
    yield self._v_children[name]
  File "/usr/local/lib/python2.7/dist-packages/tables/misc/proxydict.py", line 30, in __getitem__
    return self._getValueFromContainer(self._getContainer(), key)
  File "/usr/local/lib/python2.7/dist-packages/tables/group.py", line 40, in _getValueFromContainer
    return container._f_getChild(key)
  File "/usr/local/lib/python2.7/dist-packages/tables/group.py", line 642, in _f_getChild
    return self._v_file._getNode(childPath)
  File "/usr/local/lib/python2.7/dist-packages/tables/file.py", line 1048, in _getNode
    "stale weak reference to dead node %s" % nodePath
AssertionError: stale weak reference to dead node /data

@scopatz
Member

scopatz commented Jun 7, 2013

Hi @thadeusb, I think we are basically looking for someone willing to write a fix and submit a PR. The other devs have had a lot on their plate recently with the release. If you want to implement one of the proposed fixes, that would be great!

@andreabedini
Contributor

This document gives more info about the multi-threading support and thread-safety of HDF5. In short, it is not thread-safe unless it is compiled with the --enable-threadsafe option.

@scopatz
Member

scopatz commented Sep 19, 2013

For the record, that applies to pthreads, not Python threads. Once again, I think that the major use case for allowing Python threads is only for read-only files. Only one thread should ever have a write handle.
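
A minimal sketch of that discipline, reusing the file and array names from the example at the top of this issue (the single-writer arrangement is a convention the application has to enforce, not something PyTables guarantees; and before the fix discussed here, even this pattern tripped over the shared _open_files registry):

import threading
import tables

def reader():
    # each reader thread opens and closes its own read-only handle
    f = tables.openFile('test.h5', mode='r')
    try:
        _ = f.root.testArray[8:12, 18:22]
    finally:
        f.close()

_write_lock = threading.Lock()

def writer(row, values):
    # at most one thread at a time ever holds a write handle
    with _write_lock:
        f = tables.openFile('test.h5', mode='a')
        try:
            f.root.testArray[row, :len(values)] = values
        finally:
            f.close()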

@avalentino
Member

See also thread [1] on the users' mailing list; it includes a nice test program that replicates the issue in a webapp.

[1] https://groups.google.com/forum/?hl=it#!topic/pytables-users/8yAGVNgIxyw

@avalentino
Member

OK, I'm closing this issue; it should be fixed by #314.
A unit test very similar to the test code attached by @will133 has been included in the test suite.

Please note that PyTables still can't be considered thread-safe, but now the user can implement their own locking machinery without interference from the PyTables internal file cache.
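
As an example, the kind of user-level locking meant here can be as simple as one process-wide lock serializing every PyTables call (the read_slice helper below is purely illustrative and uses the pre-3.0 camelCase names used elsewhere in this thread; it is not an API that PyTables provides):

import threading
import tables

_pytables_lock = threading.Lock()

def read_slice(filename, nodepath, key):
    # open, read and close while holding a single process-wide lock
    with _pytables_lock:
        f = tables.openFile(filename, mode='r')
        try:
            return f.getNode(nodepath)[key]
        finally:
            f.close()

# e.g. read_slice('test.h5', '/testArray', (slice(8, 12), slice(18, 22)))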
