
File object is not thread-safe #130


Closed
will133 opened this issue Feb 16, 2012 · 23 comments

@will133

will133 commented Feb 16, 2012

It seems like the File object you get back from tables.openFile is not thread-safe. An example:

import threading
import tables
import os

class OpenFileThread(threading.Thread):
    def run(self):
        f = tables.openFile('test.h5', mode='r')
        arr = f.root.testArray[8:12, 18:22]
        f.close()

if __name__  == '__main__':
    if not os.path.exists('test.h5'):
        testFile = tables.openFile('test.h5', mode='w')
        group = testFile.createCArray(testFile.root, 'testArray', tables.Int64Atom(), (200, 300))
        testFile.close()

    threads = []
    for i in xrange(10):
        t = OpenFileThread()
        t.start()
        threads.append(t)

    for t in threads:
        t.join()

I got many exceptions that look like:

Exception in thread Thread-7:
Traceback (most recent call last):
...
  File ".../site-packages/tables/file.py", line 2162, in close
    del _open_files[filename]
KeyError: 'test.h5'

It seems like PyTables is trying to cache and reuse the file handle when openFile is called: the same "File" object is returned. This is definitely not intuitive, and I do not know what the expected behavior of sharing the same HDF5 handle in a threaded environment is.

It also seems like the _open_files dict is keyed by file name, so one thread can end up removing that name while another thread is closing the same file handle. I think the better approach is to not cache the File object at all; otherwise, all calls to the underlying File object need to be synchronized.

I'm running PyTables 2.3.1 with Python 2.7.2.

@scopatz
Member

scopatz commented Feb 16, 2012

Hmmm, this is disconcerting... Thanks for pointing it out, @will133.

I'll have to look at this more carefully. I think what we'll want to implement is to keep the current mechanism but add a thread-safe bypass to the API for when it is needed:

f = tables.openFile(..., threadsafe=True)

The default should be False to keep the current behavior. If you end up with a pull request that implements this, I would be happy to review it.

@timeu
Member

timeu commented Feb 17, 2012

I ran into the same issue some time ago when I was opening HDF5 files from a CherryPy web server.
I tried to introduce a reference counter and close the file when it reached 0, but it somehow wasn't really working, so I ended up not closing the files at all.
I think there are two solutions:

  • Don't cache the file handle at all.
  • Use a reference counter + lock to count the number of times the file was opened, and close it when the counter reaches 0 (see http://effbot.org/zone/thread-synchronization.htm).
    In the open function, acquire the lock on the counter, increment it by one, then release the lock.
    In the close function, acquire the lock on the counter; if it is 0, close the file, otherwise decrement the counter by one; then release the lock.

I think solution 1 is easier and less error-prone. I am not sure how much performance benefit one actually gets from caching the file handle (maybe test it with a file with huge amounts of groups/nodes).
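
For reference, a minimal, self-contained sketch of the reference-counting idea from the second bullet could look like the following (the acquire/release names and the read-only mode are illustrative assumptions, not PyTables API; it only makes the open/close bookkeeping safe, and concurrent access to the shared handle would still need its own synchronization):

import threading
import tables

_lock = threading.Lock()
_refcounts = {}   # filename -> (shared File handle, number of users)

def acquire(filename):
    """Return a shared read-only handle, opening it on first use."""
    with _lock:
        if filename in _refcounts:
            handle, count = _refcounts[filename]
            _refcounts[filename] = (handle, count + 1)
        else:
            handle = tables.openFile(filename, mode='r')
            _refcounts[filename] = (handle, 1)
        return handle

def release(filename):
    """Drop one reference; really close only when the last user is done."""
    with _lock:
        handle, count = _refcounts[filename]
        if count == 1:
            del _refcounts[filename]
            handle.close()
        else:
            _refcounts[filename] = (handle, count - 1)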

@scopatz
Member

scopatz commented Feb 18, 2012

I agree with @timeu. Solution 1 (don't cache the file handle) is the one that we should use as a bypass for thread safety.

@joshayers
Member

FYI... I started an attempt at fixing this. I'm going with solution 1 above: add a threadsafe keyword argument and bypass the _open_files cache in that case. It still needs a lot of tests, and I'll probably be asking for advice in the near future.

@avalentino
Member

Hi @joshayers, please consider that bypassing the _open_files registry can cause data loss at program exit if some file is left open.

IMO there are two design alternatives:

  • to find another way to nicely close files that are left open (e.g. play with __del__ - dangerous)
  • to stress, at the documentation level, that it is the responsibility of the user to properly close all files before program exit

Also, unless you are 100% sure that you are able to fix all possible threading issues, it would be better, IMO, to use a different name for the keyword argument, maybe no_registry, use_registry=False or something similar.

@will133
Author

will133 commented Aug 3, 2012

In my code I usually wrap the handle with the with statement, so closing is usually not a huge issue. I think the safest thing to do is to not cache at all and to store the file handles in a set instead of a dictionary mapping fileName => handle. Then, when the program exits, it can just close all the file handles in the set. I think any attempt to reuse the file handle is pretty dangerous and probably hard to get right.
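
For illustration, a minimal version of that pattern (contextlib.closing only assumes the handle has a close() method, so it works whether or not the File object itself supports the with statement; the file and array names are taken from the example above):

from contextlib import closing
import tables

with closing(tables.openFile('test.h5', mode='r')) as f:
    data = f.root.testArray[8:12, 18:22]
# the handle is closed here even if the read raised an exception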

@clutchski

Hey all,

I'm happy to implement a solution for this, as long as the project committers agree on an approach. The possible solutions listed in this thread are:

  • Remove the _open_files cache
  • Add synchronization around the _open_files cache
  • Add an option which allows you to opt out of the open files cache
  • Add an alternate openFile method which opts out of the open files cache
  • Something else?

In general, I'd prefer the first solution. I think it's the grown-up thing to do - if you open it, you close it. But the second is probably the best if there are legions of people out there not closing their own files.

Anyway, I'd really like to see a solution to this land. Rolling up our data into pytables files is a bottleneck in our code right now and I'd hate to have to re-think our whole concurrency model because one library doesn't allow you to safely open multiple files in their own threads.

Thanks!

@scopatz
Member

scopatz commented Feb 28, 2013

Hi @clutchski, thanks for taking this on. I think by virtue of actually working on this problem, you get to make some of the decisions. I haven't thought about the full repercussions of just removing the _open_files cache, though I am generally in favor of this (grownup thing to do etc). I would encourage you to look into this!

Alternatively, as I said at the top of this issue, given what might break, I think adding a keyword argument to openFile() to bypass the cache is very reasonable:

f = tables.openFile(..., threadsafe=True)

I'd like to see this issue resolved for you.

@will133
Author

will133 commented Feb 28, 2013

Thanks for looking into this problem! I don't know about other use cases (like how often the cache is hit), but here is what I'd prefer:

By default, do not do any caching, but put each file handle in a set (hashed by reference instead of by the file name string). That way the exit function can close those file handles separately. If you want to do caching and reuse the handle, I think it should be the user's responsibility to do so. It's just pretty confusing now, since it doesn't quite map one-to-one to the underlying HDF5 handle and it doesn't really do what the user expects it to do.
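
A rough, PyTables-independent sketch of that idea, with made-up names rather than a proposed API: handles are tracked by identity, never reused, and whatever is still open gets closed at exit.

import atexit
import threading

_open_handles = set()          # tracked by object identity, not by filename
_registry_lock = threading.Lock()

def register(handle):
    with _registry_lock:
        _open_handles.add(handle)

def unregister(handle):
    with _registry_lock:
        _open_handles.discard(handle)   # an explicit close removes the entry

def _close_remaining():
    with _registry_lock:
        leftovers = list(_open_handles)
        _open_handles.clear()
    for handle in leftovers:            # close whatever the program left open
        handle.close()

atexit.register(_close_remaining)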

@clutchski

Cool. I'll take a stab and keep you posted.

@andreabedini
Contributor

IMHO PyTables should not bother to do any caching. The tables.file.File object should work exactly like Python's builtin file object: handles should be closed as the File object gets destroyed (as happens for the builtin file). This would simplify the code and increase maintainability. Caching can easily be implemented by the user if required. In my use I found the PyTables caching system buggy and frustrating. Just my 2c.

edit: an exception can perhaps be made for external links (in that case the dictionary can be kept in File). I'm religiously against maintaining global state :)

@scopatz
Member

scopatz commented Apr 29, 2013

Does anyone have any code they would like to see for this that should go into v3.0? Now is your chance, even if it is only partially done.

@scopatz
Member

scopatz commented May 1, 2013

So the option that I am seeing as being the easiest to implement and the most stable is just to remove the _open_files cache entirely. Is this basically how other people feel as well?

@avalentino
Member

@scopatz personally I like the solution proposed by @will133:

By default, do not do caching at all but put each file handle in a set (hash by reference instead of the file name string). Thus, the exit function can close those file handles separately.

@scopatz
Member

scopatz commented May 1, 2013

@avalentino Should we still keep the current caching mechanism around though? And then make the user ask for caching explicitly? (Also, when a user closes a file, its id should be removed from this set, right?) Also, is this something that you want to implement? I'd be happy to tackle #61 if you want to take care of this ;)

@avalentino
Member

@avalentino Should we still keep the current caching mechanism around though? And then make the user ask for caching explicitly? (Also, when a user closes a file, its id should be removed from this set, right?)

No, just drop the caching part (reuse of file handles) and keep the part that keeps track of open files so that we can close them at exit. And yes, files that are closed explicitly by the user should be removed from the set.

In any case this is a very delicate matter IMO and it can be addressed without API changes.
Probably it is better to postpone this task to the next feature release.

@scopatz
Member

scopatz commented May 2, 2013

Ok. I am going to move this to v3.1.

@thadeusb
Contributor

thadeusb commented Jun 6, 2013

Will the fix for this issue in 3.1 allow me to go back to accessing the same data file from multiple threads at once?

I am running PyTables 2.4 under Apache mod_wsgi to serve data files that are opened in read-only mode.

Here are some different execution models

Running 10 clients at about 20 requests/second

Method A: threads=15 processes=1, files opened read only, files under a shared flock, fails (see exception below)
Method B: threads=15 processes=1, files opened append, files under a exclusive flock, success (but slow due to serializing the data file access to one thread at a time)
Method C: threads=1 processes=15, files opened read only, files under a shared flock, success (what I am using currently)
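
For context, the flock part of these setups is roughly the following (POSIX-only, using the standard fcntl module; the sidecar lock-file name and the flocked helper are just local conventions, not anything PyTables provides):

import fcntl
from contextlib import contextmanager

@contextmanager
def flocked(path, exclusive=False):
    # shared lock for readers (Methods A and C), exclusive lock for the writer (Method B)
    with open(path + '.lock', 'a') as lockfile:
        fcntl.flock(lockfile, fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH)
        try:
            yield
        finally:
            fcntl.flock(lockfile, fcntl.LOCK_UN)

# usage, Method C style: each worker reads under a shared lock
# with flocked('data.h5'):
#     f = tables.openFile('data.h5', mode='r')
#     ...
#     f.close()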

The exception that gets triggered most often when using Method A is shown below; many other exceptions get thrown as well, mostly related to nodes and the various caches.

  File "", line 417, in
    nodes = [x.name for x in self.db.listNodes(grp)]
  File "/usr/local/lib/python2.7/dist-packages/tables/file.py", line 1469, in listNodes
    return group._f_listNodes(classname)
  File "/usr/local/lib/python2.7/dist-packages/tables/group.py", line 650, in _f_listNodes
    return list(self._f_iterNodes(classname))
  File "/usr/local/lib/python2.7/dist-packages/tables/group.py", line 670, in _f_iterNodes
    yield self._v_children[name]
  File "/usr/local/lib/python2.7/dist-packages/tables/misc/proxydict.py", line 30, in __getitem__
    return self._getValueFromContainer(self._getContainer(), key)
  File "/usr/local/lib/python2.7/dist-packages/tables/group.py", line 40, in _getValueFromContainer
    return container._f_getChild(key)
  File "/usr/local/lib/python2.7/dist-packages/tables/group.py", line 642, in _f_getChild
    return self._v_file._getNode(childPath)
  File "/usr/local/lib/python2.7/dist-packages/tables/file.py", line 1048, in _getNode
    "stale weak reference to dead node %s" % nodePath
AssertionError: stale weak reference to dead node /data

@scopatz
Member

scopatz commented Jun 7, 2013

Hi @thadeusb, I think we are basically looking for someone willing to write a fix and submit a PR. The other devs have had a lot on their plate recently with the release. If you want to implement one of the proposed fixes, that would be great!

@andreabedini
Contributor

This document gives more info about the multi-threading support and thread-safety of HDF5. In short, it is not thread-safe unless it is compiled with the --enable-threadsafe option.

@scopatz
Member

scopatz commented Sep 19, 2013

For the record, that applies to pthreads, not Python threads. Once again, I think that the major use case for allowing Python threads is only for read-only files. Only one thread should ever have a write handle.
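
A minimal sketch of that discipline, reusing the file and array names from the example at the top of this issue (the single-writer arrangement is a convention the application has to enforce, not something PyTables guarantees; and before the fix discussed here, even this pattern tripped over the shared _open_files registry):

import threading
import tables

def reader():
    # each reader thread opens and closes its own read-only handle
    f = tables.openFile('test.h5', mode='r')
    try:
        _ = f.root.testArray[8:12, 18:22]
    finally:
        f.close()

_write_lock = threading.Lock()

def writer(row, values):
    # at most one thread at a time ever holds a write handle
    with _write_lock:
        f = tables.openFile('test.h5', mode='a')
        try:
            f.root.testArray[row, :len(values)] = values
        finally:
            f.close()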

@avalentino
Member

See also thread [1] on the users' mailing list; it includes a nice test program that replicates the issue in a webapp.

[1] https://groups.google.com/forum/?hl=it#!topic/pytables-users/8yAGVNgIxyw

@avalentino
Member

OK, I'm closing this issue; it should be fixed by #314.
A unit test very similar to the test code attached by @will133 has been included in the test suite.

Please note that PyTables still can't be considered thread-safe, but now the user can implement their own locking machinery without interference from the PyTables internal file cache.
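
As an example, the kind of user-level locking meant here can be as simple as one process-wide lock serializing every PyTables call (the read_slice helper below is purely illustrative and uses the pre-3.0 camelCase names used elsewhere in this thread; it is not an API that PyTables provides):

import threading
import tables

_pytables_lock = threading.Lock()

def read_slice(filename, nodepath, key):
    # open, read and close while holding a single process-wide lock
    with _pytables_lock:
        f = tables.openFile(filename, mode='r')
        try:
            return f.getNode(nodepath)[key]
        finally:
            f.close()

# e.g. read_slice('test.h5', '/testArray', (slice(8, 12), slice(18, 22)))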
