-
-
Notifications
You must be signed in to change notification settings - Fork 115
filelock 3.11 breaks our multi-threaded parallel processing use case #230
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
I'm not sure what would be a workaround here, clearly 99% of case makes sense to have it thread local, but how could we allow usage for this 1% 🤔 cc @csm10495 |
Any chance we can have a small code sample to play with? Is it one lock per thread or are the same locks ever used across threads? Or does the initial process just generate locks for threads to eventually use? |
I tried via this sample (and the latest released filelock):
It seems to work properly |
@slanzmich can you confirm please |
Same issue there - we use the lock to prevent concurrent access between batch processes (not threads). Our code is asynchronous ( More generally, using |
@dmoklaf please can you provide a minimal code sample to reproduce? I tried via:
But it works fine for me. |
Maybe its something like this:
The assertion fails very often on both 3.10.7 and 3.11.0 .. though often on 3.11.0 it hangs. The possible hang in 3.11.0 is because the order of the gather'd tasks isn't guaranteed by asyncio. So it can wind up doing an acquire while already locked leading to deadlock.. The same hang doesn't happen in 3.10.7 because the lock object is re-entrant and allows you to The assertion in either case shows that this type of usage isn't really safe (in either case) due to the ordering of the threads running. I feel like this seems like a bit of a stretch though so please if either of you can provide a similarly length failing case, i'd be happy evaluate further. |
Here is some sample code that reproduces the issue. The symptom shall be that the programs freezes at some point - this requires to use filelock 3.11.0 (works perfectly with 3.10.7) AND to have some multi-threading going on (hence why I run parallel tasks below and pauses to force the asyncio runtime to allocate multiple threads in its internal pool).
|
Thanks for the example. Why is the lock needed if each For example I pulled out the calls to acquire/release (removing the need for the lock) and tried:
and it seemed to work because |
Here is a similar version that has a chance for overlapping a segment:
This version on 3.10.7 is likely to hit an assertion error because of the re-entrant nature of the lock, which allows multiple threads to work with the segment at the same time. (On 3.11.0 it'll probably hang) If you don't have overlapping segments (via removing the |
This is because we use filelock to prevent separate processes from stepping on each other toes while processing data segments - this is our only interest for filelock. This example is an extreme simplification of our python processes, that I extracted only to reproduce the "freeze" bug as requested - it does not show the multi-processing part of course. But running several times in parallel this process through the bash shell could simulate the multi-processing part easily. It works perfectly with filelock < 3.11. And it fails every time with 3.11. |
I tried this with 3.11.0:
This does both multi process and asyncio, and didn't get a hang. What OS are you on? I'm on Windows. |
On a Linux docker i can see the hang, but i think its expected and that the previous behavior may have been misleading. Consider this sample:
On Linux with 3.10.7 this prints:
On 3.11 this hangs.. which makes sense to me since the lock was not released (so the other threads can't acquire it). The lock is re-entrant which means that each time acquire is called in the same process it just adds to the lock count and continues. If you're locking ranges in the same process using the same object for the purpose of modifying the underlying resource: it isn't really locking it (cross thread) with 3.10.7. The new behavior actually does lock cross thread which I believe was the more natural expectation here. @gaborbernat what are you're thoughts here? In theory we could use a metaclass or something to allow for the old behavior along with the new behavior, but I'm kind of concerned that issues hit from the swap from 3.10 to 3.11 may be pointing to other bugs or potential issues elsewhere in folks's stacks. |
My impression is that this thread's local behavior perhaps should be on by default, however one should be able to disable it. |
Fair enough. I'll look into a way to make that possible. |
I am on MacOS 13.3.1 with Python 3.10.10 I tested your code and, indeed, it works flawlessly. However my code still freezes. The difference between both our codes is only that you multiprocess while my code doesn't (it runs as a single process). I have just checked our logs and indeed, the freezes occur at times where a single process is running (which is completely random and not something we can actively control, it depends on the duration of jobs, length of the data batches, ...) - not when multiple processes are running concurrently! So it seems there is something within filelock's 3.11 version that freezes single processes, and this bug is hidden/does not happen when multiple processes are operating concurrently. |
Using multiple threads (and not the same one) when wrapping synchronous API with asyncio is (unfortunately) a common practice. So, if you make that choice, more and more developers will need to activate this setting, as asyncio's popularity is growing fast - it will be productive if this setting is announced front and center in the project's homepage. Otherwise they will bug this issue list endlessly. Second, why is some thread-local code absolutely necessary? The unix code seems simple, is that related to the Windows code? Or something else? |
The original code is reentrant across threads, effectively making it not work as expected across threads. I reported for Windows, others reported on Mac. See the conversation in #82. Do you expect that multiple threads can acquire and hold the lock at the same time? Generally I don't think that's intuitive, since even if using asyncio, you can have cases where multiple writers are holding the lock (and in theory could be writing). If they could never be writing at the same time, the lock wouldn't be needed. |
Calling a blocking function from separate, unrelated threads is classic with asyncio (that's how most synchronous libraries are embedded in asynchronous code). So forbidding this case, or making it non-trivial through configuration arguments, may lead more and more developers to declare GitHub issues here as asyncio is gaining in popularity. The solution I propose is simple - and I believe the library as it was before 3.11 was fine for both as long the developper was doing the Right Thing (TM). Here it is - I see 2 cases for reentrance/concurrency against filelock: - With multi-tasks code (asyncio lib)
- With multi-threaded code (threading lib)
So, as long as it is written more clearly in the documentation that this library only ensures inter-process locking and that it has undefined behaviour if several tasks or several threads access at the same time the lock, #82 does not seem like a bug - but an undefined behaviour of the library while called in a forbidden way. The developper has to ensure inter-tasks locking or inter-threads locking himself, and everything as it was before 3.11 will work perfectly. Alternatively, I don't see how the filelock library could handle this complexity itself and hide it from the developper, as it would have to handle both the multi-tasks case and the multi-threaded case. These are very different cases with different technological stacks. The reason being that Python, now, proposes 2 very different ways to execute code in parallel. |
Interesting. @gaborbernat (as author) what was the library's initial intent here? Was it just inter-process locking or also inter-thread locking? |
I don't think there was an eplixicit intent here 🤔 so I'd go for a solution that supports both, though perhaps configurable 🤔 |
Opened a PR that should give us both behaviors. Hopefully it gives us all worlds. |
Sorry for not following up earlier, I was out for a couple of days. @csm10495 / @gaborbernat Does it still make sense to provide a code example? I only saw the issue on Windows, but I haven't checked whether it also fails on other platforms. |
Thank you for taking quick action to provide an alternative to others. We were able to use it effectively to get workable code with a multi threaded workload. I'm personally not too sure that the new behavior, thread local by default, is appropriate. I feel like the multi threaded worjload described above is quite a common path for those utilizing locks. I'm not asking for any immediate action, but more i want to raise my voice as one user that would support making the thread local behavior opt in as opposed to opt out. |
Hi,
In #219 @csm10495 wondered
I guess we have a use case that breaks:
We have a multi-threaded processing pipeline that is set up roughly like this:
SoftFileLock
for the output file.These steps are executed for multiple inputs, each generating a different output file. Step 2 for one input can run in parallel to step 3 for another input. However, step 3 for one input must not run in parallel to step 3 for another input, as it uses resources that shall not be shared.
The
SoftFileLock
is required because there may be another process that is executed on an overlapping set of input files (thus generating an overlapping set of output files). With filelock 3.10, we could pass the lock from the thread executing step 1 to the thread executing step 4, but this broke with #219.For this use case, I don't see anything wrong with sharing the lock between threads, because the lock is intended to prevent concurrent access between different processes. What do you think?
The text was updated successfully, but these errors were encountered: