osc/pt2pt: Previous locks must complete in order #2593
* To prevent deadlock, force all previously requested locks to complete before starting a new lock. Otherwise this can lead to deadlock if the locks are processed in arbitrary order. Signed-off-by: Mark Allen <[email protected]>
Per discussion on the teleconf today: This is partly a question of how to interpret the MPI standard. Should the MPI library be responsible for avoiding deadlock in this scenario, or should the application? @markalle has a small repeater test case that we can probably share to help the discussion. Until we reach a decision I'm going to mark this as do-not-merge.
Here's a gist for a testcase example that acquires multiple locks in the same order, at least from the lock requester's perspective. If the locks are implemented as lock requests that are sent out simultaneously and may be satisfied in any order, this can deadlock. https://gist.github.com/markalle/5a1cd53da9a1fa12f06e2c3e42951beb
If the above is declared illegal, I think quite a few smaller cases of overlap would become illegal by implication. Even if no individual rank pair overlapped by more than one lock, you could still deadlock with "0 locks a b, 1 locks b c, 2 locks a c" if 0 gets a first, 1 gets b first, and due to non-ordering of the requests 2 gets c first. I think you'd end up with some sort of weird-looking transitive property if you tried to define the allowed overlap cases.
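The three-rank case above can be checked mechanically by looking for a cycle in the wait-for graph. The following is a minimal sketch in plain C, not Open MPI code: `lock_state`, `in_wait_cycle`, `unordered_grants`, and `ordered_grants` are hypothetical names introduced only for illustration, and the model assumes exclusive locks with each rank blocked on at most one lock at a time.

```c
#include <stdbool.h>

#define NUM_RANKS 3
#define NUM_LOCKS 3
#define NONE (-1)

enum { LOCK_A = 0, LOCK_B = 1, LOCK_C = 2 };

/* Hypothetical model (an assumption, not Open MPI's data structures):
 * holder[l] is the rank holding lock l, or NONE if the lock is free;
 * wanted[r] is the lock rank r is blocked on, or NONE if r is not blocked. */
typedef struct {
    int holder[NUM_LOCKS];
    int wanted[NUM_RANKS];
} lock_state;

/* Follow "rank waits for the holder of its wanted lock" edges from start.
 * Each rank has at most one outgoing edge, so NUM_RANKS steps are enough
 * to either return to start (a cycle, i.e. deadlock) or fall off the chain. */
bool in_wait_cycle(const lock_state *s, int start)
{
    int rank = start;
    for (int step = 0; step < NUM_RANKS; step++) {
        int lock = s->wanted[rank];
        if (lock == NONE) {
            return false;            /* this rank can make progress */
        }
        int next = s->holder[lock];
        if (next == NONE) {
            return false;            /* wanted lock is free */
        }
        if (next == start) {
            return true;             /* walked back to the starting rank */
        }
        rank = next;
    }
    return false;
}

/* Grant order from the comment: 0 gets a, 1 gets b, 2 gets c first.
 * Rank 0 then waits for b, rank 1 for c, rank 2 for a. */
lock_state unordered_grants(void)
{
    lock_state s = {
        .holder = { 0, 1, 2 },
        .wanted = { LOCK_B, LOCK_C, LOCK_A },
    };
    return s;
}

/* Same start, but each rank's first lock must complete before its second
 * is requested. Rank 2 asks for a first and holds nothing while it waits,
 * so c stays free and rank 1 obtains it. */
lock_state ordered_grants(void)
{
    lock_state s = {
        .holder = { 0, 1, 1 },
        .wanted = { LOCK_B, NONE, LOCK_A },
    };
    return s;
}
```

In this model every rank in the `unordered_grants()` state sits on a cycle, while no rank in the `ordered_grants()` state does: rank 1 can finish and release b and c, after which rank 0 and then rank 2 can proceed.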
moving to 2.0.3
Will review this later today.
@markalle Did you ever follow up with the MPI Forum on the required semantics here?
I didn't open a forum topic, but I did find the part of the standard that made me question the original test:
My first test multiple_locks.c violates this statement, but it's easy to modify the test to comply with the above and still hit the deadlock this pull request is about: Basically I just changed the test from
to
I didn't test MPICH, but I did try osc=rdma and it passes. Philosophically I don't think you can say "it's the user's responsibility to not deadlock" with the current code. The standard explicitly allows overlapping access epochs if they're on different windows, and it makes no statement about this not applying to passive target, so it should be legal to hold multiple locks. But there's no way to correctly code multiple locks to avoid deadlock when it's implemented this way.
The IBM CI (PGI Compiler) build failed! Please review the log, linked below. Gist: https://gist.github.com/d1fd2b3422fc00176a2b3840dd404050
@jhursey can you make it retest the PGI part? I wasn't able to see the failure there. Regarding the testcase addressing the MPI-standard question, I had to update the test again: Anyway I still think this has to be legal code. The standard specifically says multiple access epochs can overlap (to different windows). But without the ordering guarantee that this commit adds, I think it's impossible for an app to correctly code a deadlock-free acquisition of two locks A and B.
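To make the two-lock argument concrete, here is a small sketch that enumerates the possible arrival orders for two requesters that both lock A and then B. It is a simplified model in plain C, not the osc/pt2pt implementation; `two_lock_deadlock` and `count_two_lock_deadlocks` are hypothetical names, and the model assumes exclusive locks, one target per lock, and that whichever request reaches a target first is granted.

```c
#include <stdbool.h>

/* Hypothetical model: requesters 0 and 1 each want exclusive locks A and B,
 * requested in that order. first_at_a / first_at_b name the requester whose
 * request would reach that target first if both requests were in flight.
 * wait_for_previous models the ordering guarantee discussed in this PR:
 * the request for B is not started until the lock on A has completed. */
bool two_lock_deadlock(bool wait_for_previous, int first_at_a, int first_at_b)
{
    int holder_a = first_at_a;
    int holder_b;

    if (wait_for_previous) {
        /* Only the requester that already holds A has asked for B. */
        holder_b = holder_a;
    } else {
        /* Both requests for B are in flight, so either may arrive first. */
        holder_b = first_at_b;
    }

    /* Each requester needs both locks, so split ownership means neither
     * can ever proceed. */
    return holder_a != holder_b;
}

/* Count the deadlocking outcomes over all four arrival-order combinations. */
int count_two_lock_deadlocks(bool wait_for_previous)
{
    int count = 0;
    for (int a = 0; a < 2; a++) {
        for (int b = 0; b < 2; b++) {
            if (two_lock_deadlock(wait_for_previous, a, b)) {
                count++;
            }
        }
    }
    return count;
}
```

Under these assumptions, two of the four arrival orders deadlock when requests are unordered and none do once each lock must complete before the next is started, even though the application requested A and B in the same order both times.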
bot:ibm:pgi:retest
"Pull Request Build Checker" is saying "failed". Where does the information hide about what it's saying failed?
@markalle Here's how you dig into the PRBC:
Repeat for any / all of the red / failed configs and/or individual builds.
Thanks, I never would have found those. Anyway I don't think these failures are real, or at least not related to this checkin. Guess I'll hit retest again?
bot:ibm:pgi:retest
retest
bot:retest
@markalle Do we need this PR any more?
This PR has grown stale. We can reopen if we need it later. @markalle If the problem is still present let's make sure there is an Issue to track progress.