
osc/pt2pt: Previous locks must complete in order #2593


Closed

Conversation

jjhursey
Member

To prevent deadlock, force all previously requested locks to complete before starting a new lock. Otherwise this can lead to deadlock if the locks are processed in arbitrary order.

 * To prevent deadlock, force all previously requested locks to complete
   before starting a new lock. Otherwise this can lead to deadlock if
   the locks are processed in arbitrary order.

Signed-off-by: Mark Allen <[email protected]>
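A conceptual sketch of the ordering rule described above (this is not the actual osc/pt2pt code; the struct and helper names below are hypothetical and used only to illustrate the idea of serializing lock acquisition in request order):

    /* Conceptual illustration only -- not the real osc/pt2pt implementation.
     * Idea: a new passive-target lock request is not started until every
     * previously requested lock has completed, so locks are acquired in
     * the order they were requested. */
    struct lock_state {
        int outstanding;   /* locks requested but not yet completed;
                              decremented elsewhere when a lock completes */
    };

    /* Hypothetical stand-ins for the component's progress loop and
     * lock-request machinery. */
    void progress(struct lock_state *s);
    void send_lock_request(struct lock_state *s, int target, int lock_type);

    void start_lock(struct lock_state *s, int target, int lock_type)
    {
        /* Force all previously requested locks to complete before
         * starting a new one. */
        while (s->outstanding > 0) {
            progress(s);
        }
        s->outstanding++;
        send_lock_request(s, target, lock_type);
    }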
@jjhursey jjhursey added this to the v2.0.2 milestone Dec 16, 2016
@jjhursey jjhursey requested a review from hjelmn December 16, 2016 18:34
This was referenced Dec 16, 2016
@jjhursey
Member Author

Per discussion on the teleconf today: this is somewhat an MPI standard interpretation issue. Should the MPI library be responsible for avoiding deadlock in this scenario, or should the application?

@markalle has a small reproducer test case that we can probably share to help the discussion.

Until we reach a decision I'm going to mark this as do-not-merge.

@markalle
Contributor

markalle commented Dec 20, 2016

Here's a gist with a test case that acquires multiple locks in the same order, at least from the lock requester's perspective. If the lock requests are simply sent out simultaneously and can be satisfied in any order, this will deadlock.

https://gist.github.com/markalle/5a1cd53da9a1fa12f06e2c3e42951beb

If the above is declared illegal, I think quite a few smaller cases of overlap would become illegal by implication. Even if no individual rank pair overlapped by more than one lock, you could still deadlock with "0 locks a b, 1 locks b c, 2 locks a c" if 0 gets a first, 1 gets b first, and, due to the non-ordering of the requests, 2 gets c first.

I think you'd end up with some sort of weird looking transitive property if you tried to define the allowed overlap cases.
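A minimal sketch of the pattern being discussed (this is not the linked gist; the rank roles and buffer setup are simplified assumptions, with ranks 0 and 1 as lock targets and ranks 2 and up as requesters):

    /* Every requesting rank takes exclusive locks on the same targets in
     * the same order.  If the simultaneously issued lock requests may be
     * granted in any order, two requesters can each win a different lock
     * first and then wait on each other: deadlock. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, buf = 0;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Win_create(&buf, sizeof(buf), sizeof(buf), MPI_INFO_NULL,
                       MPI_COMM_WORLD, &win);

        if (rank >= 2) {   /* ranks 2..N-1 are the lock requesters */
            MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);   /* lock "a" */
            MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 1, 0, win);   /* lock "b" */
            MPI_Win_unlock(1, win);
            MPI_Win_unlock(0, win);
        }

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }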

@hppritcha
Member

moving to 2.0.3

@hppritcha hppritcha modified the milestones: v2.0.3, v2.0.2 Dec 24, 2016
@hjelmn
Member

hjelmn commented Jan 12, 2017

Will review this later today.

@jjhursey
Member Author

@markalle @hjelmn Where did you all come down on this issue?

@jsquyres
Member

jsquyres commented Feb 6, 2017

@markalle @hjelmn Where are we on this PR?

@hppritcha
Member

Has anyone tried this test with MPICH using either the ch3 or ch4 device? @jjhursey thinks this passed with Platform MPI. @markalle, could you try this test with an MPICH variant?

@jsquyres
Member

@markalle @hjelmn @jjhursey 🏓

@jjhursey jjhursey removed this from the v2.0.3 milestone Mar 6, 2017
@jjhursey
Member Author

jjhursey commented Mar 6, 2017

@markalle Did you ever follow up with the MPI Forum on the required semantics here?

@markalle
Contributor

I didn't open a forum topic, but I did find the part of the standard that made me question the original test:

Distinct access epochs for win at the same process must be disjoint. On the other hand, epochs pertaining to different win arguments may overlap.

My first test, multiple_locks.c, violates this statement, but it's easy to modify the test to comply with the above and still hit the deadlock this pull request is about:
https://gist.github.com/markalle/a9d461129f8bec665b9b203681a4bb0a
This version, multiple_locks2.c, has each epoch on a different window, so I can't see any standards argument against it.

Basically I just changed the test from

    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, /* rank (peer) */ 1, /* assert */ 0, win);
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, /* rank (peer) */ 2, /* assert */ 0, win);

to

    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, /* rank (peer) */ 1, /* assert */ 0, win[1]);
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, /* rank (peer) */ 2, /* assert */ 0, win[2]);

I didn't test MPICH, but I did try osc=rdma and it passes.

Philosophically, I don't think you can say "it's the user's responsibility to not deadlock" with the current code. The standard explicitly allows overlapping access epochs if they're on different windows, and it makes no statement about this not applying to passive target, so it should be legal to hold multiple locks. But there's no way to correctly code multiple locks to avoid deadlock when it's implemented this way.
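A minimal sketch of the multi-window variant being argued here (this is not multiple_locks2.c itself; the target ranks and window indexing are simplified assumptions): each exclusive lock is taken on its own window, so the access epochs at the requesting process involve different win arguments and no longer overlap on the same window.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, buf0 = 0, buf1 = 0;
        MPI_Win win[2];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Win_create(&buf0, sizeof(buf0), sizeof(buf0), MPI_INFO_NULL,
                       MPI_COMM_WORLD, &win[0]);
        MPI_Win_create(&buf1, sizeof(buf1), sizeof(buf1), MPI_INFO_NULL,
                       MPI_COMM_WORLD, &win[1]);

        if (rank >= 2) {   /* ranks 2..N-1 request the locks */
            /* One exclusive lock per window: each access epoch is on a
             * distinct win argument, which the standard permits to overlap. */
            MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win[0]);
            MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 1, 0, win[1]);
            MPI_Win_unlock(1, win[1]);
            MPI_Win_unlock(0, win[0]);
        }

        MPI_Win_free(&win[0]);
        MPI_Win_free(&win[1]);
        MPI_Finalize();
        return 0;
    }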

@ibm-ompi

The IBM CI (PGI Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/d1fd2b3422fc00176a2b3840dd404050

@markalle
Contributor

@jjhursey can you make it retest the PGI part? I wasn't able to see the failure there.

Regarding the testcase addressing the MPI-standard question, I had to update the test again:
https://gist.github.com/markalle/5fb3f8de41d36645e05ba7492df3b4c3

Anyway, I still think this has to be legal code. The standard specifically says multiple access epochs can overlap (on different windows). But without the ordering guarantee that this commit adds, I think it's impossible for an app to correctly code a deadlock-free acquisition of two locks A and B.

@jjhursey
Member Author

bot:ibm:pgi:retest

@markalle
Contributor

"Pull Request Build Checker" is saying "failed". Where does the information hide about what it's saying failed?

@jsquyres
Member

@markalle Here's how you dig into the PRBC:

  • Click on the "Details" to the right
    • Remember that the PRBC fronts several different builds across many different OSs / platforms
    • It's basically manifested as a 2-level system
  • Scroll alllllll the way down to the bottom
  • You'll see several entries at the bottom; at least one will be red
  • Click on the "Build #xxx" for the red entry (do NOT click on the "console output" for it!)
  • Again, scroll allllll the way down to the bottom
  • There will be several more configs / platforms listed; at least one of them will be red
  • Click on the red one
  • Now click on "Console output"
    • This is the console output of that specific failed build

Repeat for any / all of the red / failed configs and/or individual builds.

@markalle
Contributor

Thanks, I never would have found those. Anyway, I don't think these failures are real, or at least they're not related to this check-in. Guess I'll hit retest again?

  • freebsd 11.0 : "opal_pmix_base_select failed .. It looks like orte_init failed for some reason; your parallel process is likely to abort .."
  • OS-X : "git fetch --tags --progress https://github.com/open-mpi/ompi/ +refs/heads/:refs/remotes/origin/ +refs/pull/:refs/remotes/origin/pr/
    FATAL: java.nio.channels.ClosedChannelException"

@markalle
Contributor

bot:ibm:pgi:retest

@markalle
Contributor

retest

@markalle
Contributor

bot:retest

@jjhursey
Member Author

@markalle Do we need this PR any more?

@jjhursey
Member Author

This PR has grown stale. We can reopen it if we need it later. @markalle If the problem is still present, let's make sure there is an issue to track progress.

@jjhursey jjhursey closed this Oct 10, 2017
@jjhursey jjhursey deleted the topic/osc-pt2pt-lock-ordering branch October 10, 2017 15:33