MPIR broken on the v4.0.x branch #6613
Comments
I wonder if it's related to this PMIx issue (found while working on a similar MPIR hang):
Can you try PMIx v3.1.3rc1 to see if it resolves the issue? |
Thanks for the response. The issue doesn't happen on v3.1.3rc1, but then again it doesn't reproduce on the v3.1.2 tag either, so I suspect it may be a different problem. |
It's surprising that you could not reproduce with v3.1.2 directly, as we were able to when investigating the similar issue we were tracking. |
@jjhursey are you close to getting this issue resolved? |
We are waiting for PMIx v3.1.3rc3 to pick up the event handling fix. There were a couple of issues with PMIx v3.1.3rc2 that are being resolved right now. We hope to have that ready by Thursday this week. Once we have it ready, I'll prepare a PR for the OMPI v4.0.x branch. Per the comment from earlier in the thread, there may be another issue to track down but the PMIx update does make the problem disappear. The PMIx event issue was the bug that I found when working a similar issue with MPIR, TotalView, and the OMPI v4.0.x branch. |
I can confirm that building OMPI v4.0.x with PMIx v3.1.3rc3 fixes this issue. |
@jjhursey will the next v4.0.x release have this version of PMIx? |
The PMIx release is being held at the moment so we can investigate a separate issue. I missed the Open MPI teleconf this week, so I'm not sure whether they are holding the next v4.0.x release for the official PMIx release or not. @gpaulsen @hppritcha might know better. |
Howard and I are meeting this afternoon and will discuss this. Is there any estimate of how long PMIx 3.1.3 will be on hold before it's released? If we can't bound it, we may need to decouple the two releases.
|
@rhc54 might have an opinion here. We are trying to scope the problem now. In particular, we want to make sure we have a solid understanding of some of the failures we are seeing and if we need to include any more fixes in the 3.1.3 release before it goes out. I don't have a firm date at the moment. |
I'm afraid I don't have anything firm either. I'm hoping we can complete things and do the official release by the end of June, but realistically it may slide into July. However, looking at the milestone schedule, v4.0.2 isn't coming out until end Sept at the earliest, so I don't see us holding that up. We will advise if something unexpected comes up. |
Does this only require a PMIx update or also OMPI code change? |
It's just a PMIx change, so you need an updated PMIx. We can give you the rc3, but the problem is that you would knowingly be shipping something that may have some other issues. We can discuss this a bit more on the Tuesday call. |
The new PMIx update has been merged to v4.0.x. @hppritcha has volunteered to verify this fix before we close this issue. |
Verified that this works with v4.0.x at e547a2b. |
The MPIR debugging interface is broken in the v4.0.x branch. Attempting to use it causes the mpirun process to hang. This bug is present in the v4.0.1 binaries and in a freshly compiled build of the v4.0.x head. The bug was not present in the 4.0.0 release build.
This bug was found to reproduce on a CentOS 7 machine using the Intel compiler (icc version 18.2) and also gcc version 8.1. No extra configuration flags were specified.
A git bisect shows the breaking change was 335f8c5: Update to PMIx 3.1.2. A simple reproducer is to compile a hello world MPI program, then launch mpirun under gdb as follows. Note that the behavior for 4.0.0 is correct: the hello_world application runs to completion. In the 4.0.1 case, the mpirun process hangs and the hello_world application is unable to progress.
I used the following command to reproduce this issue:
gdb -q $(which mpirun) -ex 'start -np 3 ./main' -ex 'set MPIR_being_debugged=1' -ex 'continue' -ex 'quit'
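For anyone who wants to script this check (for example, while bisecting), here is a minimal Python sketch that wraps the gdb command above in a timeout and treats "did not finish in time" as a hang. The function names, the 60-second default timeout, and the `./main` default are illustrative assumptions, not part of Open MPI or of the original report.

```python
import subprocess


def build_gdb_command(mpirun_path, nprocs, program):
    """Return the argv list for the gdb-based reproducer shown above."""
    return [
        "gdb", "-q", mpirun_path,
        "-ex", f"start -np {nprocs} {program}",
        "-ex", "set MPIR_being_debugged=1",
        "-ex", "continue",
        "-ex", "quit",
    ]


def mpirun_hangs(cmd, timeout_s=60):
    """Run cmd and return True if it has not exited after timeout_s seconds.

    Assumption: a healthy run finishes well within the timeout. On timeout,
    subprocess.run() kills the direct child (gdb) only; processes started by
    mpirun may need separate cleanup.
    """
    try:
        subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return True
    return False


def bisect_exit_code(mpirun_path, nprocs=3, program="./main", timeout_s=60):
    """Map the result to `git bisect run` codes: 0 good, 1 bad, 125 skip."""
    if mpirun_path is None:
        return 125  # mpirun was not found, so this commit cannot be tested
    cmd = build_gdb_command(mpirun_path, nprocs, program)
    return 1 if mpirun_hangs(cmd, timeout_s) else 0
```

A driver script could end with `sys.exit(bisect_exit_code(shutil.which("mpirun")))` and be passed to `git bisect run`, provided the script also rebuilds and installs Open MPI at each step; that rebuild step is not shown here.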
For 4.0.0, everything works fine:
While for newer versions, a hang occurs: