MPIR broken on the v4.0.x branch #6613

Closed
stephen-roberts-work opened this issue Apr 25, 2019 · 15 comments
Comments

@stephen-roberts-work
The MPIR debugging interface is broken on the v4.0.x branch. Attempting to use it causes the mpirun process to hang. This bug is present in the v4.0.1 binaries and in a freshly compiled build of the v4.0.x head. The bug was not present in the 4.0.0 release build.

This bug reproduces on a CentOS 7 machine using the Intel compiler (icc) version 18.2 and also GCC version 8.1. No extra configuration flags were specified.

A git bisect shows the breaking change was 335f8c5: Update to PMIx 3.1.2. A simple reproducer is to compile a hello-world MPI program and then launch mpirun under gdb as follows. Note that the behavior for 4.0.0 is correct: the hello_world application runs to completion. In the 4.0.1 case the mpirun process hangs and the hello_world application is unable to progress.

I used the following command to reproduce this issue:

gdb -q $(which mpirun) -ex 'start -np 3 ./main' -ex 'set MPIR_being_debugged=1' -ex 'continue' -ex 'quit'

For 4.0.0, everything works fine:

$ module load mpi/openmpi/gcc-8.1.0/4.0.0
$ mpicxx -g -O0 -std=c++11  main.cpp -o main
$ gdb -q $(which mpirun) -ex 'start -np 3 ./main' -ex 'set MPIR_being_debugged=1' -ex 'continue' -ex 'quit'
Reading symbols from /home/sroberts/Tickets/xxx/gcc-installs/openmpi-4.0.0-install/bin/orterun...(no debugging symbols found)...done.
Temporary breakpoint 1 at 0x400db0
Starting program: /home/sroberts/Tickets/HPCL3-624/gcc-installs/openmpi-4.0.0-install/bin/mpirun -np 3 ./main
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Temporary breakpoint 1, 0x0000000000400db0 in main ()
Continuing.
[New Thread 0x7ffff2b39700 (LWP 32335)]
[New Thread 0x7ffff1d24700 (LWP 32336)]
[New Thread 0x7ffff0b01700 (LWP 32337)]
[New Thread 0x7fffebfff700 (LWP 32338)]
--------------------------------------------------------------------------
Open MPI has detected that you have attached a debugger to this MPI
job, and that debugger is using the legacy "MPIR" method of
attachment.
....
disable this warning by setting the OMPI_MPIR_DO_NOT_WARN envar to 1.
--------------------------------------------------------------------------
Detaching after fork from child process 32339.
Detaching after fork from child process 32340.
Detaching after fork from child process 32341.
Hello World!
SUCCESS
[Thread 0x7ffff2b39700 (LWP 32335) exited]
[Thread 0x7ffff1d24700 (LWP 32336) exited]
[Thread 0x7fffebfff700 (LWP 32338) exited]
[Thread 0x7ffff0b01700 (LWP 32337) exited]
[Inferior 1 (process 32331) exited normally]
$

While for newer versions, a hang occurs:

$ module load mpi/openmpi/gcc-8.1.0/4.0.1
$ mpicxx -g -O0 -std=c++11  main.cpp -o main
$ gdb -q $(which mpirun) -ex 'start -np 3 ./main' -ex 'set MPIR_being_debugged=1' -ex 'continue' -ex 'quit'
Reading symbols from /home/sroberts/Tickets/HPCL3-624/gcc-installs/openmpi-4.0.1-install/bin/orterun...(no debugging symbols found)...done.
Temporary breakpoint 1 at 0x400db0
Starting program: /home/sroberts/Tickets/HPCL3-624/gcc-installs/openmpi-4.0.1-install/bin/mpirun -np 3 ./main
...
disable this warning by setting the OMPI_MPIR_DO_NOT_WARN envar to 1.
--------------------------------------------------------------------------
Detaching after fork from child process 32745.
Detaching after fork from child process 32746.
Detaching after fork from child process 32747.
(hangs forever)
^C
Program received signal SIGINT, Interrupt.
0x00007ffff67fc20d in poll () from /lib64/libc.so.6
A debugging session is active.

        Inferior 1 [process 32735] will be killed.

Quit anyway? (y or n) y
$
@jjhursey
Member

I wonder if it's related to this PMIx issue (found while working on a similar MPIR hang):

Can you try using PMIx v3.1.3rc1 to see if it resolves the issue?

@stephen-roberts-work
Author

Thanks for the response. The issue doesn't happen on v3.1.3rc1, but then again it doesn't reproduce on the v3.1.2 tag either, so I suspect it may be a different problem.

@jjhursey
Member

jjhursey commented May 2, 2019

It's surprising that you could not reproduce with v3.1.2 directly, as we were able to when investigating the similar issue we were tracking.
I'm glad that v3.1.3rc1 did address the issue. I'll bring it over to OMPI in a PR for the 4.0.x series and reference this ticket when it's ready.

@hppritcha
Member

@jjhursey are you close to getting this issue resolved?

@jjhursey
Member

We are waiting for PMIx v3.1.3rc3 to pick up the event handling fix. There were a couple of issues with PMIx v3.1.3rc2 that are being resolved right now. We hope to have that ready by Thursday this week. Once we have it ready, I'll prepare a PR for the OMPI v4.0.x branch.

Per the comment from earlier in the thread, there may be another issue to track down but the PMIx update does make the problem disappear. The PMIx event issue was the bug that I found when working a similar issue with MPIR, TotalView, and the OMPI v4.0.x branch.

@James-A-Clark
Contributor

I can confirm that building OMPI v4.0.x with PMIx v3.1.3rc3 fixes this issue.

@James-A-Clark
Contributor

@jjhursey will the next v4.0.x release have this version of PMIx?

@jjhursey
Member

The PMIx release is being held at the moment so we can investigate a separate issue.

I missed the Open MPI teleconf this week, so I'm not sure if they are holding the next v4.0.x release for the official PMIx release or not. @gpaulsen @hppritcha might know better.

@gpaulsen
Member

gpaulsen commented May 31, 2019 via email

@jjhursey
Member

@rhc54 might have an opinion here. We are trying to scope the problem now. In particular, we want to make sure we have a solid understanding of some of the failures we are seeing and if we need to include any more fixes in the 3.1.3 release before it goes out. I don't have a firm date at the moment.

@rhc54
Contributor

rhc54 commented May 31, 2019

I'm afraid I don't have anything firm either. I'm hoping we can complete things and do the official release by the end of June, but realistically it may slide into July. However, looking at the milestone schedule, v4.0.2 isn't coming out until the end of September at the earliest, so I don't see us holding that up. We will advise if something unexpected comes up.

@gpaulsen
Member

gpaulsen commented Jun 3, 2019

Does this require only a PMIx update, or also an OMPI code change?
We can consider patching PMIx for an earlier OMPI v4.0.2.

@jjhursey
Member

jjhursey commented Jun 4, 2019

It's just a PMIx change, so you need an updated PMIx.

We can give you the rc3, but the problem is that you could knowingly be shipping something that may have other issues. We can discuss this a bit more on the Tuesday call.

@gpaulsen
Member

The new PMIx update has been merged to v4.0.x.

@hppritcha has volunteered to verify this fix, before we close this issue.

@hppritcha
Member

Verified this works with v4.0.x at e547a2b.
