Description
The MPIR debugging interface is broken in the v4.0.x branch: attempting to use it causes the mpirun process to hang. The bug is present in the v4.0.1 binaries and in a freshly compiled build of the v4.0.x head. It was not present in the 4.0.0 release build.
The bug reproduces on a CentOS 7 machine with both the Intel compiler (icc version 18.2) and gcc version 8.1. No extra configuration flags were specified.
A git bisect shows the breaking change was 335f8c5: Update to PMIx 3.1.2. A simple reproducer is to compile a hello_world MPI program, then launch mpirun under gdb as follows. The behavior for 4.0.0 is correct: the hello_world application runs to completion. With 4.0.1, the mpirun process hangs and the hello_world application is unable to make progress.
I used the following command to reproduce this issue:
gdb -q $(which mpirun) -ex 'start -np 3 ./main' -ex 'set MPIR_being_debugged=1' -ex 'continue' -ex 'quit'
For 4.0.0, everything works fine:
$ module load mpi/openmpi/gcc-8.1.0/4.0.0
$ mpicxx -g -O0 -std=c++11 main.cpp -o main
$ gdb -q $(which mpirun) -ex 'start -np 3 ./main' -ex 'set MPIR_being_debugged=1' -ex 'continue' -ex 'quit'
Reading symbols from /home/sroberts/Tickets/xxx/gcc-installs/openmpi-4.0.0-install/bin/orterun...(no debugging symbols found)...done.
Temporary breakpoint 1 at 0x400db0
Starting program: /home/sroberts/Tickets/HPCL3-624/gcc-installs/openmpi-4.0.0-install/bin/mpirun -np 3 ./main
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Temporary breakpoint 1, 0x0000000000400db0 in main ()
Continuing.
[New Thread 0x7ffff2b39700 (LWP 32335)]
[New Thread 0x7ffff1d24700 (LWP 32336)]
[New Thread 0x7ffff0b01700 (LWP 32337)]
[New Thread 0x7fffebfff700 (LWP 32338)]
--------------------------------------------------------------------------
Open MPI has detected that you have attached a debugger to this MPI
job, and that debugger is using the legacy "MPIR" method of
attachment.
....
disable this warning by setting the OMPI_MPIR_DO_NOT_WARN envar to 1.
--------------------------------------------------------------------------
Detaching after fork from child process 32339.
Detaching after fork from child process 32340.
Detaching after fork from child process 32341.
Hello World!
SUCCESS
[Thread 0x7ffff2b39700 (LWP 32335) exited]
[Thread 0x7ffff1d24700 (LWP 32336) exited]
[Thread 0x7fffebfff700 (LWP 32338) exited]
[Thread 0x7ffff0b01700 (LWP 32337) exited]
[Inferior 1 (process 32331) exited normally]
$
With 4.0.1, the same command hangs:
$ module load mpi/openmpi/gcc-8.1.0/4.0.1
$ mpicxx -g -O0 -std=c++11 main.cpp -o main
$ gdb -q $(which mpirun) -ex 'start -np 3 ./main' -ex 'set MPIR_being_debugged=1' -ex 'continue' -ex 'quit'
Reading symbols from /home/sroberts/Tickets/HPCL3-624/gcc-installs/openmpi-4.0.1-install/bin/orterun...(no debugging symbols found)...done.
Temporary breakpoint 1 at 0x400db0
Starting program: /home/sroberts/Tickets/HPCL3-624/gcc-installs/openmpi-4.0.1-install/bin/mpirun -np 3 ./main
...
disable this warning by setting the OMPI_MPIR_DO_NOT_WARN envar to 1.
--------------------------------------------------------------------------
Detaching after fork from child process 32745.
Detaching after fork from child process 32746.
Detaching after fork from child process 32747.
(hangs forever)
^C
Program received signal SIGINT, Interrupt.
0x00007ffff67fc20d in poll () from /lib64/libc.so.6
A debugging session is active.
Inferior 1 [process 32735] will be killed.
Quit anyway? (y or n) y
$
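Because the failure is a hang rather than a crash, it may help to run the reproducer under a time limit so that each build can be classified without manual intervention (for example while bisecting). The following is only a sketch: the function name `classify_run`, the labels it prints, and the 60-second limit in the usage comment are hypothetical, and it assumes GNU coreutils `timeout`, which exits with status 124 when the limit expires.

```shell
# classify_run LIMIT COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND under a time limit (seconds) and prints
#   ok    if the command finished within the limit with exit status 0
#   hang  if the limit expired (GNU timeout reports this as status 124)
#   fail  for any other non-zero exit status
classify_run() {
    limit="$1"
    shift
    # stdin is redirected so the command never waits on the terminal
    timeout "$limit" "$@" </dev/null >/dev/null 2>&1
    status=$?
    if [ "$status" -eq 0 ]; then
        echo ok
    elif [ "$status" -eq 124 ]; then
        echo hang
    else
        echo fail
    fi
}

# Usage with the reproducer from this report (60 s is an assumed limit):
#   classify_run 60 gdb -q $(which mpirun) -ex 'start -np 3 ./main' \
#       -ex 'set MPIR_being_debugged=1' -ex 'continue' -ex 'quit'
```

Based on the sessions above, this would be expected to print `ok` for the 4.0.0 install and `hang` for 4.0.1, although that has not been verified with this exact wrapper.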