Hi Team,
We recently installed Open MPI v4 and v5 using EasyBuild on our VM-based RHEL 9.5 cluster.
We are able to launch applications with Open MPI v4, though we see some UDP-related error messages; I have created a separate tracker for that here: #13293
However, when we use Open MPI v5 to launch a basic hello world application, mpirun hangs (I had to terminate it with Ctrl+C):
[user1@server1 openmpi-4.1.2]$ mpirun --version
mpirun (Open MPI) 5.0.3
Report bugs to https://www.open-mpi.org/community/help/
[user1@server1 openmpi-4.1.2]$ module list
Currently Loaded Modulefiles:
 1) GCCcore/13.3.0                  7) libxml2/2.12.7-GCCcore-13.3.0       13) libfabric/1.21.0-GCCcore-13.3.0
 2) zlib/1.3.1-GCCcore-13.3.0       8) libpciaccess/0.18.1-GCCcore-13.3.0  14) PMIx/5.0.2-GCCcore-13.3.0
 3) binutils/2.42-GCCcore-13.3.0    9) hwloc/2.10.0-GCCcore-13.3.0         15) PRRTE/3.0.5-GCCcore-13.3.0
 4) GCC/13.3.0                     10) OpenSSL/3                           16) UCC/1.3.0-GCCcore-13.3.0
 5) numactl/2.0.18-GCCcore-13.3.0  11) libevent/2.1.12-GCCcore-13.3.0      17) OpenMPI/5.0.3-GCC-13.3.0
 6) XZ/5.4.5-GCCcore-13.3.0        12) UCX/1.16.0-GCCcore-13.3.0
Key:
auto-loaded
[user1@server1 openmpi-4.1.2]$ timeout 10s mpirun -np 1 ./a.out
--------------------------------------------------------------------------
PRTE has detected that the head of the session directory tree (where
scratch files and shared memory backing storage will be placed)
resides on a shared file system:
Directory: /home/user1/tmp
File system type: nfs
For performance reasons, it is strongly recommended that the session
directory be located on a local file system. This can be controlled by
setting the system temporary directory to be used by PRTE using either
the TMPDIR envar or the "prte_tmpdir_base" MCA param.
If you need the temporary directory to be different on remote nodes
from the local one where prterun is running (e.g., when a login node is
being employed), then you can set the local temporary directory using
the "prte_local_tmpdir_base" MCA param and the one to be used on all
other nodes using the "prte_remote_tmpdir_base" param.
This is only a warning advisory and your job will continue. You can
disable this warning in the future by setting the
"prte_silence_shared_fs" MCA param to "1".
--------------------------------------------------------------------------
--------------------------------------------------------------------------
PRTE has detected that the head of the session directory tree (where
scratch files and shared memory backing storage will be placed)
resides on a shared file system:
Directory: /home/user1/tmp
File system type: nfs
For performance reasons, it is strongly recommended that the session
directory be located on a local file system. This can be controlled by
setting the system temporary directory to be used by PRTE using either
the TMPDIR envar or the "prte_tmpdir_base" MCA param.
If you need the temporary directory to be different on remote nodes
from the local one where prterun is running (e.g., when a login node is
being employed), then you can set the local temporary directory using
the "prte_local_tmpdir_base" MCA param and the one to be used on all
other nodes using the "prte_remote_tmpdir_base" param.
This is only a warning advisory and your job will continue. You can
disable this warning in the future by setting the
"prte_silence_shared_fs" MCA param to "1".
--------------------------------------------------------------------------
Abort is in progress...hit ctrl-c again to forcibly terminate
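Since the warning above says the session directory is on NFS, one thing worth ruling out is a rerun with TMPDIR pointing at a node-local file system. This is only a minimal sketch: it assumes /tmp is local on this node and that ./a.out is the hello world binary from above. The `prte-session` directory name is made up for illustration, and the run is skipped if mpirun or ./a.out is not present.

```shell
# Assumption: /tmp is a node-local file system (not NFS) on this host.
# "prte-session" is an illustrative name, not something PRTE requires.
local_tmp="$(mktemp -d /tmp/prte-session.XXXXXX)"

# The PRTE warning names TMPDIR as one way to relocate the session directory.
export TMPDIR="$local_tmp"

# Report the file system type backing the new TMPDIR (GNU stat).
fs_type="$(stat -f -c %T "$TMPDIR" 2>/dev/null || echo unknown)"
echo "TMPDIR=$TMPDIR (file system type: $fs_type)"

# Re-run the same test as above, but only if the launcher and binary exist.
if command -v mpirun >/dev/null 2>&1 && [ -x ./a.out ]; then
    timeout 10s mpirun -np 1 ./a.out || echo "mpirun exited with status $?"
else
    echo "mpirun or ./a.out not found; skipping run"
fi
```

If the hang persists with a local TMPDIR, the NFS warning is probably unrelated to it.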
Here is the list of my running processes at the time of the hang:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1531649 user1 20 0 13084 4992 3968 R 12.5 0.0 0:00.03 top -b -n1 -c -u user1
1366288 user1 20 0 87684 12288 6016 S 0.0 0.0 0:06.25 /ibm/lsf/10.1/linux3.10-glibc2.17-x86_64/etc/res -d /ibm/lsf/conf -p 35433 -P -m phch+
1508853 user1 20 0 88116 13192 6016 S 0.0 0.0 0:16.68 /ibm/lsf/10.1/linux3.10-glibc2.17-x86_64/etc/res -d /ibm/lsf/conf -p 34767 -P -m phch+
1508854 user1 20 0 7264 3456 3200 S 0.0 0.0 0:00.00 /bin/sh /home/user1/.lsbatch/1749150277.2384
1508857 user1 20 0 10272 4852 3828 S 0.0 0.0 0:00.03 /bin/bash
1508980 user1 20 0 10472 4992 3840 S 0.0 0.0 0:00.25 bash
1531200 user1 20 0 109304 18432 11392 S 0.0 0.0 0:00.05 prterun -np 1 ./a.out
1531267 user1 20 0 24356 14600 10880 S 0.0 0.0 0:00.14 /usr/lib/systemd/systemd --user
1531269 user1 20 0 181644 9676 1792 S 0.0 0.0 0:00.00 (sd-pam)
1531277 user1 20 0 25348 7700 5632 S 0.0 0.0 0:00.01 sshd: user1@pts/2
1531278 user1 20 0 10276 4608 3712 S 0.0 0.0 0:00.06 -bash
I tried attaching strace to the prterun process, and I see:
user@server] strace -p 1531200
strace: Process 1531200 attached
epoll_wait(5,
I tried --mca btl self,sm,tcp, but that did not help.
I also attached gdb to prterun:
(gdb) where
#0 0x0000149639f0e78e in epoll_wait () from /lib64/libc.so.6
#1 0x000014963a352443 in epoll_dispatch () from /home//commonuser1/MySoftwares/ebinstall_top/software/libevent/2.1.12-GCCcore-13.3.0/lib/libevent_core-2.1.so.7
#2 0x000014963a348d35 in event_base_loop () from /home//commonuser1/MySoftwares/ebinstall_top/software/libevent/2.1.12-GCCcore-13.3.0/lib/libevent_core-2.1.so.7
#3 0x00000000004058c3 in main ()
(gdb) thread apply all bt
Thread 3 (Thread 0x14963831a640 (LWP 1531202) "prterun"):
#0 0x0000149639f0473d in select () from /lib64/libc.so.6
#1 0x000014963a62bf30 in listen_thread () from /common/PRRTE/3.0.5-GCCcore-13.3.0/lib/libprrte.so.3
#2 0x0000149639e8a0ea in start_thread () from /lib64/libc.so.6
#3 0x0000149639f0f150 in clone3 () from /lib64/libc.so.6
Thread 2 (Thread 0x14963851b640 (LWP 1531201) "prterun"):
#0 0x0000149639f0e78e in epoll_wait () from /lib64/libc.so.6
#1 0x000014963a352443 in epoll_dispatch () from /home//commonuser1/MySoftwares/ebinstall_top/software/libevent/2.1.12-GCCcore-13.3.0/lib/libevent_core-2.1.so.7
#2 0x000014963a348d35 in event_base_loop () from /home//commonuser1/MySoftwares/ebinstall_top/software/libevent/2.1.12-GCCcore-13.3.0/lib/libevent_core-2.1.so.7
#3 0x000014963a418d79 in progress_engine () from /common/PMIx/5.0.2-GCCcore-13.3.0/lib/libpmix.so.2
#4 0x0000149639e8a0ea in start_thread () from /lib64/libc.so.6
#5 0x0000149639f0f150 in clone3 () from /lib64/libc.so.6
Thread 1 (Thread 0x14963a022740 (LWP 1531200) "prterun"):
#0 0x0000149639f0e78e in epoll_wait () from /lib64/libc.so.6
#1 0x000014963a352443 in epoll_dispatch () from /home//commonuser1/MySoftwares/ebinstall_top/software/libevent/2.1.12-GCCcore-13.3.0/lib/libevent_core-2.1.so.7
#2 0x000014963a348d35 in event_base_loop () from /home//commonuser1/MySoftwares/ebinstall_top/software/libevent/2.1.12-GCCcore-13.3.0/lib/libevent_core-2.1.so.7
#3 0x00000000004058c3 in main ()
(gdb)
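For repeated captures, the same backtraces can be collected non-interactively with gdb's batch mode. This is a rough sketch: `dump_stacks` is a hypothetical helper name, and the only gdb options it uses are `-p`, `-batch`, and `-ex`.

```shell
# Hypothetical helper: print backtraces for every thread of a running process.
dump_stacks() {
    pid="$1"

    # kill -0 only checks that the process exists and can be signalled.
    if ! kill -0 "$pid" 2>/dev/null; then
        echo "no such process (or no permission): $pid" >&2
        return 1
    fi

    if ! command -v gdb >/dev/null 2>&1; then
        echo "gdb not found in PATH" >&2
        return 2
    fi

    # Attach, run the same command as in the interactive session, then detach.
    gdb -p "$pid" -batch -ex "thread apply all bt" 2>&1
}
```

For example, `dump_stacks 1531200 > prterun-stacks.txt` would save the output for the prterun PID shown above. Taking two dumps a few seconds apart shows whether any thread is making progress.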
Please let me know if there is any other way to narrow down the cause of the hang, or whether reinstalling any component could potentially fix the issue.
Update:
I see the same hang with prterun.