openmpiv5 hangs up #13294

Open
@puneet336

Description

Hi Team,
We recently installed Open MPI v4 and v5 using EasyBuild on our VM-based RHEL 9.5 cluster.
We are able to launch applications with Open MPI v4, though we see some UDP-related error messages; I have opened a separate issue for that: #13293

But when we use Open MPI v5 to launch a basic hello world application, mpirun hangs (I had to terminate it with Ctrl+C):

[user1@server1 openmpi-4.1.2]$ mpirun --version
mpirun (Open MPI) 5.0.3

Report bugs to https://www.open-mpi.org/community/help/
[user1@server1 openmpi-4.1.2]$ module list
Currently Loaded Modulefiles:
 1) GCCcore/13.3.0                  7) libxml2/2.12.7-GCCcore-13.3.0       13) libfabric/1.21.0-GCCcore-13.3.0
 2) zlib/1.3.1-GCCcore-13.3.0       8) libpciaccess/0.18.1-GCCcore-13.3.0  14) PMIx/5.0.2-GCCcore-13.3.0
 3) binutils/2.42-GCCcore-13.3.0    9) hwloc/2.10.0-GCCcore-13.3.0         15) PRRTE/3.0.5-GCCcore-13.3.0
 4) GCC/13.3.0                     10) OpenSSL/3                           16) UCC/1.3.0-GCCcore-13.3.0
 5) numactl/2.0.18-GCCcore-13.3.0  11) libevent/2.1.12-GCCcore-13.3.0      17) OpenMPI/5.0.3-GCC-13.3.0
 6) XZ/5.4.5-GCCcore-13.3.0        12) UCX/1.16.0-GCCcore-13.3.0

Key:
auto-loaded
[user1@server1 openmpi-4.1.2]$ timeout 10s mpirun -np 1 ./a.out
--------------------------------------------------------------------------
PRTE has detected that the head of the session directory tree (where
scratch files and shared memory backing storage will be placed)
resides on a shared file system:

   Directory: /home/user1/tmp
   File system type: nfs

For performance reasons, it is strongly recommended that the session
directory be located on a local file system. This can be controlled by
setting the system temporary directory to be used by PRTE using either
the TMPDIR envar or the "prte_tmpdir_base" MCA param.

If you need the temporary directory to be different on remote nodes
from the local one where prterun is running (e.g., when a login node is
being employed), then you can set the local temporary directory using
the "prte_local_tmpdir_base" MCA param and the one to be used on all
other nodes using the "prte_remote_tmpdir_base" param.

This is only a warning advisory and your job will continue. You can
disable this warning in the future by setting the
"prte_silence_shared_fs" MCA param to "1".
--------------------------------------------------------------------------
--------------------------------------------------------------------------
PRTE has detected that the head of the session directory tree (where
scratch files and shared memory backing storage will be placed)
resides on a shared file system:

   Directory: /home/user1/tmp
   File system type: nfs

For performance reasons, it is strongly recommended that the session
directory be located on a local file system. This can be controlled by
setting the system temporary directory to be used by PRTE using either
the TMPDIR envar or the "prte_tmpdir_base" MCA param.

If you need the temporary directory to be different on remote nodes
from the local one where prterun is running (e.g., when a login node is
being employed), then you can set the local temporary directory using
the "prte_local_tmpdir_base" MCA param and the one to be used on all
other nodes using the "prte_remote_tmpdir_base" param.

This is only a warning advisory and your job will continue. You can
disable this warning in the future by setting the
"prte_silence_shared_fs" MCA param to "1".
--------------------------------------------------------------------------



Abort is in progress...hit ctrl-c again to forcibly terminate

Here is the list of running processes:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1531649 user1  20   0   13084   4992   3968 R  12.5   0.0   0:00.03 top -b -n1 -c -u user1
1366288 user1  20   0   87684  12288   6016 S   0.0   0.0   0:06.25 /ibm/lsf/10.1/linux3.10-glibc2.17-x86_64/etc/res -d /ibm/lsf/conf -p 35433 -P -m phch+
1508853 user1  20   0   88116  13192   6016 S   0.0   0.0   0:16.68 /ibm/lsf/10.1/linux3.10-glibc2.17-x86_64/etc/res -d /ibm/lsf/conf -p 34767 -P -m phch+
1508854 user1  20   0    7264   3456   3200 S   0.0   0.0   0:00.00 /bin/sh /home/user1/.lsbatch/1749150277.2384
1508857 user1  20   0   10272   4852   3828 S   0.0   0.0   0:00.03 /bin/bash
1508980 user1  20   0   10472   4992   3840 S   0.0   0.0   0:00.25 bash
1531200 user1  20   0  109304  18432  11392 S   0.0   0.0   0:00.05 prterun -np 1 ./a.out
1531267 user1  20   0   24356  14600  10880 S   0.0   0.0   0:00.14 /usr/lib/systemd/systemd --user
1531269 user1  20   0  181644   9676   1792 S   0.0   0.0   0:00.00 (sd-pam)
1531277 user1  20   0   25348   7700   5632 S   0.0   0.0   0:00.01 sshd: user1@pts/2
1531278 user1  20   0   10276   4608   3712 S   0.0   0.0   0:00.06 -bash

I tried attaching strace to the prterun process, and I see:

user@server] strace -p 1531200
strace: Process 1531200 attached
epoll_wait(5,

I tried --mca btl self,sm,tcp, but that did not help.

I also tried gdb:

(gdb) where
#0  0x0000149639f0e78e in epoll_wait () from /lib64/libc.so.6
#1  0x000014963a352443 in epoll_dispatch () from /home//commonuser1/MySoftwares/ebinstall_top/software/libevent/2.1.12-GCCcore-13.3.0/lib/libevent_core-2.1.so.7
#2  0x000014963a348d35 in event_base_loop () from /home//commonuser1/MySoftwares/ebinstall_top/software/libevent/2.1.12-GCCcore-13.3.0/lib/libevent_core-2.1.so.7
#3  0x00000000004058c3 in main ()
(gdb) thread apply all bt

Thread 3 (Thread 0x14963831a640 (LWP 1531202) "prterun"):
#0  0x0000149639f0473d in select () from /lib64/libc.so.6
#1  0x000014963a62bf30 in listen_thread () from /common/PRRTE/3.0.5-GCCcore-13.3.0/lib/libprrte.so.3
#2  0x0000149639e8a0ea in start_thread () from /lib64/libc.so.6
#3  0x0000149639f0f150 in clone3 () from /lib64/libc.so.6

Thread 2 (Thread 0x14963851b640 (LWP 1531201) "prterun"):
#0  0x0000149639f0e78e in epoll_wait () from /lib64/libc.so.6
#1  0x000014963a352443 in epoll_dispatch () from /home//commonuser1/MySoftwares/ebinstall_top/software/libevent/2.1.12-GCCcore-13.3.0/lib/libevent_core-2.1.so.7
#2  0x000014963a348d35 in event_base_loop () from /home//commonuser1/MySoftwares/ebinstall_top/software/libevent/2.1.12-GCCcore-13.3.0/lib/libevent_core-2.1.so.7
#3  0x000014963a418d79 in progress_engine () from /common/PMIx/5.0.2-GCCcore-13.3.0/lib/libpmix.so.2
#4  0x0000149639e8a0ea in start_thread () from /lib64/libc.so.6
#5  0x0000149639f0f150 in clone3 () from /lib64/libc.so.6

Thread 1 (Thread 0x14963a022740 (LWP 1531200) "prterun"):
#0  0x0000149639f0e78e in epoll_wait () from /lib64/libc.so.6
#1  0x000014963a352443 in epoll_dispatch () from /home//commonuser1/MySoftwares/ebinstall_top/software/libevent/2.1.12-GCCcore-13.3.0/lib/libevent_core-2.1.so.7
#2  0x000014963a348d35 in event_base_loop () from /home//commonuser1/MySoftwares/ebinstall_top/software/libevent/2.1.12-GCCcore-13.3.0/lib/libevent_core-2.1.so.7
#3  0x00000000004058c3 in main ()
(gdb)
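
In the backtrace above, every thread is parked in an event loop (select or epoll_wait). When collecting several such dumps, a small filter can make that easier to see at a glance. This is only a sketch: the function name `summarize_bt` is made up for illustration, and it assumes the `thread apply all bt` output format shown above.

```shell
#!/usr/bin/env bash
# summarize_bt (hypothetical helper): read "thread apply all bt" output on
# stdin and print the innermost frame (#0) of each thread.
summarize_bt() {
    awk '
        # Remember the thread number from lines like: Thread 3 (Thread 0x... ):
        /^Thread [0-9]+ / { tid = $2 }
        # The first "#0" line after a thread header is its innermost frame;
        # field 4 is the function name in "#0  <addr> in <func> () from <lib>".
        /^#0 / && tid != "" { print "thread " tid ": " $4; tid = "" }
    '
}
```

One possible way to feed it is `gdb -p <pid> -batch -ex "thread apply all bt" | summarize_bt`, where `<pid>` is the prterun process ID.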

Please let me know if there is any other way to narrow down the cause of the hang, or if reinstalling any component could potentially fix the issue.

Update:
I see the same hang with prterun.
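
The PRTE warning in the output above says the session directory (/home/user1/tmp) is on NFS and suggests pointing TMPDIR at a local file system. The warning is described as advisory only, so this may or may not be related to the hang, but it is one variable that can be ruled out. Below is a minimal sketch for picking a non-NFS directory; the function names `fs_type` and `pick_local_tmpdir` and the candidate directories are assumptions for illustration, and it relies on GNU coreutils `stat`.

```shell
#!/usr/bin/env bash
# fs_type (hypothetical helper): print the file system type backing a
# directory, e.g. "nfs", "tmpfs", "xfs" (GNU coreutils stat).
fs_type() {
    stat -f -c %T -- "$1"
}

# pick_local_tmpdir (hypothetical helper): print the first candidate that
# exists, is writable, and is not on NFS; return non-zero if none qualifies.
pick_local_tmpdir() {
    local dir
    for dir in "$@"; do
        [ -d "$dir" ] && [ -w "$dir" ] || continue
        case "$(fs_type "$dir")" in
            nfs*) continue ;;   # skip NFS-backed directories
        esac
        printf '%s\n' "$dir"
        return 0
    done
    return 1
}

# Example use (candidate directories are assumptions, adjust for the cluster):
#   TMPDIR="$(pick_local_tmpdir /dev/shm /tmp)" timeout 10s mpirun -np 1 ./a.out
```

Setting TMPDIR this way follows the first option named in the warning text; the "prte_tmpdir_base" MCA param mentioned there is the alternative.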
