
openmpi v4 launch error messages #13293

Open
puneet336 opened this issue Jun 5, 2025 · 5 comments

@puneet336

puneet336 commented Jun 5, 2025

Hi Team,
We installed Open MPI 4.1.6 and Open MPI v5 using EasyBuild scripts on a VM-based cluster running RHEL 9.5.
https://docs.easybuild.io/version-specific/supported-software/o/OpenMPI/
Each server in the cluster has the following 3 interfaces:

NAME           TYPE      DEVICE
System ens192  ethernet  ens192
System ens161  ethernet  ens161
System ens256  ethernet  ens256

When we run Open MPI v4, multi-node runs work, but we see many error messages before and after the expected output.

[user@server1 mpi]$ mpirun --version
mpirun (Open MPI) 4.1.6
Report bugs to http://www.open-mpi.org/community/help/
[suser@server1 mpi]$  mpirun -np 4 ./a.out
< error messages>
Hello world from processor server1 rank 1 out of 4 processors
Hello world from processor server2, rank 3 out of 4 processors
Hello world from processor server3 rank 2 out of 4 processors
Hello world from processor server4 rank 0 out of 4 processors
<error messages>

Based on the error messages, I see 2 categories of issues.
Issue 1)

                   server1:rank1: PSM3 can't open nic unit: -1 (err=23)
[1749130772.508535285] server1:rank0.a.out: Unable to create UDP socket for ens161: Address family not supported by protocol
[1749130772.508552855] server1:rank0.a.out: Unable to initialize sockets NIC /sys/class/net/ens161 (unit 0:0)
[1749130772.510515510] server1:rank0.a.out: Unable to create UDP socket for ens192: Address family not supported by protocol
[1749130772.510530731] server1:rank0.a.out: Unable to initialize sockets NIC /sys/class/net/ens192 (unit 1:0)
[1749130772.512386528] server1:rank0.a.out: Unable to create UDP socket for ens256: Address family not supported by protocol
[1749130772.512401122] server1:rank0.a.out: Unable to initialize sockets NIC /sys/class/net/ens256 (unit 2:0)
server1:rank0: PSM3 can't open nic unit: -1 (err=23)
[1749130772.542387747] server2:rank2.a.out: Unable to create UDP socket for ens161: Address family not supported by protocol
[1749130772.542512255] server2:rank2.a.out: Unable to initialize sockets NIC /sys/class/net/ens161 (unit 0:0)
[1749130772.543819831] server2:rank3.a.out: Unable to create UDP socket for ens161: Address family not supported by protocol
[1749130772.543838127] server2:rank3.a.out: Unable to initialize sockets NIC /sys/class/net/ens161 (unit 0:0)
[1749130772.544649294] server2:rank2.a.out: Unable to create UDP socket for ens192: Address family not supported by protocol
[1749130772.544700905] server2:rank2.a.out: Unable to initialize sockets NIC /sys/class/net/ens192 (unit 1:0)
[1749130772.545667705] server2:rank3.a.out: Unable to create UDP socket for ens192: Address family not supported by protocol
[1749130772.545685370] server2:rank3.a.out: Unable to initialize sockets NIC /sys/class/net/ens192 (unit 1:0)
[1749130772.546689089] server2:rank2.a.out: Unable to create UDP socket for ens256: Address family not supported by protocol
[1749130772.546706079] server2:rank2.a.out: Unable to initialize sockets NIC /sys/class/net/ens256 (unit 2:0)
server2:rank2: PSM3 can't open nic unit: -1 (err=23)
[1749130772.547558723] server2:rank3.a.out: Unable to create UDP socket for ens256: Address family not supported by protocol
[1749130772.547571518] server2:rank3.a.out: Unable to initialize sockets NIC /sys/class/net/ens256 (unit 2:0)
server2:rank3: PSM3 can't open nic unit: -1 (err=23)
[1749130772.549656387] server2:rank2.a.out: Unable to create UDP socket for ens161: Address family not supported by protocol
[1749130772.549732373] server2:rank2.a.out: Unable to initialize sockets NIC /sys/class/net/ens161 (unit 0:0)
server2:rank2: PSM3 can't open nic unit: 0 (err=23)
[1749130772.550315258] server2:rank3.a.out: Unable to create UDP socket for ens161: Address family not supported by protocol
[1749130772.550334655] server2:rank3.a.out: Unable to initialize sockets NIC /sys/class/net/ens161 (unit 0:0)
server2:rank3: PSM3 can't open nic unit: 0 (err=23)
[1749130772.552612768] server2:rank2.a.out: Unable to create UDP socket for ens192: Address family not supported by protocol
[1749130772.552688696] server2:rank2.a.out: Unable to initialize sockets NIC /sys/class/net/ens192 (unit 1:0)
server2:rank2: PSM3 can't open nic unit: 1 (err=23)[1749130772.552796054] server2:rank3.a.out: Unable to create UDP socket for ens192: Address family not supported by protocol

Issue 2) At the end of the run I see the following message:

Hello world from processor server1, rank 0 out of 4 processors
[server1:3799223] PMIX ERROR: PMIX_ERR_NO_PERMISSIONS in file dstore_base.c at line 238

I am attaching the complete stdout as ompiv4_error.txt.
Please let me know if any further information is required from my end.
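
To see the two categories at a glance, the saved output can be collapsed into counts per distinct message. The following is a minimal sketch, assuming the log format shown in this post (an optional bracketed timestamp or `[host:pid]` prefix, then `host:rankN`). The function name `summarize_launch_errors` is made up for illustration.

```shell
#!/bin/sh
# Summarize launch noise from a saved mpirun stdout (read on stdin).
# Timestamps, [host:pid] tags and host:rank prefixes are stripped so that
# identical messages from different ranks collapse into one counted line.
summarize_launch_errors() {
    grep -E 'Unable to|PSM3|PMIX ERROR' \
      | sed -E -e 's/^[[:space:]]*\[[0-9.]+\][[:space:]]*//' \
               -e 's/^\[[^]]*\][[:space:]]*//' \
               -e 's/^[^:[:space:]]+:rank[0-9]+(\.[^:]+)?:[[:space:]]*//' \
      | sort | uniq -c | sort -rn
}

# Usage (ompiv4_error.txt is the attachment named in the post):
#   summarize_launch_errors < ompiv4_error.txt
```

Lines where two ranks wrote to the terminal at the same moment (two messages fused on one line) will not collapse cleanly; they show up as their own entries.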

@rhc54
Contributor

rhc54 commented Jun 5, 2025

PMIx sometimes cannot use its shared memory in a VM, so set PMIX_MCA_gds=hash in your environment.

The only UDP I see in OMPI v4 (doing a really quick grep) is in the USNIC BTL, so try adding --mca btl ^usnic to your command line. If that doesn't work, try --mca btl self,sm,tcp.
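
Both suggestions can also be set once in the environment instead of on every command line. This is a minimal sketch: `OMPI_MCA_<param>` is the environment-variable form of `--mca <param>`, and the commented `mpirun` line reuses the 4-rank hello-world run from the original post.

```shell
# PMIx side: use the hash GDS component rather than the shared-memory store.
export PMIX_MCA_gds=hash

# Open MPI side: environment form of "--mca btl ^usnic".
# The leading ^ means "exclude the listed component(s)".
export OMPI_MCA_btl='^usnic'

# Then launch as before, for example:
#   mpirun -np 4 ./a.out
```

The single quotes around `^usnic` are only there to keep the caret literal in shells that treat it specially.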

@puneet336
Author

Thank you for the response @rhc54

  1. Setting PMIX_MCA_gds=hash addressed the dstore_base.c issue.
    without PMIX_MCA_gds=hash:
[1749150568.499905892] server1:rank0.a.out: Unable to create UDP socket for ens192: Address family not supported by protocol
[1749150568.499928562] server1:rank0.a.out: Unable to initialize sockets NIC /sys/class/net/ens192 (unit 1:0)
[1749150568.501945029] server1:rank0.a.out: Unable to create UDP socket for ens256: Address family not supported by protocol
[1749150568.501962452] server1:rank0.a.out: Unable to initialize sockets NIC /sys/class/net/ens256 (unit 2:0)
server1:rank0: PSM3 can't open nic unit: -1 (err=23)
Hello world from processor server1, rank 0 out of 1 processors
[server1:1510238] PMIX ERROR: PMIX_ERR_NO_PERMISSIONS in file dstore_base.c at line 238
[user1@server1 openmpi-4.1.2]$

with PMIX_MCA_gds=hash:


[user1@server1 openmpi-4.1.2]$ export PMIX_MCA_gds=hash
[user1@server1 openmpi-4.1.2]$ mpirun -np 1 ./a.out
...
[1749150733.148099220] server1:rank0.a.out: Unable to create UDP socket for ens192: Address family not supported by protocol
[1749150733.148114471] server1:rank0.a.out: Unable to initialize sockets NIC /sys/class/net/ens192 (unit 1:0)
server1:rank0: [1749150733.150221377] server1:rank0.a.out: Unable to create UDP socket for ens256: Address family not supported by protocol
[1749150733.150236851] server1:rank0.a.out: Unable to initialize sockets NIC /sys/class/net/ens256 (unit 2:0)
PSM3 can't open nic unit: -1 (err=23)
Hello world from processor server1, rank 0 out of 1 processors
[user1@server1 openmpi-4.1.2]$
  2. The --mca btl ^usnic option did not suppress or resolve the UDP socket issues:
[user1@server1 openmpi-4.1.2]$ mpirun --mca btl ^usnic -np 1 ./a.out
[1749150874.624514967] server1:rank0.a.out: Unable to create UDP socket for ens161: Address family not supported by protocol
[1749150874.624541104] server1:rank0.a.out: Unable to initialize sockets NIC /sys/class/net/ens161 (unit 0:0)
[1749150874.626610519] server1:rank0.a.out: Unable to create UDP socket for ens192: Address family not supported by protocol
[1749150874.626626542] server1:rank0.a.out: Unable to initialize sockets NIC /sys/class/net/ens192 (unit 1:0)
server1:rank0: PSM3 can't open nic unit: -1 (err=23)
...
Hello world from processor server1, rank 0 out of 1 processors

With --mca btl self,sm,tc, mpirun failed to launch:

[user1@server1 openmpi-4.1.2]$ mpirun --mca btl self,sm,tc -np 1 ./a.out
[user1@server1 openmpi-4.1.2]$ echo $?
1

@rhc54
Contributor

rhc54 commented Jun 5, 2025

It's --mca btl self,sm,tcp, not --mca btl self,sm,tc. You dropped the "p".

@puneet336
Author

puneet336 commented Jun 5, 2025

Thank you, I retried:

[user1@server1 openmpi-4.1.2]$ mpirun --mca btl self,sm,tcp -np 1 ./a.out
[user1@server1 openmpi-4.1.2]$ echo $?
1

and I get the same issue.

When I remove sm, the launch works:

[user1@server1 openmpi-4.1.2]$ mpirun --mca btl self -np 1 ./a.out
[1749154303.002687282] server1:rank0.a.out: Unable to create UDP socket for ens161: Address family not supported by protocol
[1749154303.002712182] server1:rank0.a.out: Unable to initialize sockets NIC /sys/class/net/ens161 (unit 0:0)
[1749154303.006382453] server1:rank0.a.out: Unable to create UDP socket for ens192: Address family not supported by protocol
[1749154303.006402394] server1:rank0.a.out: Unable to initialize sockets NIC /sys/class/net/ens192 (unit 1:0)
[1749154303.008790739] server1:rank0.a.out: Unable to create UDP socket for ens256: Address family not supported by protocol
[1749154303.008818410] server1:rank0.a.out: Unable to initialize sockets NIC /sys/class/net/ens256 (unit 2:0)
server1:rank0: PSM3 can't open nic unit: -1 (err=23)
Hello world from processor server1, rank 0 out of 1 processors

With sm:

[user1@server1 openmpi-4.1.2]$ mpirun --mca btl self,sm -np 1 ./a.out
[user1@server1 openmpi-4.1.2]$ echo $?
1

Without sm:

[user1@server1 openmpi-4.1.2]$ mpirun --mca btl self,tcp -np 1 ./a.out
[1749154335.260077309] server1:rank0.a.out: Unable to create UDP socket for ens161: Address family not supported by protocol
[1749154335.260100921] server1:rank0.a.out: Unable to initialize sockets NIC /sys/class/net/ens161 (unit 0:0)
[1749154335.262587000] server1:rank0.a.out: Unable to create UDP socket for ens192: Address family not supported by protocol
[1749154335.262603231] server1:rank0.a.out: Unable to initialize sockets NIC /sys/class/net/ens192 (unit 1:0)
[1749154335.264739247] server1:rank0.a.out: Unable to create UDP socket for ens256: Address family not supported by protocol
[1749154335.264754894] server1:rank0.a.out: Unable to initialize sockets NIC /sys/class/net/ens256 (unit 2:0)
server1:rank0: PSM3 can't open nic unit: -1 (err=23)
Hello world from processor server1, rank 0 out of 1 processors
[user1@server1 openmpi-4.1.2]$ echo $?
0

@jsquyres
Member

jsquyres commented Jun 6, 2025

The UDP messages appear to be coming from the PSM3 library:

server1:rank0: PSM3 can't open nic unit: -1 (err=23)

You mention ethernet interfaces, but didn't mention anything more specific than that (there are several OS-bypass / HPC-quality ethernet-based hardware platforms available -- PSM3 is one of them).

Meaning: if you have networking hardware that can utilize the PSM3 library, then the PSM3 stack isn't installed or configured properly because it apparently isn't able to open the NICs successfully. You'll need to investigate your hardware / PSM3 documentation to resolve that; we can't help with that.

If you don't have PSM3-capable hardware, then you should probably remove the PSM3 library from your systems to avoid confusion (depending on what layer is using it, you may need to rebuild Open MPI).
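
One way to find out which layer is pulling in PSM3 is to list the components each layer reports. This is a rough sketch under an assumption not confirmed in this thread: that PSM3 is reached through libfabric (OFI), which is why the filter also matches `ofi`. The helper name `filter_psm_layers` is hypothetical; `ompi_info` ships with Open MPI and `fi_info -l` lists libfabric providers, and each is skipped if not installed.

```shell
#!/bin/sh
# Keep only lines that mention PSM or OFI (libfabric).
# Assumption: PSM3 is typically reached through one of these layers.
filter_psm_layers() {
    grep -i -E 'psm|ofi' || true
}

# Open MPI's own component list, if ompi_info is on PATH.
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info | filter_psm_layers
fi

# libfabric providers, if the fi_info utility is installed.
if command -v fi_info >/dev/null 2>&1; then
    fi_info -l | filter_psm_layers
fi
```

If nothing is printed, neither tool reported a PSM or OFI component, and the library is probably being loaded some other way.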


You're also having shared memory problems in #13294. I'm going to take a guess: you should follow what the error messages are telling you in that ticket and have a TMPDIR on a non-NFS directory. Weird (i.e., bad) things can happen when trying to mount shared memory on NFS-based filesystems.

Doing so may make the vader (i.e., sm) shared memory module in Open MPI v4.1.x work properly.
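
Before changing anything, it can help to check whether the current TMPDIR actually sits on NFS. This is a minimal sketch, assuming GNU coreutils `stat` (as on RHEL), which prints the filesystem type name for `-f -c %T`. The helper names `is_nfs_type` and `fs_type_of` are made up for illustration.

```shell
#!/bin/sh
# Return 0 if the given filesystem type name looks like NFS.
is_nfs_type() {
    case "$1" in
        nfs*) return 0 ;;
        *)    return 1 ;;
    esac
}

# Print the filesystem type backing a directory (GNU coreutils stat).
fs_type_of() {
    stat -f -c %T "$1"
}

# Fall back to /tmp when TMPDIR is unset.
dir="${TMPDIR:-/tmp}"
if is_nfs_type "$(fs_type_of "$dir")"; then
    echo "$dir is on NFS; consider pointing TMPDIR at a node-local directory"
else
    echo "$dir does not appear to be on NFS"
fi
```

Which node-local directory to use instead depends on the cluster; the script only reports, it does not change TMPDIR.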
