## Background information
### What version of Open MPI are you using?

v4.1.2
### Describe how Open MPI was installed

From the Ubuntu 22.04 package manager: `libopenmpi-dev` 4.1.2-2ubuntu1 and `hwloc` 2.7.0-2.
Also reproducible when building from source in a Docker container. dpkg-buildpackage commands:

```sh
echo "deb-src http://archive.ubuntu.com/ubuntu/ jammy universe" >> /etc/apt/sources.list
apt-get update
apt-get install -y fakeroot
apt-get source libopenmpi3
apt-get build-dep -y libopenmpi3
cd openmpi-4.1.2/
dpkg-buildpackage -rfakeroot -b
cp -r debian /local/openmpi-debian-patched  # copy to a mounted folder for use on the host machine
```
### Please describe the system on which you are running
- Operating system/version: Ubuntu 22.04
- Computer hardware: reproducible on the following CPUs:
  - AMD Ryzen Threadripper 1950X 16-core processor with hyperthreading enabled
  - AMD EPYC 7351P 16-core processor with hyperthreading enabled
- Network type: not relevant
## Details of the problem
On AMD Ryzen and AMD EPYC CPUs, the MCA binding policy `numa` fails to set processor affinity and aborts with a fatal error when the executable is run in singleton mode. Launching the same executable with `mpiexec -n 1` avoids the error.
MWE:

```c
#include <mpi.h>

int main() {
    MPI_Init(NULL, NULL);
    MPI_Finalize();
    return 0;
}
```
Error message:

```console
$ mpicxx mwe.cc
$ OMPI_MCA_hwloc_base_binding_policy="l3cache" ./a.out ; echo $?
0
$ OMPI_MCA_hwloc_base_binding_policy="none" ./a.out ; echo $?
0
$ OMPI_MCA_hwloc_base_binding_policy="core" ./a.out ; echo $?
0
$ OMPI_MCA_hwloc_base_binding_policy="numa" ./a.out ; echo $?
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  Setting processor affinity failed failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[coyote10:741448] Local abort before MPI_INIT completed completed successfully,
but am not able to aggregate error messages, and not able to guarantee that all
other processes were killed!
1
$ OMPI_MCA_hwloc_base_binding_policy="numa" mpiexec -n 1 ./a.out ; echo $?
0
```
The issue also existed in v4.0.3, where it could be worked around with a binary patch to `libopen-rte.so` that replaced the value `HWLOC_OBJ_NODE` = 0xd with 0xc at `orte/mca/ess/base/ess_base_fns.c#L242`. This is no longer possible in v4.1.2.
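
For context, here is a small standalone hwloc sketch (my own illustration, not code from Open MPI) of what that patch amounted to: in hwloc 2.x the enum value `HWLOC_OBJ_NUMANODE` (the successor of `HWLOC_OBJ_NODE`) is 13 (0xd) and `HWLOC_OBJ_GROUP` is 12 (0xc), so patching 0xd to 0xc made the affinity code look up a group object instead of a NUMA node. Presumably this helped because hwloc 2.x moved NUMA nodes out of the main topology tree (they became "memory children"), while group objects still sit in the main tree.

```c
/* Illustrative sketch only (assumes hwloc >= 2.0); build with: cc sketch.c -lhwloc */
#include <hwloc.h>
#include <stdio.h>

int main(void) {
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* HWLOC_OBJ_NUMANODE == 0xd; the binary patch turned it into
     * HWLOC_OBJ_GROUP == 0xc. Both object types expose a cpuset
     * that can be used for binding. */
    hwloc_obj_t obj = hwloc_get_obj_by_type(topo, HWLOC_OBJ_NUMANODE, 0);
    if (obj == NULL)
        obj = hwloc_get_obj_by_type(topo, HWLOC_OBJ_GROUP, 0);

    if (obj != NULL) {
        char buf[256];
        hwloc_bitmap_snprintf(buf, sizeof buf, obj->cpuset);
        printf("binding to cpuset %s\n", buf);
        if (hwloc_set_cpubind(topo, obj->cpuset, HWLOC_CPUBIND_PROCESS) < 0)
            perror("hwloc_set_cpubind");
    }

    hwloc_topology_destroy(topo);
    return 0;
}
```

Note this only illustrates the enum values involved; I have not verified that it follows the same code path as `ess_base_fns.c`.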