v4.x REGRESSION: Updating to PMIx v3.1.0 has an issue #6247

Closed
@rhc54

Courtesy of @amckinstry (filed originally on PMIx repo as openpmix/openpmix#1032):

This is testing within Debian.

3.1.0rc1 works fine; 3.1.0rc2 fails on 32-bit archs.

See https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=918157

This is with openmpi 3.1.3, which does not compile as-is against rc2 (rc1 was fine), so a patch was needed:
https://salsa.debian.org/hpc-team/openmpi/blob/debian/master/debian/patches/pmix-modex.patch

That patch would be instantly suspect, except that the combination works on 64-bit archs (arm64, amd64, etc.).

The problem is easily reproduced with a simple MPI code on i386:

/*
  sudo apt-get install mpi-default-bin mpi-default-dev
  export OMPI_MCA_plm_rsh_agent=/bin/false
  export OMPI_MCA_rmaps_base_oversubscribe=1
  mpicc -o mpi-test mpi-test.c && mpirun -np 2 ./mpi-test
*/
#include <mpi.h>
int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);
  MPI_Finalize();
  return 0;
}

giving:

export OMPI_MCA_orte_base_help_aggregate=0
export OMPI_MCA_btl_base_verbose=100
alastair@debian32:~$ mpirun -n 2 -mca btl self ./a.out
[debian32:04634] mca: base: components_register: registering framework btl components
[debian32:04634] mca: base: components_register: found loaded component self
[debian32:04634] mca: base: components_register: component self register function successful
[debian32:04634] mca: base: components_open: opening btl components
[debian32:04634] mca: base: components_open: found loaded component self
[debian32:04634] mca: base: components_open: component self open function successful
[debian32:04634] select: initializing btl component self
[debian32:04634] select: init of component self returned success
[debian32:04635] mca: base: components_register: registering framework btl components
[debian32:04635] mca: base: components_register: found loaded component self
[debian32:04635] mca: base: components_register: component self register function successful
[debian32:04635] mca: base: components_open: opening btl components
[debian32:04635] mca: base: components_open: found loaded component self
[debian32:04635] mca: base: components_open: component self open function successful
[debian32:04635] select: initializing btl component self
[debian32:04635] select: init of component self returned success
[debian32:04634] mca: bml: Using self btl for send to [[38777,1],0] on node debian32
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[38777,1],0]) is on host: debian32
  Process 2 ([[38777,1],1]) is on host: debian32
  BTLs attempted: self

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
MPI_INIT has failed because at least one MPI process is unreachable
from another.  This *usually* means that an underlying communication
plugin -- such as a BTL or an MTL -- has either not loaded or not
allowed itself to be used.  Your MPI job will now abort.

You may wish to try to narrow down the problem;

 * Check the output of ompi_info to see which BTL/MTL plugins are
   available.
 * Run your application with MPI_THREAD_SINGLE.
 * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
   if using MTL-based communications) to see exactly which
   communication plugins were considered and/or discarded.
--------------------------------------------------------------------------
[debian32:04634] *** An error occurred in MPI_Init
[debian32:04634] *** reported by process [2541289473,0]
[debian32:04634] *** on a NULL communicator
[debian32:04634] *** Unknown error
[debian32:04634] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[debian32:04634] ***    and potentially your MPI job)
[debian32:04635] mca: bml: Using self btl for send to [[38777,1],1] on node debian32
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[38777,1],1]) is on host: debian32
  Process 2 ([[38777,1],0]) is on host: debian32
  BTLs attempted: self

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
[debian32:04629] PMIX ERROR: UNREACHABLE in file ../../../src/server/pmix_server.c at line 2091
