v4.x REGRESSION: Updating to PMIx v3.1.0 has an issue #6247

Closed

rhc54 opened this issue Jan 7, 2019 · 7 comments

@rhc54
Contributor

rhc54 commented Jan 7, 2019

Courtesy of @amckinstry (originally filed on the PMIx repo as openpmix/openpmix#1032):

This was found while testing within Debian.

3.1.0rc1 works fine; 3.1.0rc2 fails on 32-bit archs.

See https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=918157

This is with Open MPI 3.1.3, which does not compile as it stands against rc2 (rc1 was fine), so a patch was needed:
https://salsa.debian.org/hpc-team/openmpi/blob/debian/master/debian/patches/pmix-modex.patch

That patch would be instantly suspect, except that the combination works on 64-bit archs (arm64, amd64, etc.).

The problem is easily reproduced with a simple MPI program on i386:

/*
  sudo apt-get install mpi-default-bin mpi-default-dev
  export OMPI_MCA_plm_rsh_agent=/bin/false
  export OMPI_MCA_rmaps_base_oversubscribe=1
  mpicc -o mpi-test mpi-test.c && mpirun -np 2 ./mpi-test
*/
#include <mpi.h>
int main(int argc, char** argv)
{
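  /* per the log below, the failure occurs during MPI_Init on i386 */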
  MPI_Init(&argc, &argv);
  MPI_Finalize();
  return 0;
}

giving:

export OMPI_MCA_orte_base_help_aggregate=0
export OMPI_MCA_btl_base_verbose=100
alastair@debian32:~$ mpirun -n 2 -mca btl self ./a.out
[debian32:04634] mca: base: components_register: registering framework btl components
[debian32:04634] mca: base: components_register: found loaded component self
[debian32:04634] mca: base: components_register: component self register function successful
[debian32:04634] mca: base: components_open: opening btl components
[debian32:04634] mca: base: components_open: found loaded component self
[debian32:04634] mca: base: components_open: component self open function successful
[debian32:04634] select: initializing btl component self
[debian32:04634] select: init of component self returned success
[debian32:04635] mca: base: components_register: registering framework btl components
[debian32:04635] mca: base: components_register: found loaded component self
[debian32:04635] mca: base: components_register: component self register function successful
[debian32:04635] mca: base: components_open: opening btl components
[debian32:04635] mca: base: components_open: found loaded component self
[debian32:04635] mca: base: components_open: component self open function successful
[debian32:04635] select: initializing btl component self
[debian32:04635] select: init of component self returned success
[debian32:04634] mca: bml: Using self btl for send to [[38777,1],0] on node debian32
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[38777,1],0]) is on host: debian32
  Process 2 ([[38777,1],1]) is on host: debian32
  BTLs attempted: self

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
MPI_INIT has failed because at least one MPI process is unreachable
from another.  This *usually* means that an underlying communication
plugin -- such as a BTL or an MTL -- has either not loaded or not
allowed itself to be used.  Your MPI job will now abort.

You may wish to try to narrow down the problem;

 * Check the output of ompi_info to see which BTL/MTL plugins are
   available.
 * Run your application with MPI_THREAD_SINGLE.
 * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
   if using MTL-based communications) to see exactly which
   communication plugins were considered and/or discarded.
--------------------------------------------------------------------------
[debian32:04634] *** An error occurred in MPI_Init
[debian32:04634] *** reported by process [2541289473,0]
[debian32:04634] *** on a NULL communicator
[debian32:04634] *** Unknown error
[debian32:04634] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[debian32:04634] ***    and potentially your MPI job)
[debian32:04635] mca: bml: Using self btl for send to [[38777,1],1] on node debian32
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[38777,1],1]) is on host: debian32
  Process 2 ([[38777,1],0]) is on host: debian32
  BTLs attempted: self

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
[debian32:04629] PMIX ERROR: UNREACHABLE in file ../../../src/server/pmix_server.c at line 2091
rhc54 added this to the v4.0.1 milestone Jan 7, 2019
@rhc54
Contributor Author

rhc54 commented Jan 7, 2019

One thing that just caught my eye: this was filed against OMPI v3.1, which uses PMIx v2.x, not PMIx v3.x. I suspect this is at least part of the problem. Still, it is worth taking a look to see if there is an issue with OMPI v4.

@amckinstry

Do you expect that OMPI 3.1 will support PMIx 3.x in the near future, or just v2.x? As I said, the combination worked with rc1. I haven't tested OMPI v4, and we're entering a freeze soon in Debian, so Debian 10 will ship with 3.1.x.

ggouaillardet added a commit to ggouaillardet/ompi that referenced this issue Jan 8, 2019
The PMIX_MODEX and PMIX_INFO_ARRAY macros were removed from the PMIx 3.1 standard.
Open MPI does not really need them (they are only used to report that these types are not supported),
so simply #ifdef protect them in order to support an external PMIx v3.1.

Refs. open-mpi#6247

Signed-off-by: Gilles Gouaillardet <[email protected]>
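
The guard this commit describes is small. As a rough sketch (abbreviated from the value-unload switch quoted later in this thread, and assuming the PMIx headers define these data-type constants as preprocessor macros, which is what makes the #ifdef test work), the protected cases would look something like:

#ifdef PMIX_MODEX
    /* only compiled against PMIx <= 3.0, where the macro still exists */
    case PMIX_MODEX:
        OPAL_ERROR_LOG(OPAL_ERR_NOT_SUPPORTED);
        rc = OPAL_ERR_NOT_SUPPORTED;
        break;
#endif
#ifdef PMIX_INFO_ARRAY
    case PMIX_INFO_ARRAY:
        OPAL_ERROR_LOG(OPAL_ERR_NOT_SUPPORTED);
        rc = OPAL_ERR_NOT_SUPPORTED;
        break;
#endif

When building against PMIx v3.1, where the macros were removed, both cases simply compile out and nothing else in Open MPI needs to change.
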
@ggouaillardet
Contributor

Let's start by separating the (too many) variables here:

  • External PMIx v3.1 is currently not supported (because of the removed macros), and all branches are affected. I made pmix/ext3x: fix support for external PMIx v3.1 #6251 for the master branch, and will backport it to the v4.0.x branch once merged.

  • ompi v3.1.x has built-in PMIx v2, and the ext2x component for an external PMIx is based on that. ext2x is selected for any external PMIx >= 2 (read: there is no specific ext3x component), and I do not mind backporting the patch to that branch too (that can be seen as a bit fishy, but this is where we are now, and I do not think it is worth breaking it/adding a new component).

  • Once patched, I do not have any issue with an external PMIx from the latest PMIx v3.1 branch in my environment (x86_64 CentOS 7).

  • At this stage, I cannot tell whether the runtime issue is related to (pre) Debian 10 and/or a 32-bit arch.
    I might not have time to sort this out; could you please help me with that? For example, does Open MPI 3.1.3 with external PMIx 3.1.0rc2 work with Debian 9 on a 32-bit arch? What about (pre) Debian 10 on a 64-bit arch? (Both ideally on Intel-like platforms so I can reproduce the issue.) A configure sketch for such a build follows this list.
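
A minimal sketch of such a build, assuming an external PMIx install (the prefix below is hypothetical; adjust it to wherever PMIx 3.1.0rc2 is installed, and note that configure may also need to be pointed at the libevent that PMIx itself was built against):

# hypothetical install prefix for the external PMIx
./configure --with-pmix=/opt/pmix-3.1.0rc2
make -j 4 all install
# then re-run the reproducer from the original report
mpicc -o mpi-test mpi-test.c && mpirun -np 2 ./mpi-test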

Here is the patch to be applied to the v3.1.x branch. (I forgot that pmix/ext2x is mainly autogenerated from the pmix/pmix2x glue for the embedded PMIx; sorry for the confusion. That being said, I do not see how the patch for pmix/ext1x is needed here.)

diff --git a/opal/mca/pmix/pmix2x/pmix2x.c b/opal/mca/pmix/pmix2x/pmix2x.c
index a00a6e6..141e5df 100644
--- a/opal/mca/pmix/pmix2x/pmix2x.c
+++ b/opal/mca/pmix/pmix2x/pmix2x.c
@@ -1001,10 +1001,6 @@ int pmix2x_value_unload(opal_value_t *kv,
         OPAL_ERROR_LOG(OPAL_ERR_NOT_SUPPORTED);
         rc = OPAL_ERR_NOT_SUPPORTED;
         break;
-    case PMIX_MODEX:
-        OPAL_ERROR_LOG(OPAL_ERR_NOT_SUPPORTED);
-        rc = OPAL_ERR_NOT_SUPPORTED;
-        break;
     case PMIX_PERSIST:
         kv->type = OPAL_PERSIST;
         kv->data.uint8 = pmix2x_convert_persist(v->data.persist);
@@ -1111,10 +1107,6 @@ int pmix2x_value_unload(opal_value_t *kv,
         OPAL_ERROR_LOG(OPAL_ERR_NOT_SUPPORTED);
         rc = OPAL_ERR_NOT_SUPPORTED;
         break;
-    case PMIX_INFO_ARRAY:
-        OPAL_ERROR_LOG(OPAL_ERR_NOT_SUPPORTED);
-        rc = OPAL_ERR_NOT_SUPPORTED;
-        break;
 
     default:
         /* silence warnings */
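
To try this on a v3.1.x tree, the diff above applies with standard tooling; the patch file name and directory here are just examples:

# save the diff above as pmix2x-modex.patch (any name works)
cd openmpi-3.1.x                   # top of the Open MPI source tree
patch -p1 < pmix2x-modex.patch     # or: git apply pmix2x-modex.patch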

ggouaillardet added a commit to ggouaillardet/ompi that referenced this issue Jan 8, 2019
The PMIX_MODEX and PMIX_INFO_ARRAY macros were removed from the PMIx 3.1 standard.
Open MPI does not really need them (they are only used to report that these types are not supported),
so simply #ifdef protect them in order to support an external PMIx v3.1.

The change only needs to be made in ext3x/ext3x.c.
But since this file is automatically generated from pmix3x/pmix3x.c, we have to update
the latter file.

Refs. open-mpi#6247

Signed-off-by: Gilles Gouaillardet <[email protected]>

(back-ported from commit open-mpi/ompi@950ba16)
ggouaillardet added a commit to ggouaillardet/ompi that referenced this issue Jan 8, 2019
The PMIX_MODEX and PMIX_INFO_ARRAY macros were removed from the PMIx 3.1 standard.
Open MPI does not really need them (they are only used to report that these types are not supported),
so simply #ifdef protect them in order to support an external PMIx v3.1.

External PMIx v3 is supported via the pmix/ext2x component, and it worked until
PMIx v3.1 removed some macros. The change needed to support external PMIx v3.1 is minimal,
so we do not need to bother creating a new pmix/ext3x component.

The change only needs to be made in ext2x/ext2x.c.
But since this file is automatically generated from pmix2x/pmix2x.c, we have to update
the latter file.

Refs. open-mpi#6247

Signed-off-by: Gilles Gouaillardet <[email protected]>

(back-ported from commit open-mpi/ompi@950ba16)
@ggouaillardet
Contributor

@amckinstry I was able to fix this issue on a 32-bit distro; the fix is in openpmix/openpmix#1036

hppritcha changed the milestone from v4.0.1 to v4.0.2 on Mar 27, 2019
@hppritcha
Member

@ggouaillardet can this be closed?

@gpaulsen
Member

@rhc54 Do you think the latest PMIx update on v4.0.x (#6776) might resolve this?

@rhc54
Contributor Author

rhc54 commented Jun 28, 2019

@gpaulsen I'm not sure what you are asking, as there are two issues intermixed on this ticket. The external-support issue was addressed by @ggouaillardet and committed back in January. The 32-bit issue was also addressed in January and included in an earlier PMIx release; the commit is here.

Bottom line: this issue can be closed.

rhc54 closed this as completed Jun 29, 2019