master: pmix MPI spawn / list assertion error #2920

Closed
jsquyres opened this issue Feb 3, 2017 · 11 comments
jsquyres commented Feb 3, 2017

From @siegmargross's post on the users mailing list (https://www.mail-archive.com/[email protected]/msg30564.html):


I have installed openmpi-master-201702010209-6cb484a on my "SUSE Linux Enterprise Server 12.2 (x86_64)" with Sun C 5.14 and gcc-6.3.0. Unfortunately, I get errors when I run my spawn programs.

loki spawn 107 mpiexec -np 1 --host loki,loki,nfs1 spawn_intra_comm
Parent process 0: I create 2 slave processes
[nfs1:27716] PMIX ERROR: ERROR in file ../../../../../../../openmpi-master-201702010209-6cb484a/opal/mca/pmix/pmix2x/pmix/src/dstore/pmix_esh.c at line 1029
[nfs1:27716] PMIX ERROR: ERROR in file ../../../../../../../openmpi-master-201702010209-6cb484a/opal/mca/pmix/pmix2x/pmix/src/server/pmix_server_get.c at line 501
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

 Process 1 ([[42193,2],1]) is on host: nfs1
 Process 2 ([[42193,1],0]) is on host: unknown!
 BTLs attempted: self tcp

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
[nfs1:27727] [[42193,2],1] ORTE_ERROR_LOG: Unreachable in file ../../openmpi-master-201702010209-6cb484a/ompi/dpm/dpm.c at line 426
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

 ompi_dpm_dyn_init() failed
 --> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
[nfs1:27727] *** An error occurred in MPI_Init
[nfs1:27727] *** reported by process [2765160450,1]
[nfs1:27727] *** on a NULL communicator
[nfs1:27727] *** Unknown error
[nfs1:27727] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[nfs1:27727] ***    and potentially your MPI job)
loki spawn 108
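
For context, the spawn test sources are not part of this issue, but a minimal sketch of the kind of MPI_Comm_spawn / MPI_Intercomm_merge program being run here could look like the following. The slave count and the output labels are taken from the log above; everything else (structure, variable names) is illustrative rather than Siegmar's actual code.

#include <stdio.h>
#include <mpi.h>

#define NUM_SLAVES 2   /* "I create 2 slave processes" in the log above */

int main(int argc, char *argv[])
{
  MPI_Comm parent, intercomm, all_comm;
  int rank, ntasks;

  MPI_Init(&argc, &argv);
  MPI_Comm_get_parent(&parent);

  if (parent == MPI_COMM_NULL) {
    /* Parent: spawn copies of this same binary, then merge parent and
     * children into one intra-communicator. */
    printf("Parent process 0: I create %d slave processes\n", NUM_SLAVES);
    MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, NUM_SLAVES, MPI_INFO_NULL,
                   0, MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
    MPI_Intercomm_merge(intercomm, 0, &all_comm);
  } else {
    /* Child: merge with the parent's group. */
    MPI_Intercomm_merge(parent, 1, &all_comm);
  }

  MPI_Comm_size(all_comm, &ntasks);
  MPI_Comm_rank(all_comm, &rank);
  printf("    COMM_ALL_PROCESSES ntasks:   %d\n", ntasks);
  printf("    mytid in COMM_ALL_PROCESSES: %d\n", rank);

  MPI_Comm_free(&all_comm);
  MPI_Finalize();
  return 0;
}

In the failing runs above, the children on nfs1 abort inside MPI_Init (ompi_dpm_dyn_init), i.e. during the spawn/connect phase, before any merged-communicator output is produced.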

I used the following commands to build and install the package. ${SYSTEM_ENV} is "Linux" and ${MACHINE_ENV} is "x86_64" for my Linux machine. The options "--enable-mpi-cxx-bindings" and "--enable-mpi-thread-multiple" are now unrecognized; presumably they are enabled automatically now. "configure" also prints a warning that it asks me to report.

mkdir openmpi-master-201702010209-6cb484a-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc
cd openmpi-master-201702010209-6cb484a-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc

../openmpi-master-201702010209-6cb484a/configure \
 --prefix=/usr/local/openmpi-master_64_cc \
 --libdir=/usr/local/openmpi-master_64_cc/lib64 \
 --with-jdk-bindir=/usr/local/jdk1.8.0_66/bin \
 --with-jdk-headers=/usr/local/jdk1.8.0_66/include \
 JAVA_HOME=/usr/local/jdk1.8.0_66 \
 LDFLAGS="-m64 -mt -Wl,-z -Wl,noexecstack" CC="cc" CXX="CC" FC="f95" \
 CFLAGS="-m64 -mt" CXXFLAGS="-m64" FCFLAGS="-m64" \
 CPP="cpp" CXXCPP="cpp" \
 --enable-mpi-cxx \
 --enable-mpi-cxx-bindings \
 --enable-cxx-exceptions \
 --enable-mpi-java \
 --enable-mpi-thread-multiple \
 --with-hwloc=internal \
 --without-verbs \
 --with-wrapper-cflags="-m64 -mt" \
 --with-wrapper-cxxflags="-m64" \
 --with-wrapper-fcflags="-m64" \
 --with-wrapper-ldflags="-mt" \
 --enable-debug \
 |& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_cc

make |& tee log.make.$SYSTEM_ENV.$MACHINE_ENV.64_cc
rm -r /usr/local/openmpi-master_64_cc.old
mv /usr/local/openmpi-master_64_cc /usr/local/openmpi-master_64_cc.old
make install |& tee log.make-install.$SYSTEM_ENV.$MACHINE_ENV.64_cc
make check |& tee log.make-check.$SYSTEM_ENV.$MACHINE_ENV.64_cc
...
checking numaif.h usability... no
checking numaif.h presence... yes
configure: WARNING: numaif.h: present but cannot be compiled
configure: WARNING: numaif.h:     check for missing prerequisite headers?
configure: WARNING: numaif.h: see the Autoconf documentation
configure: WARNING: numaif.h:     section "Present But Cannot Be Compiled"
configure: WARNING: numaif.h: proceeding with the compiler's result
configure: WARNING:     ## ------------------------------------------------------ ##
configure: WARNING:     ## Report this to http://www.open-mpi.org/community/help/ ##
configure: WARNING:     ## ------------------------------------------------------ ##
checking for numaif.h... no
...

I get the following errors if I run "spawn_master" or "spawn_multiple_master".

loki spawn 108 mpiexec -np 1 --host loki,loki,loki,nfs1,nfs1 spawn_master

Parent process 0 running on loki
 I create 4 slave processes

[nfs1:29189] *** Process received signal ***
[nfs1:29189] Signal: Aborted (6)
[nfs1:29189] Signal code:  (-6)
[nfs1:29189] PMIX ERROR: ERROR in file ../../../../../../../openmpi-master-201702010209-6cb484a/opal/mca/pmix/pmix2x/pmix/src/dstore/pmix_esh.c at line 1029
[nfs1:29189] PMIX ERROR: ERROR in file ../../../../../../../openmpi-master-201702010209-6cb484a/opal/mca/pmix/pmix2x/pmix/src/server/pmix_server_get.c at line 501
[nfs1:29189] PMIX ERROR: ERROR in file ../../../../../../../openmpi-master-201702010209-6cb484a/opal/mca/pmix/pmix2x/pmix/src/dstore/pmix_esh.c at line 1029
[nfs1:29189] PMIX ERROR: ERROR in file ../../../../../../../openmpi-master-201702010209-6cb484a/opal/mca/pmix/pmix2x/pmix/src/server/pmix_server_get.c at line 501
Warning :: pmix_list_remove_item - the item 0x7f03e001b5b0 is not on the list 0x7f03e8760fc8
orted: ../../../../../../../openmpi-master-201702010209-6cb484a/opal/mca/pmix/pmix2x/pmix/src/server/pmix_server_get.c:587: pmix_pending_resolve: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((pmix_object_t *) (ptr))->obj_magic_id' failed.
[nfs1:29189] [ 0] /lib64/libpthread.so.0(+0xf870)[0x7f03eaca5870]
[nfs1:29189] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7f03ea9230c7]
[nfs1:29189] [ 2] /lib64/libc.so.6(abort+0x118)[0x7f03ea924478]
[nfs1:29189] [ 3] /lib64/libc.so.6(+0x2e146)[0x7f03ea91c146]
[nfs1:29189] [ 4] /lib64/libc.so.6(+0x2e1f2)[0x7f03ea91c1f2]
[nfs1:29189] [ 5] /usr/local/openmpi-master_64_cc/lib64/openmpi/mca_pmix_pmix2x.so(pmix_pending_resolve+0x2bc)[0x7f03e8382fbc]
[nfs1:29189] [ 6] /usr/local/openmpi-master_64_cc/lib64/openmpi/mca_pmix_pmix2x.so(+0x1557b9)[0x7f03e83837b9]
[nfs1:29189] [ 7] /usr/local/openmpi-master_64_cc/lib64/libopen-pal.so.0(+0x270d2b)[0x7f03ec065d2b]
[nfs1:29189] [ 8] /usr/local/openmpi-master_64_cc/lib64/libopen-pal.so.0(+0x27106a)[0x7f03ec06606a]
[nfs1:29189] [ 9] /usr/local/openmpi-master_64_cc/lib64/libopen-pal.so.0(opal_libevent2022_event_base_loop+0x2d9)[0x7f03ec0669b9]
[nfs1:29189] [10] /usr/local/openmpi-master_64_cc/lib64/openmpi/mca_pmix_pmix2x.so(+0x1e1dc4)[0x7f03e840fdc4]
[nfs1:29189] [11] /lib64/libpthread.so.0(+0x80a4)[0x7f03eac9e0a4]
[nfs1:29189] [12] /lib64/libc.so.6(clone+0x6d)[0x7f03ea9d302d]
[nfs1:29189] *** End of error message ***
Abort
--------------------------------------------------------------------------
ORTE has lost communication with its daemon located on node:

 hostname:  nfs1

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
loki spawn 109
artpol84 commented Feb 3, 2017

I can reproduce this; checking ...

artpol84 added a commit to artpol84/ompi that referenced this issue Feb 4, 2017
Register the namespace even if there are no node-local processes that
belong to it. We need this for the MPI_Spawn case.

Addressing open-mpi#2920.
Was introduced in be3ef77.

Signed-off-by: Artem Polyakov <[email protected]>
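
As a rough illustration of the idea in that commit message (a sketch of the concept only, not the actual diff; the helper name and the empty callback are made up), the host RTE would register a spawned job's namespace with its local PMIx server even when none of that job's processes run locally, so that later lookups against that namespace, such as the dstore/pmix_server_get requests failing above, can be resolved:

#include <stddef.h>
#include <pmix_server.h>

/* Completion callback for the asynchronous registration; nothing to
 * release in this sketch. */
static void reg_cbfunc(pmix_status_t status, void *cbdata)
{
  (void)status;
  (void)cbdata;
}

/* Hypothetical helper: register a job's namespace regardless of how many
 * of its processes are local.  Per the commit message, skipping this when
 * nlocalprocs == 0 breaks the MPI_Comm_spawn case. */
pmix_status_t register_job_nspace(const char *nspace, int nlocalprocs,
                                  pmix_info_t *info, size_t ninfo)
{
  return PMIx_server_register_nspace(nspace, nlocalprocs, info, ninfo,
                                     reg_cbfunc, NULL);
}
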
artpol84 commented Feb 4, 2017

$ ./mpirun_my.sh 1 ./spawn  
Parent process 0: I create 3 slave processes



Child process 0 running on cn1
    MPI_COMM_WORLD ntasks:              3
    COMM_ALL_PROCESSES ntasks:          4
    mytid in COMM_ALL_PROCESSES:        1

Child process 2 running on cn1
    MPI_COMM_WORLD ntasks:              3
    COMM_ALL_PROCESSES ntasks:          4
    mytid in COMM_ALL_PROCESSES:        3
Parent process 0 running on cn1
    MPI_COMM_WORLD ntasks:              1
    COMM_CHILD_PROCESSES ntasks_local:  1
    COMM_CHILD_PROCESSES ntasks_remote: 3
    COMM_ALL_PROCESSES ntasks:          4
    mytid in COMM_ALL_PROCESSES:        0
Child process 1 running on cn2
    MPI_COMM_WORLD ntasks:              3
    COMM_ALL_PROCESSES ntasks:          4
    mytid in COMM_ALL_PROCESSES:        2

sjeaugey commented Feb 6, 2017

Also seen on MTT, per #2863.
Or maybe it is a new one in a different place. In any case, MTT still reports it this morning:
https://mtt.open-mpi.org/index.php?do_redir=2388

artpol84 commented Feb 6, 2017

Yes, it looks very similar. I guess this should fix it.

artpol84 commented Feb 6, 2017

However, this fix doesn't explain the list failure; it only removes the problem with the missing data.
I think that, in addition, ORTE probably has problems handling missing data.

rhc54 commented Feb 15, 2017

I think I have this fixed with 0c8609c - let's see how MTT does overnight.

artpol84 commented:

@rhc54, out of curiosity: I don't see any visible changes related to this topic. What exactly fixes the problem?

rhc54 commented Feb 15, 2017

You have to stop the progress thread prior to tearing down the infrastructure. The list problem was caused by the messaging system continuing to operate in the progress thread while the PMIx_Finalize routine was tearing down the messaging framework. See the changes in the server, client, and tool routines where we now stop the progress thread prior to calling rte_finalize.
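
To illustrate the ordering described here, below is a minimal, self-contained sketch using plain pthreads and an atomic flag (the names and the commented-out list handling are illustrative, not the actual PMIx progress-thread code). The point is that the progress thread is stopped and joined before the lists it manipulates are destructed; in the opposite order the event loop races with finalize, which is exactly the kind of failure behind the "item is not on the list" warning and the magic-id assertion shown in the log above.

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool progress_active = true;

/* Progress thread: drives the event loop that pushes/pops items on the
 * pending-request lists (a real implementation would poll with a timeout
 * here instead of spinning). */
static void *progress_thread(void *arg)
{
  (void)arg;
  while (atomic_load(&progress_active)) {
    /* ... run one iteration of the event loop ... */
  }
  return NULL;
}

static void finalize(pthread_t tid)
{
  /* 1. Stop and join the progress thread first ... */
  atomic_store(&progress_active, false);
  pthread_join(tid, NULL);

  /* 2. ... and only then tear down the messaging infrastructure
   * (destruct the pending-request lists, close fds, etc.).  Done the
   * other way around, the still-running event loop can touch lists
   * that are already being destructed. */
}

int main(void)
{
  pthread_t tid;
  pthread_create(&tid, NULL, progress_thread, NULL);
  /* ... normal operation ... */
  finalize(tid);
  return 0;
}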

artpol84 commented:

So it doesn't address the spawn problem, right?

rhc54 commented Feb 15, 2017

No - I revised the fix for spawn in #2977

rhc54 commented Feb 16, 2017

This appears to now be fixed - still seeing the ptl_base_send errors, but that's in a different issue.

rhc54 closed this as completed Feb 16, 2017
bosilca pushed a commit to bosilca/ompi that referenced this issue Mar 7, 2017
Register the namespace even if there are no node-local processes that
belong to it. We need this for the MPI_Spawn case.

Addressing open-mpi#2920.
Was introduced in be3ef77.

Signed-off-by: Artem Polyakov <[email protected]>