v2.x: MPI singleton + PMIx dstore fails #2897

Closed
kawashima-fj opened this issue Feb 2, 2017 · 4 comments
@kawashima-fj (Member)

@rhc54 As discussed in #2859, when I enable the PMIx dstore, an MPI process run as a singleton (launched directly, without mpiexec) fails with the following message on the v2.x branch.

--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
  
  orte_ess_init failed
  --> Returned value Bad parameter (-5) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
  
  ompi_mpi_init: ompi_rte_init failed
  --> Returned "Bad parameter" (-5) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)

The problem appears to be in the fork_hnp function of the singleton ESS. It checks for an exact number of PMIx parameters, but the count varies when dstore is enabled; PMIX_DSTORE_ESH_BASE_PATH is probably the added entry (see the sketch after the gdb excerpt below).

https://github.com/open-mpi/ompi/blob/v2.x/orte/mca/ess/singleton/ess_singleton_module.c#L615

615             if (4 != opal_argv_count(argv)) {
(gdb) n
616                 opal_argv_free(argv);
(gdb) p cptr
$8 = 0x6533cc "PMIX_NAMESPACE=399310849,PMIX_RANK=0,PMIX_SERVER_URI=pmix-server:22133:/tmp/openmpi-sessions-1000@imtofu2_0/6093/pmix-22133,PMIX_SECURITY_MODE=native,PMIX_DSTORE_ESH_BASE_PATH=/tmp/openmpi-sessions-10"...
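
For illustration, here is a minimal standalone sketch of that failure mode, assuming the environment string is split on commas the way opal_argv_split/opal_argv_count would split it. The string below mirrors the gdb output above, with the truncated paths elided as /tmp/...; this is hypothetical demonstration code, not the Open MPI source.

/* Hypothetical sketch (not the Open MPI source): split the environment
 * string received from the singleton daemon and apply the same
 * "exactly 4 entries" test as ess_singleton_module.c:615 on v2.x. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    /* With dstore enabled, PMIX_DSTORE_ESH_BASE_PATH makes this 5
     * entries instead of the expected 4 (paths elided as /tmp/...). */
    const char *cptr =
        "PMIX_NAMESPACE=399310849,PMIX_RANK=0,"
        "PMIX_SERVER_URI=pmix-server:22133:/tmp/...,"
        "PMIX_SECURITY_MODE=native,"
        "PMIX_DSTORE_ESH_BASE_PATH=/tmp/...";

    int count = 0;
    char *copy = strdup(cptr);
    for (char *tok = strtok(copy, ","); NULL != tok; tok = strtok(NULL, ","))
        count++;
    free(copy);

    if (4 != count) {
        /* This is the branch the singleton takes: fork_hnp rejects the
         * list and returns "Bad parameter" (-5), which surfaces as the
         * orte_ess_init / ompi_rte_init failures quoted above. */
        fprintf(stderr, "expected 4 PMIx parameters, got %d\n", count);
        return 1;
    }
    return 0;
}

With dstore enabled the split yields 5 entries, so the 4-entry test fails even though every entry is valid.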

Master seems to have the fix, probably a1e8e58. Is cherry-picking that commit sufficient?

kawashima-fj added this to the v2.1.0 milestone on Feb 2, 2017
@kawashima-fj (Member, Author)

At least 93e7384 and fb5bcc4 are also needed in addition to a1e8e58. I cannot determine whether any other commits to orte/mca/ess/singleton on master are needed. @rhc54 @ggouaillardet Could you take a look?

@ggouaillardet (Contributor)

@kawashima-fj At first glance, that looks good to me.
FWIW, I just noticed that fb5bcc4 not only plugs a memory leak (as indicated by the commit message), it also fixes an array overflow when more than four PMIX_* environment variables are set.
Thanks!
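
To make the overflow concrete, here is a hypothetical standalone sketch of the pattern, assuming a buffer sized for exactly four PMIX_* variables; the array name and scan loop are mine, not the actual Open MPI code. Dropping the bounds check marked below reproduces the overflow once a fifth variable such as PMIX_DSTORE_ESH_BASE_PATH is present.

/* Hypothetical sketch (not the Open MPI source): gather PMIX_*
 * environment variables into an array sized for 4 entries. */
#include <stdio.h>
#include <string.h>

extern char **environ;

int main(void)
{
    char *pmix_vars[4];
    int n = 0;

    for (char **env = environ; NULL != *env; env++) {
        if (0 == strncmp(*env, "PMIX_", 5)) {
            if (n < 4) {              /* without this guard, a 5th PMIX_* */
                pmix_vars[n] = *env;  /* variable writes past the array   */
            }
            n++;
        }
    }

    int stored = n < 4 ? n : 4;
    printf("saw %d PMIX_* variables, stored %d:\n", n, stored);
    for (int i = 0; i < stored; i++)
        printf("  %s\n", pmix_vars[i]);
    return 0;
}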

@hppritcha (Member)

This PR requires a reviewer.

@ggouaillardet (Contributor)

@hppritcha This is an issue, not a PR.
The referenced PR was both reviewed and merged, so I am closing this issue.
