Skip to content

v2.0.x: oshmem: move finalisation to oshmem_onexit #2120

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

ggouaillardet
Copy link
Contributor

…nexit()

so we can use the legacy start_pes even when Open MPI is compiled with
--enable-static or --disable-visibility

(back-ported from commit 92dd719)

…nexit()

so we can use the legacy start_pes even when Open MPI is compiled with
--enable-static or --disable-visibility

(back-ported from commit 92dd719)
@jsquyres
Copy link
Member

@ggouaillardet Is this a bug fix, or a new feature?

@ggouaillardet
Copy link
Contributor Author

A bug fix

@jsquyres
Copy link
Member

This is the v2.0.x version of #2121.

@jsquyres
Copy link
Member

Ok, it's a bug fix. This is new behavior, though -- I think we'll need a good amount of testing to know that our shutdown procedures are safe during the process destructor.

Question: When will this destructor fire?

  • When the process exits normally (either via exit(0) or returning with a zero status from main)
  • When the process exits with a non-zero exit status (either via exit(1) or returning with a non-zero status from main)
  • When the process aborts (e.g., via signal)

@jladd-mlnx
Copy link
Member

@jsquyres Answers to your questions. See shmem_finalize definition below.

 27 void shmem_finalize(void)
 28 {
 29     OPAL_CR_FINALIZE_LIBRARY();
 30     if (oshmem_shmem_globalexit_status != 0)
 31     {
 32         return;
 33     }
 34     oshmem_shmem_finalize();
 35 }
 36 

There is no longer a destructor. oshmem_shmem_finalize is invoked only if all processes exit normally with a non-zero exit status. If any process exits with a non-zero exit status for any reason (e.g. a call to shmem_global_exit, process aborts (e.g. via signal)), then all processes skip oshmem_shmem_finalize.

@igor-ivanov
Copy link
Member

igor-ivanov commented Sep 27, 2016

This change is valid in case on_exit possibility exists only. See

AC_CHECK_FUNCS([on_exit])

@jladd-mlnx @ggouaillardet @jsquyres
Do you think it is reasonable limitation?

@ggouaillardet
Copy link
Contributor Author

oshmem is Linux only, right ?
We could fall back to atexit if on_exit is not available

@ggouaillardet
Copy link
Contributor Author

Note that currently, and if I understand correctly, if on_exit() is not available, then oshmem_shmem_globalexit_status is not set and as a consequence, MPI_Finalize() is not invoked.
Should we change this behavior ?
shmem_finalize() calls MPI_Finalize(), and the callback invokes shmem_finalize() only if not invoked before (e.g. exit on error, or if the legacy start_pe() is used.
IIRC, Open MPI has a different error message if MPI_Finalize() is not invoked at all, or if a task exits with a non zero exit code after MPI_Finalize()

@igor-ivanov
Copy link
Member

@ggouaillardet in original implementation shmem_finalize() could be called as destructor, directly using oshmem_finalize() or as result of oshmem_globalexit()->on_exit().
Initially on_exit() was added to support oshmem_globalexit() function.
As far as oshmem is part of ompi and can not be used w/o one we should call ompi initalization and finalization.
Real application can call shmem_init() only, shmem_init()-MPI_Init(), MPI_init()-shmem_init() to use shmem and mpi in single application.

@jsquyres
Copy link
Member

Copying @ggouaillardet's comment from #2121 (comment), just so that the entire conversation is here on this PR:


for the record, my initial analysis was incorrect.

i thought shmem_finalize() was never invoked under the hood if legacy start_pes() is used.
i ran under the debugger and found that the destructor is indeed always invoked, even if only static libraries are built.
what i missed is that opal_class_finalize(), which is also a destructor, is invoked before shmem_finalize(), which was also a destructor, so shmem_finalize() crashes when destroying objects.
/* this is true at least when i compile with --disable-shared --enable-static --disable-dlopen and in my environment */

i do not know if/how destructors can be ordered in the case of static libraries.

having shmem_finalize() invoked at exit (via on_exit()) ensures it is invoked before opal_class_finalize() so everything is fine.

@jsquyres
Copy link
Member

@ggouaillardet So I'm getting a little lost in the conversation. Does this PR fix a problem, or does it just change the mechanism for how the shutdown stuff is invoked?

@ggouaillardet
Copy link
Contributor Author

@jsquyres it does solve a problem
when configure'd with --disable-shared --enable-static, an oshmem app crashes when exiting if oshmem destructor is invoked after `opal`` destructor.
/* and i do not know who decides in which order destructors are invoked */

@igor-ivanov thanks for the comments, i will review them.
at first glance, i suspect MPI_Init(); shmem_init(); shmem_finalize(); MPI_Finalize(); nor shmem_init(); MPI_Init(); MPI_Finalize(); shmem_finalize(); can work correctly, with or without this PR.

@ggouaillardet
Copy link
Contributor Author

@igor-ivanov an other option that does not require on_exit() is to have the opal destructor invoke shmem_finalize() via a callback mechanism to be implemented.
That would avoid having two destructors that must be invoked in a specific order, and I do not know how to ensure that.

@igor-ivanov
Copy link
Member

@ggouaillardet may be we could use
attribute((constructor (PRIORITY)))
attribute((destructor (PRIORITY)))

see https://phoxis.org/2011/04/27/c-language-constructors-and-destructors-with-gcc/

@ggouaillardet
Copy link
Contributor Author

@igor-ivanov thanks for the pointer, I was not aware of this.
That being said, the link suggests this is GCC only, so unless all major compilers support it, we might not be able to use it yet.

@jsquyres
Copy link
Member

jsquyres commented Oct 4, 2016

We talked about this again on the call today. Everyone seems either neutral or ok with it: it solves a corner case that not many people will run into, but some will. So let's do it.

@jsquyres jsquyres merged commit c0f6d69 into open-mpi:v2.0.x Oct 4, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants