-
-
Notifications
You must be signed in to change notification settings - Fork 55
[WIP] Support ULFM2 #624
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Support ULFM2 #624
Conversation
Codecov Report
@@ Coverage Diff @@
## master #624 +/- ##
=======================================
Coverage 55.73% 55.73%
=======================================
Files 4 4
Lines 3158 3158
=======================================
Hits 1760 1760
Misses 1398 1398 |
@nathanweeks any updates on this PR? @zbeekman and I are working on a new release. We won't merge this PR until the WIP is removed, but I just thought I'd check with you in case it's ready for release. |
@rouson : No updates yet. The aforementioned ULFM2 MPI-thread-level issue is still a blocker; if/when that is resolved, it's possible additional changes to mpi_caf.c may be needed before all tests pass & ULFM2 support could be advertised. |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. |
@nathanweeks shall we close this PR or do you have plans to finish it and remove the "WIP" from the title? There have been several commits to the |
Hi @rouson, I'd like an opportunity to test/revise sometime after the PR to add ULFM support to Open-MPI (open-mpi/ompi#7740 ) is merged. |
@nathanweeks The PR to add ULFM support to Open-MPI (open-mpi/ompi#7740 ) was merged was merged on February 10. Are you still interested in testing/revising this PR? |
@rouson: I'm definitely still interested in testing & merging this PR. It's been a while since I've attempted building OpenCoarrays using Open MPI built from the master branch; IIRC the last time I tried (over a year ago) I was blocked by open-mpi/ompi#8086 when attempting to run any code. I'll check to see if this issue has been resolved with the current Open MPI, and if so try to finish this PR before SC21. |
Open-MPI doesn't support the deprecated MPI_Errhandler_set() unless configured with --enable-mpi1-compatibility
Change (MPICH-specific) --reenable-auto-cleanup cafrun option to --disable-failed-images, and select appropriate mpiexec option based on MPI implementation (applicable only when CAF_ENABLE_FAILED_IMAGES=TRUE)
Open MPI mpiexec --enable-recovery to enable recovery from image failure
cda749c
to
3e8f663
Compare
Thanks, @nathanweeks. Let me know if you think this is ready to merge. On a related note, a question came up at last week's Fortran committee meeting about whether the MPI 4.0 standard supports fault tolerance. I think not. Do you know whether the features in ULFM2 are headed toward inclusion in a future MPI standard? |
HI @rouson , the failed images tests (and several tests that exercise SYNC IMAGES) fail when using Open MPI v5.0.0rc2. Reproducible using the following Dockerfile (which TBH doesn't need to involve a multi-stage build for testing):
Writing the above Dockerfile in the root of a git working tree for this PR's branch & doing a
Per our discussion, some additional modifications from an OpenCoarrays fork I had worked on as part of my graduate research (and tested with ULFM2 before that fork was merged back into Open MPI) that implemented a subset of Fortran 2018 failed images semantics might be needed---though that OpenCoarrays fork depended on patches to the GFortran front end to support a little extra necessary Fortran 2018 syntax. I'll plan to update you offline regarding this. |
@nathanweeks thanks for the update. |
@nathanweeks any thoughts on whether we should keep this PR open or close it? |
@rouson as OpenMPI still isn't quite at 5.0.0 yet (https://github.com/open-mpi/ompi/milestone/37), and additional changes to OpenCoarrays (and GFortran) would be needed to implement a useful subset of Fortran 2018 failed images semantics, it might be prudent to scale back this draft PR to include only what's needed to support Open-MPI without failed images. If that sounds OK, I will plan to propose modifications to this PR this weekend. Either way, any guidance is appreciated.
|
@nathanweeks I defer to your judgement. That sounds like a good plan, but I don't have a very deep appreciation of the considerations involved or the effort. Feel free to proceed as you see fit. I'll look forward to the contribution. |
@rouson I'll suggest closing this one for now. The It might be worth reconsidering this PR in the future if there's further development on the failed images implementation in OpenCoarrays. And I'm still planning to re-submit GFortran patch for https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87939 and https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87326 after the GCC 13 branch is created (I missed the deadline for GCC 12 due to problems with a couple GFortran tests), which could help facilitate this. |
Summary of Changes
Preliminary ULFM2 support when OpenCoarrays is configured with -DCAF_ENABLE_FAILED_IMAGES=TRUE.
MPI_Errhandler_set()
withMPI_Comm_set_errhandler()
, as Open-MPI (which ULFM2 is forked from) doesn't support the former by default (only if configured with--enable-mpi1-compatibility
).--disable-auto-cleanup
to the mpiexec command argument list. This pull request suggests changing the MPICH-specific--reenable-auto-cleanup
option to the self-descriptive--disable-failed-images
, and internally adding--disable-recovery
to the mpiexec argument list if using ULFM2, else--reenable-auto-cleanup
.Note that not all of the regression tests pass; in particular, the failed images tests seem to fail due to a problem with ULFM2 + an MPI thread level > MPI_THREAD_SINGLE (https://bitbucket.org/icldistcomp/ulfm2/issues/41).
Rationale for changes
ULFM2 provides an alternate ULFM implementation for failed images support.
Additional info and certifications
This pull request (PR) is a:
I certify that
- Increasing test coverage for all feature-addition PRs
- Increasing test coverage for all bug-fix PRs for which there
does not already exist a related test that failed before the PR
- At least maintaining test coverage for all other PRs
- Ensuring that all tests pass when run locally
- Naming PR to indicate work in progress (WIP) and to attach the PR
to the appropriate bug report or feature request [issue]
- White space (no trailing white space or white space errors may
be introduced)
- Commenting code where it is non-obvious and non-trivial
- Logically atomic, self consistent and coherent commits
- Commit message content
- Waiting 24 hours before self-approving the PR to give another
OpenCoarrays developer a chance to review my proposed code