Skip to content

[WIP] Support ULFM2 #624

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

nathanweeks
Copy link
Contributor

@nathanweeks nathanweeks commented Jan 6, 2019

Summary of Changes

Preliminary ULFM2 support when OpenCoarrays is configured with -DCAF_ENABLE_FAILED_IMAGES=TRUE.

  • Enable building OpenCoarrays with ULFM2 by replacing the deprecated MPI_Errhandler_set() with MPI_Comm_set_errhandler(), as Open-MPI (which ULFM2 is forked from) doesn't support the former by default (only if configured with --enable-mpi1-compatibility).
  • Support the ULFM2 mpiexec in cafrun by not adding --disable-auto-cleanup to the mpiexec command argument list. This pull request suggests changing the MPICH-specific --reenable-auto-cleanup option to the self-descriptive --disable-failed-images, and internally adding --disable-recovery to the mpiexec argument list if using ULFM2, else --reenable-auto-cleanup.

Note that not all of the regression tests pass; in particular, the failed images tests seem to fail due to a problem with ULFM2 + an MPI thread level > MPI_THREAD_SINGLE (https://bitbucket.org/icldistcomp/ulfm2/issues/41).

Rationale for changes

ULFM2 provides an alternate ULFM implementation for failed images support.

Additional info and certifications

This pull request (PR) is a:

  • Bug fix
  • Feature addition
  • Other, Please describe:

I certify that

  • I reviewed and followed the [contributing guidelines], including
    - Increasing test coverage for all feature-addition PRs
    - Increasing test coverage for all bug-fix PRs for which there
    does not already exist a related test that failed before the PR
    - At least maintaining test coverage for all other PRs
    - Ensuring that all tests pass when run locally
    - Naming PR to indicate work in progress (WIP) and to attach the PR
    to the appropriate bug report or feature request [issue]
    - White space (no trailing white space or white space errors may
    be introduced)
    - Commenting code where it is non-obvious and non-trivial
    - Logically atomic, self consistent and coherent commits
    - Commit message content
    - Waiting 24 hours before self-approving the PR to give another
    OpenCoarrays developer a chance to review my proposed code

@ghost ghost added the needs-review label Jan 6, 2019
@codecov
Copy link

codecov bot commented Jan 6, 2019

Codecov Report

Merging #624 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #624   +/-   ##
=======================================
  Coverage   55.73%   55.73%           
=======================================
  Files           4        4           
  Lines        3158     3158           
=======================================
  Hits         1760     1760           
  Misses       1398     1398

@rouson
Copy link
Member

rouson commented Jan 14, 2019

@nathanweeks any updates on this PR? @zbeekman and I are working on a new release. We won't merge this PR until the WIP is removed, but I just thought I'd check with you in case it's ready for release.

@nathanweeks
Copy link
Contributor Author

@rouson : No updates yet. The aforementioned ULFM2 MPI-thread-level issue is still a blocker; if/when that is resolved, it's possible additional changes to mpi_caf.c may be needed before all tests pass & ULFM2 support could be advertised.

@ghost ghost assigned rouson Mar 10, 2019
@stale
Copy link

stale bot commented Apr 7, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@rouson
Copy link
Member

rouson commented Sep 30, 2020

@nathanweeks shall we close this PR or do you have plans to finish it and remove the "WIP" from the title? There have been several commits to the master branch since this PR was created so it will need to be updated if it's still intended to be merged at some point.

@nathanweeks
Copy link
Contributor Author

Hi @rouson, I'd like an opportunity to test/revise sometime after the PR to add ULFM support to Open-MPI (open-mpi/ompi#7740 ) is merged.

@rouson
Copy link
Member

rouson commented Oct 20, 2021

@nathanweeks The PR to add ULFM support to Open-MPI (open-mpi/ompi#7740 ) was merged was merged on February 10. Are you still interested in testing/revising this PR?

@nathanweeks
Copy link
Contributor Author

@rouson: I'm definitely still interested in testing & merging this PR. It's been a while since I've attempted building OpenCoarrays using Open MPI built from the master branch; IIRC the last time I tried (over a year ago) I was blocked by open-mpi/ompi#8086 when attempting to run any code. I'll check to see if this issue has been resolved with the current Open MPI, and if so try to finish this PR before SC21.

Open-MPI doesn't support the deprecated MPI_Errhandler_set() unless
configured with --enable-mpi1-compatibility
Change (MPICH-specific) --reenable-auto-cleanup cafrun option to
--disable-failed-images, and select appropriate mpiexec option based on
MPI implementation (applicable only when CAF_ENABLE_FAILED_IMAGES=TRUE)
Open MPI mpiexec --enable-recovery to enable recovery from image failure
@nathanweeks nathanweeks force-pushed the issue-623-ulfm2-cafrun branch from cda749c to 3e8f663 Compare October 23, 2021 15:05
@rouson
Copy link
Member

rouson commented Oct 23, 2021

Thanks, @nathanweeks. Let me know if you think this is ready to merge. On a related note, a question came up at last week's Fortran committee meeting about whether the MPI 4.0 standard supports fault tolerance. I think not. Do you know whether the features in ULFM2 are headed toward inclusion in a future MPI standard?

@nathanweeks
Copy link
Contributor Author

HI @rouson , the failed images tests (and several tests that exercise SYNC IMAGES) fail when using Open MPI v5.0.0rc2. Reproducible using the following Dockerfile (which TBH doesn't need to involve a multi-stage build for testing):

FROM ubuntu:20.04 AS openmpi

RUN apt update && apt install -y --no-install-recommends \
  autoconf \
  automake \
  ca-certificates \
  flex \
  g++-10 \
  gcc-10 \ 
  gfortran-10 \ 
  git \
  libtool \
  make \
  openssh-client \
  python3

WORKDIR /src

ARG REF=v5.0.0rc2

RUN git init \
  && git remote add origin https://github.com/open-mpi/ompi.git \
  && git fetch --depth=1 origin --tags ${REF}  \
  && git checkout ${REF} \
  && git submodule update --init --recursive --depth=1 \
  && ./autogen.pl

RUN sh ./configure --disable-io-romio --disable-man-pages CC=gcc-10 CXX=g++-10 FC=gfortran-10 \
  && make -j \
  && make install \
  && ldconfig

FROM ubuntu:20.04 AS opencoarrays

RUN apt update && apt install -y --no-install-recommends \
  cmake \
  gcc-10 \
  gfortran-10 \
  make \
  openssh-client \
  && rm -rf /var/lib/apt/lists/*

COPY --from=openmpi /usr/local/ /usr/local/
RUN ldconfig

COPY . /src
WORKDIR /src/build

# environment variables that need to be set before build, else `mpiexec
# --version` fails, openmpi is not detected, and build defaults to mpich
ENV OMPI_ALLOW_RUN_AS_ROOT=1 \
    OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1 \
    PMIX_MCA_pcompress_base_silence_warning=1

RUN cmake .. -DCMAKE_BUILD_TYPE=Debug \
             -DCAF_ENABLE_FAILED_IMAGES=TRUE \
             -DCMAKE_C_COMPILER=gcc-10 \
             -DCMAKE_Fortran_COMPILER=gfortran-10 \
  && make -j \
  && ctest --timeout 5 --output-on-failure

Writing the above Dockerfile in the root of a git working tree for this PR's branch & doing a docker build reveals failures in the following tests:

$ docker build -t opencoarrays:issue-623-ulfm2-cafrun .
...
#16 68.37 87% tests passed, 12 tests failed out of 91
#16 68.37
#16 68.37 Total Test time (real) =  68.21 sec
#16 68.37
#16 68.37 The following tests FAILED:
#16 68.37        51 - syncimages_status (Timeout)
#16 68.37        52 - sync_ring_abort_np3 (Timeout)
#16 68.37        53 - sync_ring_abort_np7 (Timeout)
#16 68.37        79 - issue-700-allow-multiple-scalar-dim-array-gets (Failed)
#16 68.37        81 - image_fail_test_1 (Timeout)
#16 68.37        82 - image_fail_and_sync_test_1 (Timeout)
#16 68.37        83 - image_fail_and_sync_test_2 (Timeout)
#16 68.37        84 - image_fail_and_sync_test_3 (Timeout)
#16 68.37        85 - image_fail_and_status_test_1 (Timeout)
#16 68.37        86 - image_fail_and_failed_images_test_1 (Timeout)
#16 68.37        87 - image_fail_and_stopped_images_test_1 (Timeout)
#16 68.37        88 - image_fail_and_get_test_1 (Timeout)

Per our discussion, some additional modifications from an OpenCoarrays fork I had worked on as part of my graduate research (and tested with ULFM2 before that fork was merged back into Open MPI) that implemented a subset of Fortran 2018 failed images semantics might be needed---though that OpenCoarrays fork depended on patches to the GFortran front end to support a little extra necessary Fortran 2018 syntax. I'll plan to update you offline regarding this.

@rouson
Copy link
Member

rouson commented Oct 23, 2021

@nathanweeks thanks for the update.

@rouson
Copy link
Member

rouson commented Feb 14, 2022

@nathanweeks any thoughts on whether we should keep this PR open or close it?

@nathanweeks
Copy link
Contributor Author

nathanweeks commented Feb 17, 2022

@rouson as OpenMPI still isn't quite at 5.0.0 yet (https://github.com/open-mpi/ompi/milestone/37), and additional changes to OpenCoarrays (and GFortran) would be needed to implement a useful subset of Fortran 2018 failed images semantics, it might be prudent to scale back this draft PR to include only what's needed to support Open-MPI without failed images. If that sounds OK, I will plan to propose modifications to this PR this weekend. Either way, any guidance is appreciated.

GitHub
Open MPI main development repository. Contribute to open-mpi/ompi development by creating an account on GitHub.

@rouson
Copy link
Member

rouson commented Feb 17, 2022

@nathanweeks I defer to your judgement. That sounds like a good plan, but I don't have a very deep appreciation of the considerations involved or the effort. Feel free to proceed as you see fit. I'll look forward to the contribution.

@nathanweeks
Copy link
Contributor Author

nathanweeks commented Feb 21, 2022

@rouson I'll suggest closing this one for now. The main branch already builds with Open MPI 4.x without failed images support, and failed images support will always be disabled for Open MPI <= 4.x due to cmake code that disables failed images support if relevant fault-tolerance routines don't exist in the MPI implementation (overriding any user-specified cmake configure option CAF_ENABLE_FAILED_IMAGES=TRUE).

It might be worth reconsidering this PR in the future if there's further development on the failed images implementation in OpenCoarrays. And I'm still planning to re-submit GFortran patch for https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87939 and https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87326 after the GCC 13 branch is created (I missed the deadline for GCC 12 due to problems with a couple GFortran tests), which could help facilitate this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants