Skip to content

Export/ulfm to ompi5 #7740

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Feb 10, 2021
Merged

Conversation

abouteiller
Copy link
Member

@abouteiller abouteiller commented May 15, 2020

Features

This implementation conforms to the User Level Failure Mitigation (ULFM)
MPI Standard draft proposal. The ULFM proposal is developed by the MPI
Forum's Fault Tolerance Working Group to support the continued operation of
MPI programs after crash (node failures) have impacted the execution. The key
principle is that no MPI call (point-to-point, collective, RMA, IO, ...) can
block indefinitely after a failure, but must either succeed or raise an MPI
error.

This implementation produces the three supplementary error codes and five
supplementary interfaces defined in the communicator section of the
[http://fault-tolerance.org/wp-content/uploads/2012/10/20170221-ft.pdf]
(ULFM chapter) standard draft document.

  • MPIX_ERR_PROC_FAILED when a process failure prevents the completion of
    an MPI operation.
  • MPIX_ERR_PROC_FAILED_PENDING when a potential sender matching a non-blocking
    wildcard source receive has failed.
  • MPIX_ERR_REVOKED when one of the ranks in the application has invoked the
    MPI_Comm_revoke operation on the communicator.
  • MPIX_Comm_revoke(MPI_Comm comm) Interrupts any communication pending on
    the communicator at all ranks.
  • MPIX_Comm_shrink(MPI_Comm comm, MPI_Comm* newcomm) creates a new
    communicator where dead processes in comm were removed.
  • MPIX_Comm_agree(MPI_Comm comm, int *flag) performs a consensus (i.e. fault
    tolerant allreduce operation) on flag (with the operation bitwise AND).
  • MPIX_Comm_failure_get_acked(MPI_Comm, MPI_Group*) obtains the group of
    currently acknowledged failed processes.
  • MPIX_Comm_failure_ack(MPI_Comm) acknowledges that the application intends
    to ignore the effect of currently known failures on wildcard receive
    completions and agreement return values.

Supported Systems

There are several MPI engines available in Open MPI,
notably, PML "ob1", "cm", "ucx", and MTL "ofi", "portals4", "psm2".
At this point, only "ob1" is adapted to support fault tolerance.

"ob1" uses BTL ("Byte Transfer Layer") components for each supported
network. "ob1" supports a variety of networks that can be used in
combination with each other. Collective operations (blocking and
non-blocking) use an optimized implementation on top of "ob1".

  • Loopback (send-to-self)
  • TCP
  • UCT (InfiniBand)
  • uGNI (Cray Gemini, Aries)
  • Shared Memory (FT supported w/CMA and XPmem; KNEM is untested)
  • Tuned and non-blocking collective communications

More details available in README.FT.ULFM.md

Performance tests

For p2p operations

Each solid line represents an IMB-MPI1 run with Open MPI master. ucx/ib between ranks located at two different nodes, ucx/sm between two ranks located on sibling cores on a single node. The performance of pr7740 is overlayed with dashed lines and squares/circles.

Left column is latency (lower is better); Right column is bandwidth (higher is better).

Graph rows alternate between the default setting on that machine (pml=ucx), for validation only, and mode that supports runtime ft=on (pml=ob1/uct).

The patch introduces no substantive changes to the UCX pml pathway, so we expect no difference (which is what we see), and we cannot run with ft=on.

The PML ob1 pathway supports FT, and we see no performance difference with ft=off, and maybe a very slight effect with ft=on.

pr7740-p2p

For coll operations

In this set of tests we run a selection of IMB collective benchmarks. We present only the pml=ob1 case (ucx case shows no difference). sm+cma is used for communication between cores, uct/mlx5 is used between nodes, map-by slot (all other settings default, thus coll=tuned).

The results are a bit more variable as is illustrated by the fact that it happens that ft=on is faster in some instances than ft=off or master. We would need more statistics to cleanup the look of the curves, but overall the performance impact appears non-measurable.

pr7740-coll

Aries systems

This is a set of tests that compares ULFM with its upstream master (an older version) on Cori (ugni). This graph aggregates a log of data as each violin point is the aggregate of hundred of IMB runs. Blue is master reference, orange is ulfm ft=on. We present the distribution of multiple runs per message size.

Again, the difference is minuscule, if any.

UvO-pingpongbw

@ompiteam-bot
Copy link

Can one of the admins verify this patch?

@rhc54
Copy link
Contributor

rhc54 commented May 15, 2020

yikes - can you squash this down?

@abouteiller abouteiller force-pushed the export/ulfm-to-ompi5 branch from 7844e02 to 73b6fa4 Compare May 15, 2020 01:24
@jjhursey
Copy link
Member

It's my understanding that this PR is the OMPI side companion of openpmix/prrte#542 in PRRTE.

@abouteiller abouteiller force-pushed the export/ulfm-to-ompi5 branch 2 times, most recently from cf886d5 to 38a392d Compare May 15, 2020 19:20
@bosilca bosilca added this to the v5.0.0 milestone May 19, 2020
@abouteiller
Copy link
Member Author

bot:retest

@gpaulsen
Copy link
Member

PRRTE has requested that the Open MPI community inform them if we're taking this for v5.0 or if we'll hold it for post v5.0.

Everyone, please review this PR, and we can discuss and make a decision on 5/26 web-ex.

@awlauria awlauria linked an issue May 19, 2020 that may be closed by this pull request
@abouteiller abouteiller force-pushed the export/ulfm-to-ompi5 branch from 38a392d to 3b1bd67 Compare May 27, 2020 16:47
@abouteiller abouteiller force-pushed the export/ulfm-to-ompi5 branch from 3b1bd67 to 22522b0 Compare July 30, 2020 17:33
@nathanweeks
Copy link
Contributor

Is this PR still slated to make it into 5.0?

@bosilca
Copy link
Member

bosilca commented Aug 6, 2020

yes. the conlicts will be removed asap.

@abouteiller abouteiller force-pushed the export/ulfm-to-ompi5 branch from 22522b0 to 99aa09b Compare August 10, 2020 15:18
@gpaulsen gpaulsen requested a review from devreal September 1, 2020 15:12
devreal
devreal previously requested changes Sep 3, 2020
Copy link
Contributor

@devreal devreal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of minor changes/comments/questions/missing braces. I cannot vouch for the conceptual correctness as I haven't tried to fully understand the algorithms (there is only so much time available) and I only quickly gazed over the build system and Fortran changes as I'm not familiar with either. Maybe someone else wants to look at these parts...

@abouteiller
Copy link
Member Author

Thanks Joseph that's very thorough. I'll mark the items resolved as I go through it.

abouteiller added a commit to abouteiller/ompi-aurelien that referenced this pull request Sep 15, 2020
…iew.

670301e7...69eb9cc5

Protect and warn against using fault-tolerance with malfunctionning
select-based libevent.

Signed-off-by: Aurélien Bouteiller <[email protected]>

Remove the disable-build implicit platform for FT

Make sure that we can run failure-free with PML that do not support FT

Signed-off-by: Aurelien Bouteiller <[email protected]>

ununsed variable in errhandler

Signed-off-by: Aurelien Bouteiller <[email protected]>

Update the ft-mpi tune file for prte detector default

Signed-off-by: Aurelien Bouteiller <[email protected]>

Remove debug function OMPI_Comm_failure_inject

Signed-off-by: Aurelien Bouteiller <[email protected]>

Do not leak the port_string in spawn error cases

Signed-off-by: Aurelien Bouteiller <[email protected]>

Rbcast n2 would not send to the remote_group

Signed-off-by: Aurelien Bouteiller <[email protected]>

Check for NULL flag arg in agree/is_revoked

Signed-off-by: Aurelien Bouteiller <[email protected]>

Substitue MPI_Wtime with PMPI_Wtime.

Signed-off-by: Aurelien Bouteiller <[email protected]>

General cleanup
Address comments from Joseph Schuchart:
open-mpi#7740 (review)

Signed-off-by: Aurelien Bouteiller <[email protected]>

Remove dead code in detector

Signed-off-by: Aurelien Bouteiller <[email protected]>

ftagree: Rewrite obfuscated for loops as such and explain why, when it cannot
be done

Signed-off-by: Aurelien Bouteiller <[email protected]>

Use normal datatype operations in ETA agree

Signed-off-by: Aurelien Bouteiller <[email protected]>

Align the 'proc padding' by moving the 'proc active' fiedl as it breaks
openshmem otherwise.

Signed-off-by: Aurelien Bouteiller <[email protected]>

Serialize access to the 'all_failed_procs' group, there was a risk of
concurrent accesses to outdated aliases from different threads.

Signed-off-by: Aurelien Bouteiller <[email protected]>

Errhandler should not call progress recursively; we did add a recursive call by mistake when activating PMIx event notification

Signed-off-by: Aurelien Bouteiller <[email protected]>

Merge branch 'master' into export/ulfm-to-ompi5-expanded

do not modify the loop indice in the inner loop

Signed-off-by: Aurelien Bouteiller <[email protected]>

Uninitialized variable in ft_rbcast was harmless

Signed-off-by: Aurelien Bouteiller <[email protected]>

Do the cancel-wait error path only for errors that are garanteed to
complete the wait. Other error types just request_free.

Signed-off-by: Aurelien Bouteiller <[email protected]>

convert proc-failed any-source errors to single source in collectives

Signed-off-by: Aurelien Bouteiller <[email protected]>

Merge branch 'master' into export/ulfm-to-ompi5-expanded
@ibm-ompi

This comment has been minimized.

@ibm-ompi

This comment has been minimized.

@jjhursey
Copy link
Member

bot:ibm:pgi:retest

@ibm-ompi

This comment has been minimized.

@ibm-ompi
Copy link

The IBM CI (XL) build failed! Please review the log, linked below.

Gist: https://gist.github.com/9ee672eee7b1160d304dd4948f237c14

@ibm-ompi
Copy link

The IBM CI (GNU/Scale) build failed! Please review the log, linked below.

Gist: https://gist.github.com/d636bd89cf5c13641c881fa46943c9c0

@ibm-ompi
Copy link

The IBM CI (PGI) build failed! Please review the log, linked below.

Gist: https://gist.github.com/0bd7744a554a5e9f298db038a01de7b8

@ibm-ompi
Copy link

The IBM CI (GNU/Scale) build failed! Please review the log, linked below.

Gist: https://gist.github.com/1c5d49907771e1fd8ad92fe3737af72d

@ibm-ompi
Copy link

The IBM CI (XL) build failed! Please review the log, linked below.

Gist: https://gist.github.com/b875bd150564e4a03b71b59ba437fc06

@ibm-ompi
Copy link

The IBM CI (PGI) build failed! Please review the log, linked below.

Gist: https://gist.github.com/777e273b28cb62aed863a6a32be927a7

@abouteiller
Copy link
Member Author

CI runtime failure is due to missing openpmix/prrte#643 in prte.

@jsquyres
Copy link
Member

jsquyres commented Jan 5, 2021

Discussion on 5 Jan 2021 webex:

  • It would be great to get this in to master as soon as possible.
    • We need time to stabilize this code in master. So: the sooner, the better
    • Whether OMPI takes ULFM or not also has an impact on PMIx and/or PRRTE. So we need to let our peer communities know what we're going to do.
  • Need code reviews from people before this is merged. You don't necessarily to have to review all the internals of ULFM (if you don't want to); emphasis should definitely be placed on making sure ULFM doesn't negatively non-ULFM OMPI/OSHMEM runs.
  • This PR has been here for months. If there are no meaningful issues by 26 Jan 2021, this PR can be merged.

@abouteiller abouteiller force-pushed the export/ulfm-to-ompi5 branch from bd8c781 to 77b2dc7 Compare January 6, 2021 17:49
@hppritcha
Copy link
Member

bot:ompi:retest

@hppritcha
Copy link
Member

I'm getting lots of warnings when I try to build this using gcc 4.8.5 on a RHEL7 system:

In file included from ../../../../../ompi/group/group.h:37:0,
                 from pgroup_compare_f.c:25:
../../../../../ompi/proc/proc.h:87:28: warning: redefinition of typedef 'ompi_proc_t' [-Wpedantic]
 typedef struct ompi_proc_t ompi_proc_t;
                            ^
In file included from ../../../../../ompi/mpi/fortran/mpif-h/prototypes_mpi.h:53:0,
                 from ../../../../../ompi/mpi/fortran/mpif-h/bindings.h:81,
                 from pgroup_compare_f.c:24:
../../../../../ompi/errhandler/errhandler.h:457:28: note: previous declaration of 'ompi_proc_t' was here
 typedef struct ompi_proc_t ompi_proc_t;
                            ^
In file included from ../../../../../ompi/group/group.h:37:0,
                 from pgroup_free_f.c:25:
../../../../../ompi/proc/proc.h:87:28: warning: redefinition of typedef 'ompi_proc_t' [-Wpedantic]
 typedef struct ompi_proc_t ompi_proc_t;
                            ^
In file included from ../../../../../ompi/mpi/fortran/mpif-h/prototypes_mpi.h:53:0,
                 from ../../../../../ompi/mpi/fortran/mpif-h/bindings.h:81,
                 from pgroup_free_f.c:24:
../../../../../ompi/errhandler/errhandler.h:457:28: note: previous declaration of 'ompi_proc_t' was here
 typedef struct ompi_proc_t ompi_proc_t;
                            ^
In file included from ../../../../../ompi/group/group.h:37:0,
                 from pgroup_excl_f.c:25:
../../../../../ompi/proc/proc.h:87:28: warning: redefinition of typedef 'ompi_proc_t' [-Wpedantic]
 typedef struct ompi_proc_t ompi_proc_t;
                            ^
In file included from ../../../../../ompi/mpi/fortran/mpif-h/prototypes_mpi.h:53:0,
                 from ../../../../../ompi/mpi/fortran/mpif-h/bindings.h:81,

@hppritcha
Copy link
Member

Did some testing using ompi-tests/ibm and things look nominal. Would be nice to get the compiler warnings fixed before merging this PR. Also would be nice to decide whether to remove the --no-orte OMPI Jenkins CI test or else modify so that it passes before this PR is merged.

@@ -595,6 +595,9 @@ enum {
MPI_WIN_DISP_UNIT,
MPI_WIN_CREATE_FLAVOR,
MPI_WIN_MODEL,

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd recommend removing lines 598 and 599 as unneeded.

@jjhursey
Copy link
Member

jjhursey commented Jan 26, 2021

Here is the CI error for the "--no-orte" (as mentioned this aliases to "--no-prrte")

configure:37519: ===== configuring 3rd-party/openpmix =====
configure:37708: running /bin/bash ./configure --disable-option-checking '--prefix=/home/ubuntu/workspace/open-mpi.build.autogen_options/AUTOGEN_OPTIONS/--no-orte/install' --without-tests-examples --disable-pmix-binaries --disable-pmix-backward-compatibility --disable-visibility --with-libevent=cobuild --with-prte-extra-lib=" /home/ubuntu/workspace/open-mpi.build.autogen_options/AUTOGEN_OPTIONS/--no-orte/src/3rd-party/libevent-2.1.12-stable/libevent.la /home/ubuntu/workspace/open-mpi.build.autogen_options/AUTOGEN_OPTIONS/--no-orte/src/3rd-party/libevent-2.1.12-stable/libevent_pthreads.la" 'CPPFLAGS= -I/home/ubuntu/workspace/open-mpi.build.autogen_options/AUTOGEN_OPTIONS/--no-orte/src/3rd-party/libevent-2.1.12-stable -I/home/ubuntu/workspace/open-mpi.build.autogen_options/AUTOGEN_OPTIONS/--no-orte/src/3rd-party/libevent-2.1.12-stable/include' --cache-file=/dev/null --srcdir=.
configure:37728: ===== done with 3rd-party/openpmix configure =====
configure:39687: error: Requested enabling fault-tolerance and using external launcher, but external PRTE doesn't support resilience; you can either use the internal PRTE, recompile the external PRTE with fault-tolerance, or disable fault-tolerance. Aborting.

@abouteiller
Copy link
Member Author

Did some testing using ompi-tests/ibm and things look nominal. Would be nice to get the compiler warnings fixed before merging this PR. Also would be nice to decide whether to remove the --no-orte OMPI Jenkins CI test or else modify so that it passes before this PR is merged.

Issue when using external prte has been resolved (ft will get compile-disabled if it was auto-enabled, error at configure is requested with explicit --with-ft configure flag).

The --no-orte flag is useful, but should be renamed --no-internal-prte

Launching with fault tolerance has been simplified (see prte openpmix/prrte#726)

@rhc54
Copy link
Contributor

rhc54 commented Jan 28, 2021

The --no-orte flag is useful, but should be renamed --no-internal-prte

There appears to be a misunderstanding of the --no-orte flag. I agree it should be renamed, but it probably should be changed to --no-rte. It specifically is meant to build OMPI without any RTE to support environments that only want direct launch.

We specifically stated in telecons that OMPI did not want the ability to build against an external PRRTE, and @bwbarrett wrote the 3rd-party configure code accordingly. Since that time, the downstream packagers have indicated they plan to release PRRTE separately, so we might want to revisit that decision.

For now, however, --no-orte should completely disable the RTE, not link us against an external one.

Copy link
Member

@bosilca bosilca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had few minor comments @abouteiller and myself need to address.

The historical repositories contain the full history and
attribution and are available from
  https://bitbucket.org/icldistcomp/ulfm2/src/ulfm/
and prior
  https://github.com/ICLDisco/ulfm-legacy

Signed-off-by: Aurelien Bouteiller <[email protected]>
Signed-off-by: George Bosilca <[email protected]>
Signed-off-by: Josh Hursey <[email protected]>
Signed-off-by: Thomas Herault <[email protected]>
Signed-off-by: Wesley Bland <[email protected]>
Signed-off-by: Nuria Losada <[email protected]>
Signed-off-by: Nathan T. Weeks <[email protected]>

Squashed commit of the following:

commit 262a7f750fc9a86cc936aa59b535936fb3406db6
Author: Aurelien Bouteiller <[email protected]>
Date:   Mon Feb 8 11:03:35 2021 -0500

    Fix a case where a send-request to a failed process would be started
    without an endpoint (issue
    open-mpi#7740 (comment))

    Signed-off-by: Aurelien Bouteiller <[email protected]>

  ...

commit b8fbc4e200f8acd5bfbcd7c493ff04e674e2a023
Merge: c02ef875 724eb49
Author: Aurelien Bouteiller <[email protected]>
Date:   Thu Jan 28 00:05:02 2021 -0500

    Merge branch 'master' into export/ulfm-to-ompi5-expanded

  ...

commit 69ab6b8
Author: Aurélien Bouteiller <[email protected]>
Date:   Thu Feb 18 20:03:12 2016 -0500

    Importing ULFM ompi layer: snapshot of WIP
    Missing BTL and COLL imports.
    Almost compiles w/o --with-ft

Signed-off-by: Aurelien Bouteiller <[email protected]>
@abouteiller abouteiller dismissed devreal’s stale review February 9, 2021 15:26

all changes implemented

@abouteiller
Copy link
Member Author

ready to go.

@bosilca bosilca merged commit 9c374e1 into open-mpi:master Feb 10, 2021
awlauria pushed a commit to awlauria/ompi that referenced this pull request Mar 11, 2021
The historical repositories contain the full history and
attribution and are available from
  https://bitbucket.org/icldistcomp/ulfm2/src/ulfm/
and prior
  https://github.com/ICLDisco/ulfm-legacy

Signed-off-by: Aurelien Bouteiller <[email protected]>
Signed-off-by: George Bosilca <[email protected]>
Signed-off-by: Josh Hursey <[email protected]>
Signed-off-by: Thomas Herault <[email protected]>
Signed-off-by: Wesley Bland <[email protected]>
Signed-off-by: Nuria Losada <[email protected]>
Signed-off-by: Nathan T. Weeks <[email protected]>

Squashed commit of the following:

commit 262a7f750fc9a86cc936aa59b535936fb3406db6
Author: Aurelien Bouteiller <[email protected]>
Date:   Mon Feb 8 11:03:35 2021 -0500

    Fix a case where a send-request to a failed process would be started
    without an endpoint (issue
    open-mpi#7740 (comment))

    Signed-off-by: Aurelien Bouteiller <[email protected]>

  ...

commit b8fbc4e200f8acd5bfbcd7c493ff04e674e2a023
Merge: c02ef875 724eb49
Author: Aurelien Bouteiller <[email protected]>
Date:   Thu Jan 28 00:05:02 2021 -0500

    Merge branch 'master' into export/ulfm-to-ompi5-expanded

  ...

commit 69ab6b8
Author: Aurélien Bouteiller <[email protected]>
Date:   Thu Feb 18 20:03:12 2016 -0500

    Importing ULFM ompi layer: snapshot of WIP
    Missing BTL and COLL imports.
    Almost compiles w/o --with-ft

Signed-off-by: Aurelien Bouteiller <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ULFM in master