osc/rdma, btl: fix two issues with one-sided #9594

Closed

wzamazon wants to merge 7 commits into open-mpi:master from wzamazon:oneside_fix_upstream

Contributor

wzamazon commented Oct 25, 2021

This PR contains a series of patches that aims to fix

and

wzamazon requested review from hjelmn and bwbarrett

October 25, 2021 16:56

Contributor Author

wzamazon commented Oct 25, 2021

tested using ompi-tests/onesided

awlauria requested a review from bosilca

October 25, 2021 17:53

wckzhang added bug Severity: blocker Target: main labels

bwbarrett requested changes

opal/mca/btl/base/base.h Outdated

                                                             mca_btl_base_module_t *module);
               OPAL_DECLSPEC int mca_btl_base_param_verify(mca_btl_base_module_t *module);
+              OPAL_DECLSPEC int mca_btl_base_am_atomic_32(int32_t *operand, opal_atomic_int32_t *addr,

Member

bwbarrett Oct 27, 2021

I don't think this is the right way to add atomic support to btl/self. You shouldn't have to do anything to use the am interface with btl/self. The osc rdma component should call mca_btl_base_am_rdma_init() on the selected btl, which will fill in these functions. Now, we can (and probably should) have a discussion with Nathan about why the osc rdma component calls am_rdma_init instead of the btl itself, but either way, I don't think this commit is right.

ompi/mca/osc/rdma/osc_rdma_component.c Outdated

@@ @@ -267,7 +267,7 @@ static int ompi_osc_rdma_component_register (void) @@
                                                           MCA_BASE_VAR_SCOPE_GROUP, &ompi_osc_rdma_btl_names);
                   free(description_str);
-                  ompi_osc_rdma_btl_alternate_names = "sm,tcp";
+                  ompi_osc_rdma_btl_alternate_names = "self,sm,tcp";

Member

bwbarrett Oct 28, 2021

I'm not sure this is the right thing to do. There's no real reason to allow-list alternate BTLs, and it still means that we don't have a good solution for BTLs that aren't in one of the two allow lists. I can understand the first-level allow list (for now), but here we should just handle as many BTLs as it takes to reach everyone in the window, and any BTL should work (that's the whole point of the AM wrapper).

ompi/mca/osc/rdma/osc_rdma_peer.c

                   /* find a btl/endpoint to use for this peer */
                   int ret = ompi_osc_rdma_peer_btl_endpoint (module, peer_id, &module_btl_index, &endpoint);
-                  if (OPAL_UNLIKELY(OMPI_SUCCESS != ret && !((module->selected_btls[0]->btl_atomic_flags & MCA_BTL_ATOMIC_SUPPORTS_GLOB) &&

Member

bwbarrett Oct 28, 2021

This conditional is almost certainly wrong, but I don't entirely understand Nathan's comments about the local leader optimization being required on uGNI. I'm pretty sure that this is to handle whatever condition causes us to have to worry about the local leader optimization, so we should explicitly get an ack from @hjelmn here and understand what's going on.

opal/mca/btl/btl.h

@@ @@ -261,6 +261,17 @@ typedef uint8_t mca_btl_base_tag_t; @@
               /* The BTL has active-message based atomics */
               #define MCA_BTL_FLAGS_ATOMIC_AM_FOP 0x400000
+              /* This flag indicates whether the BTL's RDMA operation supports

Member

bwbarrett Oct 28, 2021

I know most of this file isn't, but everything in this file should be in doxygen format (so start with /**)

ompi/mca/osc/rdma/osc_rdma_active_target.c

+                  module->state->num_fenced_peers = 0;
+                  OPAL_THREAD_UNLOCK(&(module->lock));
+                  ret = module->comm->c_coll->coll_barrier(module->comm, module->comm->c_coll->coll_barrier_module);
+                  if (ret) {

Member

bwbarrett Oct 28, 2021

no implicit cast from int to bool

ompi/mca/osc/rdma/osc_rdma_active_target.c

+                  }
+                  /* for each process in the group increment their number of fenced peers */
+                  for (int i = 0 ; i < num_peers; ++i) {

Member

bwbarrett Oct 28, 2021

be consistent in spacing around ;

ompi/mca/osc/rdma/osc_rdma_active_target.c

+                  /* for each process in the group increment their number of fenced peers */
+                  for (int i = 0 ; i < num_peers; ++i) {
+                      ompi_osc_rdma_peer_t *peer = peers[i];
+                      intptr_t target = (intptr_t) peer->state + offsetof (ompi_osc_rdma_state_t, num_fenced_peers);

Member

bwbarrett Oct 28, 2021

no space between offsetof and argument.

ompi/mca/osc/rdma/osc_rdma_active_target.c

-                  ret = module->comm->c_coll->coll_barrier(module->comm, module->comm->c_coll->coll_barrier_module);
+                  if (module->btl_support_remote_completion) {
+                      /* if all selected btls support remote completion, then all RMA operations have finished
+              	 * on remote side. A barrier is enough to complete the fence.

Member

bwbarrett Oct 28, 2021

indenting

ompi/mca/osc/rdma/osc_rdma_active_target.c

+              	 * if that is the case, this function will not have been called
+              	 */
+                      assert (!ompi_osc_rdma_peer_local_state (peer));
+                      ret = ompi_osc_rdma_lock_btl_op (module, peer, target, MCA_BTL_ATOMIC_ADD, 1, true);

Member

bwbarrett Oct 28, 2021

I'm not sure that this is sufficient. BTL operations (including RDMA operations) are not implicitly ordered. An ordering flag must be set on every operation that must be ordered for this to be guaranteed to work. TCP doesn't take advantage of reordering (and I don't think OFI does either), so you likely won't see this in testing, but the verbs BTL did have an ordering problem and we haven't audited the other BTLs, so we should really follow the spec.

wzamazon force-pushed the oneside_fix_upstream branch from 5daf595 to 70285a8 Compare

November 14, 2021 22:19

ibm-ompi commented Nov 14, 2021

The IBM CI (GNU/Scale) build failed! Please review the log, linked below.

Gist: https://gist.github.com/37f2b0805c6ce06764bb9c85e9542070

ibm-ompi commented Nov 14, 2021

The IBM CI (XL) build failed! Please review the log, linked below.

Gist: https://gist.github.com/e62bf6d29d28dbe7485e433e33d4a2ab

ibm-ompi commented Nov 14, 2021

The IBM CI (PGI) build failed! Please review the log, linked below.

Gist: https://gist.github.com/d6cd677f88a124039f111dd524099019


          osc/rdma: adjust osc_rdma_query_alternate_btls()

6e7984d

Currently, osc_rdma_query_alternate_btls() selects alternate btls
from a pre-defined list of btls, osc_rdma_alternate_btl_names.

This patch removes the restriction so that any btl can be used as
an alternate btl. However, for alternate btls, osc/rdma will disable
their native RDMA and atomics implementations and use active message
RDMA/atomics for them, because atomicity across btls
cannot be guaranteed.

Signed-off-by: Wei Zhang <[email protected]>
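
A minimal sketch of the alternate-btl rule described in this commit message, assuming a simplified capability bitmask. The `sketch_` names and flag values below are hypothetical illustrations, not the real definitions from opal/mca/btl/btl.h or the code in this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical capability bits; the real flags live in opal/mca/btl/btl.h. */
#define SKETCH_FLAG_NATIVE_RDMA    0x1u
#define SKETCH_FLAG_NATIVE_ATOMICS 0x2u
#define SKETCH_FLAG_AM_RDMA        0x4u
#define SKETCH_FLAG_AM_ATOMICS     0x8u

/* Hypothetical, simplified view of a btl module. */
typedef struct {
    const char *name;
    uint32_t flags;
} sketch_btl_t;

/* A primary btl keeps its own capabilities. An alternate btl has its
 * native RDMA/atomics masked out and is marked as using active-message
 * emulation instead, since atomicity across btls cannot be guaranteed. */
static uint32_t sketch_effective_flags(const sketch_btl_t *btl, bool is_alternate)
{
    if (!is_alternate) {
        return btl->flags;
    }
    uint32_t flags = btl->flags & ~(SKETCH_FLAG_NATIVE_RDMA | SKETCH_FLAG_NATIVE_ATOMICS);
    return flags | SKETCH_FLAG_AM_RDMA | SKETCH_FLAG_AM_ATOMICS;
}
```

The point of the sketch is only that the choice is made per role (primary versus alternate), not per btl name, so no allow list is needed.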

wzamazon force-pushed the oneside_fix_upstream branch from 70285a8 to fcdb885 Compare

November 15, 2021 00:45

wzamazon added 6 commits

November 15, 2021 00:50


          osc/rdma: do NOT proceed if no endpoint can be found for a peer

d9bdca0

Currently, ompi_osc_rdma_new_peer() can still proceed if no btl endpoint has
been found for a new peer. This causes a segfault when a process tries
to communicate with the peer.

This patch changes that: if a btl endpoint cannot be found for
a peer, ompi_osc_rdma_new_peer() will not proceed.

Signed-off-by: Wei Zhang <[email protected]>
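
The early-return behavior described above can be sketched as follows. This is an illustration under stated assumptions: the endpoint table, the `sketch_` types, and the error codes are hypothetical and do not mirror the actual ompi_osc_rdma_new_peer() signature.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical return codes standing in for OMPI_SUCCESS / an OMPI error. */
#define SKETCH_SUCCESS     0
#define SKETCH_ERR_UNREACH (-1)

/* Hypothetical, simplified peer object. */
typedef struct {
    int peer_id;
    const void *endpoint;
} sketch_peer_t;

/* Hypothetical lookup: a table indexed by peer id, NULL meaning "no btl reaches it". */
static int sketch_find_endpoint(const void *const *table, size_t n,
                                int peer_id, const void **endpoint)
{
    if (peer_id < 0 || (size_t) peer_id >= n || NULL == table[peer_id]) {
        return SKETCH_ERR_UNREACH;
    }
    *endpoint = table[peer_id];
    return SKETCH_SUCCESS;
}

/* Fail early instead of creating a peer that has no usable endpoint. */
static int sketch_new_peer(const void *const *table, size_t n,
                           int peer_id, sketch_peer_t *peer)
{
    const void *endpoint = NULL;
    int ret = sketch_find_endpoint(table, n, peer_id, &endpoint);
    if (SKETCH_SUCCESS != ret) {
        return ret; /* the peer object is left untouched */
    }
    peer->peer_id = peer_id;
    peer->endpoint = endpoint;
    return SKETCH_SUCCESS;
}
```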


          btl: introduce flag MCA_BTL_FLAGS_RDMA_REMOTE_COMPLETION

f5ab467

This patch introduced a new flag MCA_BTL_FLAGS_RDMA_REMOTE_COMPLETION,
which is used to indicate whether a btl's RDMA operations support
remote completion.

3 btls support this feature: self, ofi and ugni, thus this patch
added the flags to them.

Signed-off-by: Wei Zhang <[email protected]>


          btl/btl.h: introduce 3 types of btl ordering requirements

e889cdd

This patch introduces 3 types of btl ordering requirements:

    MCA_BTL_IN_ORDER_RDMA_ATOMICS,
    MCA_BTL_IN_ORDER_SEND,
    MCA_BTL_IN_ORDER_ALL.

Signed-off-by: Wei Zhang <[email protected]>


          btl/base: make active message RDMA/atomics honor ordering requirement

11dea3c

Currently, active message RDMA/atomics always use MCA_BTL_NO_ORDER
on the btl, even when the caller specifies an ordering requirement. This
patch makes active message RDMA/atomics honor the caller's ordering
requirement.

Signed-off-by: Wei Zhang <[email protected]>
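
As a rough illustration of the before/after behavior in this commit message, the sketch below contrasts a wrapper that drops the caller's ordering request with one that forwards it. The descriptor type, the function names, and the sentinel value are hypothetical and are not taken from the btl/base sources.

```c
#include <assert.h>

/* Hypothetical sentinel standing in for MCA_BTL_NO_ORDER. */
#define SKETCH_NO_ORDER 255

/* Hypothetical, simplified send descriptor used by the AM wrapper. */
typedef struct {
    int order;
} sketch_descriptor_t;

/* Behavior described as "currently": the requested order is ignored. */
static void sketch_am_prepare_old(sketch_descriptor_t *des, int requested_order)
{
    (void) requested_order;
    des->order = SKETCH_NO_ORDER;
}

/* Behavior after the patch: the caller's ordering requirement is forwarded. */
static void sketch_am_prepare_new(sketch_descriptor_t *des, int requested_order)
{
    des->order = requested_order;
}
```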


          osc/rdma: fix the usage condition of local leader and cpu atomics

ded237a

This patch fixes the conditions for using the local leader and
cpu atomics optimizations.

The local leader optimization is:

  on each node, one process is designated as the local leader,
  which sets up shared memory, and the other processes on the same
  node map their states into the local leader's shared memory.

  When a process tries to update a peer process's state, it
  does so through atomic operations on the local leader's
  memory. The peer's state is then updated through shared memory.

The cpu atomics optimization is:

  for processes on the same node, use cpu atomics to update peer's
  state (instead of using the selected btl).

Both "local leader" and "cpu atomics" use one channel to transfer
data and a different channel to update a peer's state.

Optimizations of this kind require the selected BTLs to support
remote completion. Otherwise, data corruption is possible,
because a peer's state can be updated before the RDMA operation
has completed on the peer.

However, currently, local leader is used unconditionally, and
"cpu atomics" is used whenever the btl supports mixed usage of cpu
and NIC atomics.

Both conditions are wrong and can cause data corruption.

This patch addresses the issue by using "local leader" and
"cpu atomics" only if all selected BTLs support remote completion.

When "local leader" is not used, each process needs its peers'
state, for which this patch introduces a function, gather_peer_state().

This patch then sets each peer's state pointer using the gathered information,
and uses the same endpoint to update state and transfer data.

Signed-off-by: Wei Zhang <[email protected]>
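
The gating condition described in this commit message ("only if all selected BTLs support remote completion") can be sketched as a simple scan over the selected btls. The flag value and the helper name are assumptions made for illustration; the real flag is MCA_BTL_FLAGS_RDMA_REMOTE_COMPLETION, whose value is not shown in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit value; stands in for MCA_BTL_FLAGS_RDMA_REMOTE_COMPLETION. */
#define SKETCH_FLAGS_RDMA_REMOTE_COMPLETION 0x1u

/* Returns true only when every selected btl advertises remote completion.
 * In this sketch, the same result would gate both the "local leader"
 * and the "cpu atomics" optimizations. */
static bool sketch_all_support_remote_completion(const uint32_t *btl_flags,
                                                 size_t num_btls)
{
    for (size_t i = 0; i < num_btls; ++i) {
        if (0 == (btl_flags[i] & SKETCH_FLAGS_RDMA_REMOTE_COMPLETION)) {
            return false;
        }
    }
    return true;
}
```

A single btl without the flag is enough to disable both optimizations, which matches the "all selected BTLs" wording above.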


          osc/rdma : fence mechanism for btls do not support remote completion

465af9a

Currently, fence is implemented in two steps:

First, wait for RDMA operations to complete locally.
Second, call coll->barrier on the communicator.

This is correct only if the selected BTLs support remote completion.
Otherwise, it can happen that when coll->barrier() finishes, the remote
side of an RDMA operation has not completed yet.

This patch implements a different barrier mechanism, which is used
when any of the selected BTLs does not support remote completion.

In that case, each process posts an atomic operation to every
peer to increment a counter on the peer through the selected BTL endpoint,
then waits for its own counter to reach the number of peers. This ensures
all previous RDMA operations have completed.

Signed-off-by: Wei Zhang <[email protected]>
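
The counter-based fence described above can be sketched in a single address space with C11 atomics. This is a simplified model, not the patch itself: the process count, the `sketch_` names, and the choice to exclude a process from its own peer list are assumptions, and real code would post the increments through the btl endpoint and rely on the ordering guarantees discussed in review.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_NUM_PROCS 4 /* hypothetical window size */

/* Hypothetical per-process state holding the fence counter. */
typedef struct {
    atomic_int num_fenced_peers;
} sketch_state_t;

/* Step 1: rank `self` posts one atomic increment to every peer's counter. */
static void sketch_fence_notify(sketch_state_t *states, int self)
{
    for (int i = 0; i < SKETCH_NUM_PROCS; ++i) {
        if (i == self) {
            continue;
        }
        atomic_fetch_add(&states[i].num_fenced_peers, 1);
    }
}

/* Step 2: the fence is complete for `self` once every peer has checked in. */
static bool sketch_fence_done(sketch_state_t *states, int self)
{
    return (SKETCH_NUM_PROCS - 1) == atomic_load(&states[self].num_fenced_peers);
}
```

In a real fence, each process would poll the condition in step 2 while progressing the network, and the counter would be reset before the next epoch.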

wzamazon force-pushed the oneside_fix_upstream branch from fcdb885 to 465af9a Compare

November 15, 2021 00:51

wzamazon added the ⚠️ WIP-DNM! label

Contributor

awlauria commented Nov 16, 2021

bot:aws:retest

java failure.

Contributor Author

wzamazon commented Nov 23, 2021

Replaced by 3 PRs:

#9694
#9695
#9696

wzamazon closed this

Labels

bug Severity: blocker Target: main ⚠️ WIP-DNM!