Skip to content

osc/rdma, btl: fix two issues with one-sided #9594

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 7 commits into from

Conversation

wzamazon
Copy link
Contributor

This PR contains a series of patches that aims to fix

#8983

and

#7830

@wzamazon wzamazon requested review from hjelmn and bwbarrett October 25, 2021 16:56
@wzamazon
Copy link
Contributor Author

tested using ompi-tests/onesided

@@ -64,6 +64,11 @@ OPAL_DECLSPEC int mca_btl_base_param_register(mca_base_component_t *version,
mca_btl_base_module_t *module);
OPAL_DECLSPEC int mca_btl_base_param_verify(mca_btl_base_module_t *module);

OPAL_DECLSPEC int mca_btl_base_am_atomic_32(int32_t *operand, opal_atomic_int32_t *addr,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is the right way to add atomic support to btl/self. You shouldn't have to do anything to use the am interface with btl/self. The osc rdma component should call mca_btl_base_am_rdma_init() on the selected btl, which will fill in these functions. Now, we can (and probably should) have a discussion with Nathan about why the osc rdma component calls am_rdma_init instead of the btl itself, but either way, I don't think this commit is right.

@@ -267,7 +267,7 @@ static int ompi_osc_rdma_component_register (void)
MCA_BASE_VAR_SCOPE_GROUP, &ompi_osc_rdma_btl_names);
free(description_str);

ompi_osc_rdma_btl_alternate_names = "sm,tcp";
ompi_osc_rdma_btl_alternate_names = "self,sm,tcp";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is the right thing to do. There's no real reason to allow-list alternate BTLs, and it still means that we don't have a good solution for BTLs that aren't in one of the two allowlists. I can understand the first-level allow list (for now), but here we should just handle as many BTLs as it takes to reach everyone in the widow, and any BTL should work (that's the whole point of the AM wrapper).

@@ -83,8 +83,8 @@ int ompi_osc_rdma_new_peer (struct ompi_osc_rdma_module_t *module, int peer_id,

/* find a btl/endpoint to use for this peer */
int ret = ompi_osc_rdma_peer_btl_endpoint (module, peer_id, &module_btl_index, &endpoint);
if (OPAL_UNLIKELY(OMPI_SUCCESS != ret && !((module->selected_btls[0]->btl_atomic_flags & MCA_BTL_ATOMIC_SUPPORTS_GLOB) &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This conditional is almost certainly wrong, but I don't entirely understand Nathan's comments about the local leader optimization being required on uGNI. I'm pretty sure that this is to handle whatever condition causes us to have to worry about the local leader optimization, so we should explicitly get an ack from @hjelmn here and understand what's going on.

@@ -261,6 +261,17 @@ typedef uint8_t mca_btl_base_tag_t;
/* The BTL has active-message based atomics */
#define MCA_BTL_FLAGS_ATOMIC_AM_FOP 0x400000

/* This flag indicates whether the BTL's RDMA operation supports
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know most of this file isn't, but everything in this file should be in doxygen format (so start with /**)

module->state->num_fenced_peers = 0;
OPAL_THREAD_UNLOCK(&(module->lock));
ret = module->comm->c_coll->coll_barrier(module->comm, module->comm->c_coll->coll_barrier_module);
if (ret) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no implicit cast from int to bool

}

/* for each process in the group increment their number of fenced peers */
for (int i = 0 ; i < num_peers; ++i) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

be consistent in spacing around ;

/* for each process in the group increment their number of fenced peers */
for (int i = 0 ; i < num_peers; ++i) {
ompi_osc_rdma_peer_t *peer = peers[i];
intptr_t target = (intptr_t) peer->state + offsetof (ompi_osc_rdma_state_t, num_fenced_peers);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no space between offseof and argument.

ret = module->comm->c_coll->coll_barrier(module->comm, module->comm->c_coll->coll_barrier_module);
if (module->btl_support_remote_completion) {
/* if all selected btls support remote completion, then all RMA operations have finished
* on remote side. A barrier is enough to complete the fence.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indenting

* if that is the case, this function will not have been called
*/
assert (!ompi_osc_rdma_peer_local_state (peer));
ret = ompi_osc_rdma_lock_btl_op (module, peer, target, MCA_BTL_ATOMIC_ADD, 1, true);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that this is sufficient. BTL operations (including RDMA operations) are not implicitly ordered. An ordering flag must be set on every operation that must be ordered in order for this to be guaranteed to work. TCP doesn't take advantage of reordering (and I don't think OFI does either) so you likely won't see this in testing. but the verbs BTL did have an ordering problem and we haven't audited the other BTLs, so we should really follow the spec).

@wzamazon wzamazon force-pushed the oneside_fix_upstream branch from 5daf595 to 70285a8 Compare November 14, 2021 22:19
@ibm-ompi
Copy link

The IBM CI (GNU/Scale) build failed! Please review the log, linked below.

Gist: https://gist.github.com/37f2b0805c6ce06764bb9c85e9542070

@ibm-ompi
Copy link

The IBM CI (XL) build failed! Please review the log, linked below.

Gist: https://gist.github.com/e62bf6d29d28dbe7485e433e33d4a2ab

@ibm-ompi
Copy link

The IBM CI (PGI) build failed! Please review the log, linked below.

Gist: https://gist.github.com/d6cd677f88a124039f111dd524099019

Currently, osc_rdma_query_alternate_btls() select alternate btls
from a pre-defined list of btls osc_rdma_alternate_btl_names.

The patch removed the restriction so any btl can be used as
alternate btl. However, for alternate btls, osc/rdma will disable
their native rdma and atomics implementations and use active message
RDMA/atomics for them. This is because the atomicity across btls
cannot be guranteed.

Signed-off-by: Wei Zhang <[email protected]>
@wzamazon wzamazon force-pushed the oneside_fix_upstream branch from 70285a8 to fcdb885 Compare November 15, 2021 00:45
Currently, ompi_osc_rdma_new_peer() can still proceed if no btl endpoint has
been found a new peer. This caused segfault when a process try
to communicated with the peer.

This patch change it to: if a btl endpoint cannot be found for
a peer, ompi_osc_rdma_new_peer() wil not proceed.

Signed-off-by: Wei Zhang <[email protected]>
This patch introduced a new flag MCA_BTL_FLAGS_RDMA_REMOTE_COMPLETION,
which is used to indicate whether a btl's RDMA operations support
remote completion.

3 btls support this feature: self, ofi and ugni, thus this patch
added the flags to them.

Signed-off-by: Wei Zhang <[email protected]>
this patch introduce 3 types of btl ordering requirements:

    MCA_BTL_IN_ORDER_RDMA_ATOMICS,
    MCA_BTL_IN_ORDER_SEND,
    MCA_BTL_IN_ORDER_ALL.

Signed-off-by: Wei Zhang <[email protected]>
Currently, active message RDMA/atomics always use MCA_BTL_NO_ORDER
on the btl, even when caller specified ordering requirement. This
patch makes active message RDMA/atomics to honor caller's ordering
requirement.

Signed-off-by: Wei Zhang <[email protected]>
This patch fix the condition of using local leader and
cpu atomics.

The local leader optimization is:

  on each node, a process was designated as the local leader,
  who setup shared memory, and other processes on the same
  node would map their states to local leader's shared memory.

  When a process try to update a peer process's state, the
  process will do that through atomic actions on local leader's
  memory. The peer's state is then updated through shard memory.

The cpu atomics optimizaton is:

  for processes on the same node, use cpu atomics to update peer's
  state (instead of using the selected btl).

Both "local leader" and "cpu atomics" are using different
channels to transfer data and to update peer's state.

This kind of optimizations requires selected BTLs to support
remote completion. Otherwise, there may be data corruption,
because peer's state can be updated before the RDMA operation
is completed on the peer.

However, currently, local leader is used unconditionally.

"cpu atomics" is used when btl support mixed usage of cpu
and NIC atomics.

Both are used wrongly and will cause data corruption.

This patch address the issue by only use "local leader" and
"cpu atomics" if all seleted BTLs support remote completion.

When "local leader" is not used, each process need to have its peers'
state, for which this patch introduced a function gather_peer_state().

This patch then sets peer's state pointer using gathered information,
and use the same endpoint to update state and transfer data.

Signed-off-by: Wei Zhang <[email protected]>
Currently, fence is implemented by two steps:

First, waiting for RDMA operiatons to complete locally.
Second, call coll->barrier on the communicator.

This is correct only if selected BTL support remote completion.
Otherwise, it can happen that when coll->barrier() finished, the remote
side of the RDMA operation has not completed yet.

This patch implemented a different barrier mechanism, which is used
when any of the selected BTL does not support remote completion.

In which case, each process will post an atomic operation to every
peer to increase a counter on the peer through the selected BTL endpoint.
Though wait for its own counter to reach number of peers. This ensures
all previous RDMA operations have completed.

Signed-off-by: Wei Zhang <[email protected]>
@awlauria
Copy link
Contributor

bot:aws:retest

java failure.

@wzamazon
Copy link
Contributor Author

Replaced by 3 PRs:

#9694
#9695
#9696

@wzamazon wzamazon closed this Nov 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants