-
Notifications
You must be signed in to change notification settings - Fork 900
osc/rdma, btl: fix two issues with one-sided #9594
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
tested using ompi-tests/onesided |
opal/mca/btl/base/base.h
Outdated
@@ -64,6 +64,11 @@ OPAL_DECLSPEC int mca_btl_base_param_register(mca_base_component_t *version, | |||
mca_btl_base_module_t *module); | |||
OPAL_DECLSPEC int mca_btl_base_param_verify(mca_btl_base_module_t *module); | |||
|
|||
OPAL_DECLSPEC int mca_btl_base_am_atomic_32(int32_t *operand, opal_atomic_int32_t *addr, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is the right way to add atomic support to btl/self. You shouldn't have to do anything to use the am interface with btl/self. The osc rdma component should call mca_btl_base_am_rdma_init() on the selected btl, which will fill in these functions. Now, we can (and probably should) have a discussion with Nathan about why the osc rdma component calls am_rdma_init instead of the btl itself, but either way, I don't think this commit is right.
@@ -267,7 +267,7 @@ static int ompi_osc_rdma_component_register (void) | |||
MCA_BASE_VAR_SCOPE_GROUP, &ompi_osc_rdma_btl_names); | |||
free(description_str); | |||
|
|||
ompi_osc_rdma_btl_alternate_names = "sm,tcp"; | |||
ompi_osc_rdma_btl_alternate_names = "self,sm,tcp"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is the right thing to do. There's no real reason to allow-list alternate BTLs, and it still means that we don't have a good solution for BTLs that aren't in one of the two allowlists. I can understand the first-level allow list (for now), but here we should just handle as many BTLs as it takes to reach everyone in the widow, and any BTL should work (that's the whole point of the AM wrapper).
@@ -83,8 +83,8 @@ int ompi_osc_rdma_new_peer (struct ompi_osc_rdma_module_t *module, int peer_id, | |||
|
|||
/* find a btl/endpoint to use for this peer */ | |||
int ret = ompi_osc_rdma_peer_btl_endpoint (module, peer_id, &module_btl_index, &endpoint); | |||
if (OPAL_UNLIKELY(OMPI_SUCCESS != ret && !((module->selected_btls[0]->btl_atomic_flags & MCA_BTL_ATOMIC_SUPPORTS_GLOB) && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This conditional is almost certainly wrong, but I don't entirely understand Nathan's comments about the local leader optimization being required on uGNI. I'm pretty sure that this is to handle whatever condition causes us to have to worry about the local leader optimization, so we should explicitly get an ack from @hjelmn here and understand what's going on.
@@ -261,6 +261,17 @@ typedef uint8_t mca_btl_base_tag_t; | |||
/* The BTL has active-message based atomics */ | |||
#define MCA_BTL_FLAGS_ATOMIC_AM_FOP 0x400000 | |||
|
|||
/* This flag indicates whether the BTL's RDMA operation supports |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know most of this file isn't, but everything in this file should be in doxygen format (so start with /**)
module->state->num_fenced_peers = 0; | ||
OPAL_THREAD_UNLOCK(&(module->lock)); | ||
ret = module->comm->c_coll->coll_barrier(module->comm, module->comm->c_coll->coll_barrier_module); | ||
if (ret) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no implicit cast from int to bool
} | ||
|
||
/* for each process in the group increment their number of fenced peers */ | ||
for (int i = 0 ; i < num_peers; ++i) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
be consistent in spacing around ;
/* for each process in the group increment their number of fenced peers */ | ||
for (int i = 0 ; i < num_peers; ++i) { | ||
ompi_osc_rdma_peer_t *peer = peers[i]; | ||
intptr_t target = (intptr_t) peer->state + offsetof (ompi_osc_rdma_state_t, num_fenced_peers); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no space between offseof and argument.
ret = module->comm->c_coll->coll_barrier(module->comm, module->comm->c_coll->coll_barrier_module); | ||
if (module->btl_support_remote_completion) { | ||
/* if all selected btls support remote completion, then all RMA operations have finished | ||
* on remote side. A barrier is enough to complete the fence. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indenting
* if that is the case, this function will not have been called | ||
*/ | ||
assert (!ompi_osc_rdma_peer_local_state (peer)); | ||
ret = ompi_osc_rdma_lock_btl_op (module, peer, target, MCA_BTL_ATOMIC_ADD, 1, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure that this is sufficient. BTL operations (including RDMA operations) are not implicitly ordered. An ordering flag must be set on every operation that must be ordered in order for this to be guaranteed to work. TCP doesn't take advantage of reordering (and I don't think OFI does either) so you likely won't see this in testing. but the verbs BTL did have an ordering problem and we haven't audited the other BTLs, so we should really follow the spec).
5daf595
to
70285a8
Compare
The IBM CI (GNU/Scale) build failed! Please review the log, linked below. Gist: https://gist.github.com/37f2b0805c6ce06764bb9c85e9542070 |
The IBM CI (XL) build failed! Please review the log, linked below. Gist: https://gist.github.com/e62bf6d29d28dbe7485e433e33d4a2ab |
The IBM CI (PGI) build failed! Please review the log, linked below. Gist: https://gist.github.com/d6cd677f88a124039f111dd524099019 |
Currently, osc_rdma_query_alternate_btls() select alternate btls from a pre-defined list of btls osc_rdma_alternate_btl_names. The patch removed the restriction so any btl can be used as alternate btl. However, for alternate btls, osc/rdma will disable their native rdma and atomics implementations and use active message RDMA/atomics for them. This is because the atomicity across btls cannot be guranteed. Signed-off-by: Wei Zhang <[email protected]>
70285a8
to
fcdb885
Compare
Currently, ompi_osc_rdma_new_peer() can still proceed if no btl endpoint has been found a new peer. This caused segfault when a process try to communicated with the peer. This patch change it to: if a btl endpoint cannot be found for a peer, ompi_osc_rdma_new_peer() wil not proceed. Signed-off-by: Wei Zhang <[email protected]>
This patch introduced a new flag MCA_BTL_FLAGS_RDMA_REMOTE_COMPLETION, which is used to indicate whether a btl's RDMA operations support remote completion. 3 btls support this feature: self, ofi and ugni, thus this patch added the flags to them. Signed-off-by: Wei Zhang <[email protected]>
this patch introduce 3 types of btl ordering requirements: MCA_BTL_IN_ORDER_RDMA_ATOMICS, MCA_BTL_IN_ORDER_SEND, MCA_BTL_IN_ORDER_ALL. Signed-off-by: Wei Zhang <[email protected]>
Currently, active message RDMA/atomics always use MCA_BTL_NO_ORDER on the btl, even when caller specified ordering requirement. This patch makes active message RDMA/atomics to honor caller's ordering requirement. Signed-off-by: Wei Zhang <[email protected]>
This patch fix the condition of using local leader and cpu atomics. The local leader optimization is: on each node, a process was designated as the local leader, who setup shared memory, and other processes on the same node would map their states to local leader's shared memory. When a process try to update a peer process's state, the process will do that through atomic actions on local leader's memory. The peer's state is then updated through shard memory. The cpu atomics optimizaton is: for processes on the same node, use cpu atomics to update peer's state (instead of using the selected btl). Both "local leader" and "cpu atomics" are using different channels to transfer data and to update peer's state. This kind of optimizations requires selected BTLs to support remote completion. Otherwise, there may be data corruption, because peer's state can be updated before the RDMA operation is completed on the peer. However, currently, local leader is used unconditionally. "cpu atomics" is used when btl support mixed usage of cpu and NIC atomics. Both are used wrongly and will cause data corruption. This patch address the issue by only use "local leader" and "cpu atomics" if all seleted BTLs support remote completion. When "local leader" is not used, each process need to have its peers' state, for which this patch introduced a function gather_peer_state(). This patch then sets peer's state pointer using gathered information, and use the same endpoint to update state and transfer data. Signed-off-by: Wei Zhang <[email protected]>
Currently, fence is implemented by two steps: First, waiting for RDMA operiatons to complete locally. Second, call coll->barrier on the communicator. This is correct only if selected BTL support remote completion. Otherwise, it can happen that when coll->barrier() finished, the remote side of the RDMA operation has not completed yet. This patch implemented a different barrier mechanism, which is used when any of the selected BTL does not support remote completion. In which case, each process will post an atomic operation to every peer to increase a counter on the peer through the selected BTL endpoint. Though wait for its own counter to reach number of peers. This ensures all previous RDMA operations have completed. Signed-off-by: Wei Zhang <[email protected]>
fcdb885
to
465af9a
Compare
bot:aws:retest java failure. |
This PR contains a series of patches that aims to fix
#8983
and
#7830