-
Notifications
You must be signed in to change notification settings - Fork 900
osc/rdma, btl/tcp: fix various issues with osc/rdma #9400
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
osc/rdma, btl/tcp: fix various issues with osc/rdma #9400
Conversation
Will have to look through this. We use a single state endpoint per node with real RDMA networks for both correctness and performance reasons. Does this PR change all cases or just the multi-btl case? |
I can confirm it works on a single node. However, running across two nodes I see the same issue pop up again at > 2 ranks. That is a huge improvement, and might be a separate issue related to #7830 |
e31a4de
to
b95b89a
Compare
@awlauria The 2 nodes failure is because local leader is also used on multiple nodes communication. To give an example, say we have 4 ranks split on 2 nodes: rank 0 and 1 are on node 0, rank 2 and 3 are on node 1. rank 0 and rank 2 are local leader. rank 2 is going to do one-sided operation on rank 1, for which rank2 communicate with rank 1 directly. After rank 2 is done, rank 2 must update counter on rank 1. Now instead of communicate directly with rank 1, rank 2 communicate with rank 0 (who is leader of rank 1) to update a memory region on rank 0, which maps to the counter on rank 1. I just updated the PR to not use local leader for multi node, so the new PR should work for tcp on 2 nodes. However, I believe removing local leader would cause performance degradation for existing btls that can use (and benefit) from it, which is undesirable. Because btl/tcp is the outlier here (that does not work with local leader) , I think it is better to introduce a flag to indicate a btl cannot use local leader, and have btl/tcp toggle it on. Meanwhile, instead of removing the local leader with the new PR, we add it alongside with local leader, and only use the new PR with btl/tcp (identified by the new flag). @hjelmn does the plan sound right to you? |
The MCA_BTL_ATOMIC_SUPPORTS_GLOB flag indicates whether a btl atomics can be mixed with CPU's atomics operations. The active message RDMA does support this capability, therefore this patch unset the flag for active message RDMA. btl/ofi's RDMA does support this capability because it requested delivery complete from libfabric, therefore this patch set the flag for btl/ofi. Signed-off-by: Wei Zhang <[email protected]>
b95b89a
to
ef68652
Compare
I realized that I do not need to introduce new btl flag, and I can use existing flags that indicate active message put and atomics are used, because this the case local leader optimization does not work. @hjelmn The latest version of PR limits the change to when active message put/atomics is used. Please take a look! @awlauria Would you please test the latest version of the PR? It should work both on single node and on multiple nodes. Thank you! |
ef68652
to
0e5d66a
Compare
The Pull Request Build Checker failure is because some git checkout issue.
Is there a way to re-trigger the test? |
bot:aws:retest |
The flag "using_cpu_atomics" indicates whether CPU atomcis can be use to update counters for osc/rdma peers on the same instance. This patch fixed two bugs in the way "using_cpu_atomics" is evaluated in allocate_state_shared(): First, current code used "btl->btl_flags", which is a typo. The right fla to use is "btl->btl_atomic_flags". Second, current code always set "using_cpu_atomics" to true on single node, which is not right because even on single node, the selected btl does not necessary support mixed use of atomics. Signed-off-by: Wei Zhang <[email protected]>
osc_rdma_new_peer() creates a new peer. It should fail when the following 3 conditions are all met: 1. cannot find an endpoint 2. the selected btl does not support atomic global 3. the peer is not self However, the current code fails when either 2 or 3 is met. This patch fixed it. Signed-off-by: Wei Zhang <[email protected]>
The local leader optimization means that: on each node, a process was designated as the local leader, who setup shared memory, and other processes on the same node would map their states to local leader's shared memory. When a process try to update a peer process's state, the process will do that through atomic actions on local leader's memory. The peer's state is then updated through shard memory. Essentially, using local leader optimization means two different channels are used to transfer data and to update peer's update. This optimization is incorrect for BTL that uses active message RDMA . Active message RDMA uses send/receive to emulate put and atomics, and its put implementation does not delivery complete, e.g. when the initiator got completion for a put action, it only means data has been sent. it does not mean the data has been delivered to the target buffer. Therefore, if peer's state is updated through a different communication channel, it can happen that peer's state is updated before the put action is completed on the peer, which will cause data corruption. This patch made the change that: for active message RDMA, peer's state is updated using the same channel data is transferred (e.g diabling the local leader optimization). To achieve that, each process need to have the pointer to each peer's state, for which this patch introduced a function gather_peer_state(). Note because active message RDMA does not use memory registration, the state_handle is not gathered. This patch then sets peer's state pointer using gathered information, and use the same endpoint to update data and transfer data. Signed-off-by: Wei Zhang <[email protected]>
Current shm file name used by osc/rdma is not unique when multiple communicator are on the same instance. This patch add the local leader's pid to the file name to address the issue. Signed-off-by: Wei Zhang <[email protected]>
Currently, each peer keep a state_btl_index and a data_btl_index. The btl_index is used to retrive btl used for data transfer and state updating, by calling function mpi_osc_rdma_selected_btl(). This patch simplify the code by directly storing the pointer to the btl inside peer, thus bypassing the call to mpi_osc_rdma_selected_btl(). The function mpi_osc_rdma_selected_btl() is then removed. Signed-off-by: Wei Zhang <[email protected]>
ompi_osc_rdma_peer_btl_endpoint() is used to select btl and endpoint to communicate with a peer. This patch added a change to ompi_osc_rdma_peer_btl_endpoint() that: if no btl/endpoint has been selected for self communication, and if bml has the btl/self, then use btl/self for self communication. It also made a change to ompi_osc_rdma_new_peer(): Currently if no btl can be found and the peer is self, the function still continues. This patch made the function to fail in this case, because the ability to do self communication is essential for osc/rdma. Signed-off-by: Wei Zhang <[email protected]>
The flag MCA_BTL_ATOMIC_SUPPORTS_GLOB indicates that a btl supports mix use CPU atomics with its own atomics. The btl/self supports this because for btl/self the two are the same thing. Therefore this patch added the flag to btl_atomic_flags of btl/self. Signed-off-by: Wei Zhang <[email protected]>
Currently, use_cpu_atomics is a variable in osc_rdma_module. However, whether CPU atomics can be used is a peer level decision. So this patch remove use_cpu_atomics from osc_rdma_module, and evaluate it for each peer. Signed-off-by: Wei Zhang <[email protected]>
0e5d66a
to
5d9e5a2
Compare
@wzamazon Is this PR ready for review, or are you still iterating on it? |
It is ready for review. |
I am still having some errors when running ompi-tests/onesided tests, and I am investigating. But commits in this PR are necessary to fix existing issues regardless what the investigation turned out. |
@gpaulsen I had a discussion with @bwbarrett the current approach of using btl/self is not desirable, so I will try a different (more general) approach. Please hold off reviewing. |
@@ -393,7 +393,8 @@ mca_btl_ofi_module_t *mca_btl_ofi_module_alloc(int mode) | |||
module->super.btl_flags |= MCA_BTL_FLAGS_ATOMIC_FOPS | MCA_BTL_FLAGS_ATOMIC_OPS | |||
| MCA_BTL_FLAGS_RDMA; | |||
|
|||
module->super.btl_atomic_flags = MCA_BTL_ATOMIC_SUPPORTS_ADD | MCA_BTL_ATOMIC_SUPPORTS_SWAP | |||
module->super.btl_atomic_flags = MCA_BTL_ATOMIC_SUPPORTS_GLOB |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is right, and the commit comment for this change is definitely wrong. The EFA provider can guarantee GLOB not because of the delivery complete request, but because atomics are implemented in software. For providers that implement atomics in hardware, GLOB may not be supportable.
@@ -1114,7 +1114,7 @@ int mca_btl_base_am_rdma_init(mca_btl_base_module_t *btl) | |||
|
|||
/* emulated RDMA atomics can support the full range of atomics. for | |||
* now only a handful are supported. */ | |||
btl->btl_atomic_flags = MCA_BTL_ATOMIC_SUPPORTS_GLOB | MCA_BTL_ATOMIC_SUPPORTS_CSWAP | |||
btl->btl_atomic_flags = MCA_BTL_ATOMIC_SUPPORTS_CSWAP |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I follow; something else is missing here. The AM atomics are consistent with the CPU atomics, since the AM atomics are just implemented as CPU atomics. It's possible that some completion semantic is not correct for the OSC component, but that's a different problem.
Whatever the reason, someone is going to come to the same conclusion as I did and re-insert the GLOB flag. So once we're done figuring out the missing semantic that led you here, we need to update the comment on line 1116.
ompi/mca/osc/rdma/osc_rdma_peer.c
Outdated
if (OPAL_UNLIKELY(OMPI_SUCCESS != ret && !((module->selected_btls[0]->btl_atomic_flags & MCA_BTL_ATOMIC_SUPPORTS_GLOB) && | ||
peer_id == ompi_comm_rank (module->comm)))) { | ||
if (OPAL_UNLIKELY(OMPI_SUCCESS != ret && | ||
!(module->selected_btls[0]->btl_atomic_flags & MCA_BTL_ATOMIC_SUPPORTS_GLOB) && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming your earlier patch was applied, is this even somewhere we can get? I'm not sure why this is a run-time test instead of an assert?
* local leader to update peer's state through shared memory region. | ||
* BTLs that uses active message RDMA cannot support such optimization, | ||
* because active message RDMA uses send/receive to emulate put and | ||
* atomics, so the atomcis and RMA operation must be through the same |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is the right solution to the problem. The problem with the local leader optimization is that it assume ordering that AM doesn't give, but it's unclear to me that the entire implementation of OSC RDMA can deal with the AM ordering guarantees. I think we need to step back and answer the ordering guarantees question before adding this complexity.
return OMPI_ERR_OUT_OF_RESOURCE; | ||
} | ||
|
||
ret = module->comm->c_coll->coll_allgather(&module->state, sizeof(uintptr_t), MPI_BYTE, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm guessing the entire component isn't heterogeneous safe, but it's sad to see this code. It's not hard to use MPI's datatype engine to do the conversions, rather than sending a byte stream.
return OMPI_ERR_UNREACH; | ||
} | ||
if (module->use_local_leader) { | ||
/* each node is responsible for holding a part of the rank -> node/local rank mapping array. this code |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pull this out into its own function
mca_osc_rdma_component.backing_directory, ompi_process_info.nodename, | ||
OMPI_PROC_MY_NAME->jobid, ompi_comm_get_cid(module->comm)); | ||
OMPI_PROC_MY_NAME->jobid, ompi_comm_get_cid(module->comm), getpid()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The commit message for this change talks about multiple communicators on teh same instance, but we're adding a pid. I'm not sure how that helps?
/** index into BTL array */ | ||
uint8_t state_btl_index; | ||
/** btl used for rdma */ | ||
struct mca_btl_base_module_t *data_btl; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the difference in the size of the ompi_osc_rdma_peer_t structure with this change and how does that compare with the overall size of the structure? I think it's going to be 12 bytes bigger, but I may be getting the padding / alignment wrong, because I did it in my head. While 12 bytes isn't much, there is one of these structures per peer, and that can be millions of peers. Now, I'd argue that this structure is way too big anyway, but ignoring that, I'm pretty sure this was a space optimization and we should think about the implications of that.
@@ -73,7 +73,19 @@ static int ompi_osc_rdma_peer_btl_endpoint (struct ompi_osc_rdma_module_t *modul | |||
} | |||
} | |||
|
|||
/* unlikely but can happen when creating a peer for self */ | |||
if (peer_id == ompi_comm_rank (module->comm)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm really concerned about this approach. I suppose it works, but it feels like the right place for this decision is during the initial btl selection logic.
@@ -107,6 +107,7 @@ static int mca_btl_self_component_register(void) | |||
mca_btl_self.btl_rdma_pipeline_frag_size = INT_MAX; | |||
mca_btl_self.btl_min_rdma_pipeline_size = 0; | |||
mca_btl_self.btl_flags = MCA_BTL_FLAGS_RDMA | MCA_BTL_FLAGS_SEND_INPLACE | MCA_BTL_FLAGS_SEND; | |||
mca_btl_self.btl_atomic_flags = MCA_BTL_ATOMIC_SUPPORTS_GLOB; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't make sense to me, since the self BTL doesn't support atomics?
@wzamazon sorry - was out these past few days, and unable to test before I left. Will try it out today or tomorrow, thanks! |
This PR need some rework from beginning. Closing for now. |
This PR addresses #8983 and #7830