osc/rdma: two bug fixes #9358
Currently, in function allocate_state_shared, "module->use_memory_registration" is set to false when all MPI ranks are on the same instance (local_size == global_size). This is harmful when only one btl is in use, in which case the selected btl should determine whether memory registration is used. For example, btl/ofi uses memory registration even on the same instance. It is also unnecessary when two btls are in use, in which case btls that use memory registration have already been excluded in function ompi_osc_rdma_query_alternate_btls. Therefore, this commit removes the setting of module->use_memory_registration in allocate_state_shared.
Signed-off-by: Wei Zhang <[email protected]>
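To make the single-btl case concrete, here is a minimal sketch of the behavior described in this commit message. It is not the osc/rdma code: the `sketch_*` types and functions are hypothetical stand-ins, and `needs_registration` is an assumed flag that models a btl (such as btl/ofi) that requires memory registration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a btl; only the property discussed here. */
struct sketch_btl {
    bool needs_registration; /* assumed flag, e.g. true for an ofi-like btl */
};

/* Hypothetical stand-in for the osc/rdma module. */
struct sketch_module {
    bool use_memory_registration;
};

/* Single-btl case: the selected btl decides whether registration is used. */
void sketch_select_single_btl(struct sketch_module *module,
                              const struct sketch_btl *btl)
{
    assert(module != NULL && btl != NULL);
    module->use_memory_registration = btl->needs_registration;
}

/* Old behavior: the flag is forced to false when every rank is local,
 * overriding what the selected btl asked for. */
void sketch_allocate_state_shared_old(struct sketch_module *module,
                                      int local_size, int global_size)
{
    if (local_size == global_size) {
        module->use_memory_registration = false;
    }
}

/* New behavior: the function no longer touches the flag. */
void sketch_allocate_state_shared_new(struct sketch_module *module,
                                      int local_size, int global_size)
{
    (void) module;
    (void) local_size;
    (void) global_size;
}
```

With an ofi-like btl and all ranks on one instance, the old variant clears the flag the btl relies on, while the new variant leaves the btl's choice in place.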
In function "allocate_state_shared", peer->state_endpoint was copied from the 1st peer (a.k.a. local_leader). However, the state_endpoint and state_btl_index of the 1st peer were not set, causing every peer's state_endpoint to be NULL. This patch addresses the issue by setting the 1st peer's state_endpoint and state_btl_index from its data_endpoint and data_btl_index.
Signed-off-by: Wei Zhang <[email protected]>
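The ordering issue in this commit message can be sketched as follows. This is an illustration under stated assumptions, not the actual patch: `struct sketch_peer` and `sketch_init_local_peer_state` are hypothetical, only the four field names come from the description above, and copying state_btl_index to the other peers alongside state_endpoint is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical peer record; only the field names come from the PR text. */
struct sketch_peer {
    void *data_endpoint;
    int   data_btl_index;
    void *state_endpoint;
    int   state_btl_index;
};

/* Set the leader's state fields first, then copy them to every other peer. */
void sketch_init_local_peer_state(struct sketch_peer *peers, size_t count)
{
    assert(peers != NULL && count > 0);

    struct sketch_peer *leader = &peers[0]; /* 1st peer, a.k.a. local_leader */

    /* The fix: without these two assignments the leader's state_endpoint
     * stays NULL and the loop below propagates NULL to every peer. */
    leader->state_endpoint  = leader->data_endpoint;
    leader->state_btl_index = leader->data_btl_index;

    for (size_t i = 1; i < count; ++i) {
        peers[i].state_endpoint  = leader->state_endpoint;
        peers[i].state_btl_index = leader->state_btl_index; /* assumed */
    }
}
```

If the two leader assignments are dropped, the loop still runs but every peer ends up with a NULL state_endpoint, which is the symptom the commit describes.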
The GitHub Action CI failed because I had not added a description to the PR. After I added one, the GitHub Action CI passed.
Is there anything else needed for this PR to get merged?
I'd like @hjelmn to comment in case I was totally off in my review, even if he doesn't do a full review.
@hjelmn Do you have time to take a look?
@hjelmn Can you please take a look at this PR?
Thanks! I will open a PR to backport to v5.0.x.
fix two bugs in allocate_state_shared in osc/rdma/osc_rdma_component.c