Skip to content

v5.0.x: request: correctly handle MPI_COMM_NULL #12984

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 0 commits into from

Conversation

janjust
Copy link
Contributor

@janjust janjust commented Dec 17, 2024

ompi_request_check_same_instance(): ignore requests whose communicator is ompi_mpi_comm_null. That can occur when pml/ucx is used and mca_pml_ucx_completed_request_init() was invoked.

Thanks Christian Huettig for the report and helping with the troubleshooting.

Refs. #12942

Signed-off-by: Gilles Gouaillardet [email protected]
(cherry picked from commit 607f3a4)

@bosilca
Copy link
Member

bosilca commented Dec 17, 2024

I'm not sure this is the right fix. From a standard perspective MPI_COMM_NULL is nothing that can be used for communications, so associating it with a request is a logical fallacy.

@jsquyres
Copy link
Member

I'm not sure this is the right fix. From a standard perspective MPI_COMM_NULL is nothing that can be used for communications, so associating it with a request is a logical fallacy.

I agree -- @ggouaillardet Can you explain why this is correct?

Taking the liberty to moving this to Draft just so that we don't merge it before getting the explanation. Thanks!

@jsquyres jsquyres marked this pull request as draft December 17, 2024 15:21
@bosilca
Copy link
Member

bosilca commented Dec 17, 2024

The issue does not come from @ggouaillardet fix, he only tries to address the segfault we have right now. I did some digging last week, but the PR was merged before I had time to comment on it.

The original code in UCX was using MPI_COMM_WORLD for the early completed isend requests, because the user cannot retrieve the communicator anyway, and any valid communicator is good enough. Sidenote: This would break the FT layer but that's another discussion. When @hppritcha merged the session changes he changed the MPI_COMM_WORLD (which has a dedicated instance in our implementation) to MPI_COMM_NULL, which does not have an associated instance.

I'm thorn regarding the correct approach. I understand that associating the completed request with a predefined communicator removes the need for any bookkeeping for that request, so it is performance related. But whatever we are saving for such peculiar requests, we will be losing by adding the extra branches to all instance checks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants