[v5.0.x] ompi/dpm: make procs consistent before calling PMIx_Connect() #10564

Merged: 1 commit into open-mpi:v5.0.x on Jul 15, 2022

Conversation

wzamazon
Contributor

ompi_dpm_connect_accept() calls PMIx_Connect() to establish the
connection, passing "procs" as an argument.

PMIx requires "procs" to be consistent across clients.

When it is used to set up inter-communicator communication,
ompi_dpm_connect_accept() does not maintain a consistent order of
processes in "procs". This is because the function is called by both
MPI_Comm_connect() and MPI_Comm_accept(), and it always puts the
processes of the local communicator into "procs" first, followed by
the processes of the remote communicator.

However, the callers of MPI_Comm_connect() and MPI_Comm_accept() see
the local and remote communicators reversed, so the two sides build
"procs" in different orders.

This patch fixes the issue by sorting "procs" before it is passed to
PMIx_Connect(), which ensures that "procs" is consistent across
processes.

Signed-off-by: Wei Zhang <[email protected]>
(cherry picked from commit 0294303)
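
For context, here is a minimal sketch of the idea against the public PMIx API. It is not the actual patch, which operates on OMPI's internal proc structures in ompi/dpm, and the helper names proc_cmp and connect_sorted are made up for illustration: if every process sorts the pmix_proc_t array with the same comparator, say by namespace and then rank, both the connect and accept sides hand PMIx_Connect() an identically ordered "procs" array.

```c
/* Illustrative sketch only (not the actual patch): sort the pmix_proc_t
 * array by (nspace, rank) so every participant hands PMIx_Connect() the
 * same ordering, regardless of which side called MPI_Comm_connect() or
 * MPI_Comm_accept(). */
#include <stdlib.h>
#include <string.h>
#include <pmix.h>

static int proc_cmp(const void *a, const void *b)
{
    const pmix_proc_t *pa = (const pmix_proc_t *) a;
    const pmix_proc_t *pb = (const pmix_proc_t *) b;
    int ret = strncmp(pa->nspace, pb->nspace, PMIX_MAX_NSLEN);
    if (0 != ret) {
        return ret;
    }
    return (pa->rank > pb->rank) - (pa->rank < pb->rank);
}

static pmix_status_t connect_sorted(pmix_proc_t *procs, size_t nprocs)
{
    /* Same input set on every process => same sorted order everywhere. */
    qsort(procs, nprocs, sizeof(pmix_proc_t), proc_cmp);
    return PMIx_Connect(procs, nprocs, NULL, 0);
}
```

Because the comparator is deterministic and every participant starts from the same set of processes, the sorted arrays come out identical everywhere, which is what PMIx requires.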

@wzamazon
Contributor Author

Backport of #10557 to v5.0.x.

@hppritcha
Member

Hmm, looks like Azure is unhappy.
/azp run

@hppritcha
Member

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@hppritcha
Member

Looks like Docker issues at mlnx:

Error response from daemon: Get http://rdmz-harbor.rdmz.labs.mlnx/v2/hpcx/ompi_ci/manifests/latest: received unexpected HTTP status: 500 Internal Server Error

@wzamazon
Contributor Author

bot:mellanox:retest

@wzamazon
Contributor Author

The Mellanox CI issue seems to be persistent. I tried a few times, and it always fails.

@wckzhang
Contributor

Mellanox CI was also having trouble last week. @artemry-nv can you take a look?

@artemry-nv

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@artemry-nv

> Mellanox CI was also having trouble last week. @artemry-nv can you take a look?

Some issues with the container registry - working on this.

@wzamazon
Contributor Author

Mellanox CI fixed itself and passed.

@awlauria Can you merge this PR?

@awlauria merged commit 539dece into open-mpi:v5.0.x on Jul 15, 2022.