[v5.0.x] ompi/dpm: make procs consistent before calling PMIx_Connect() #10564
Conversation
ompi_dpm_connect_accept() calls PMIx_Connect() to establish a connection, passing "procs" as an argument. PMIx requires "procs" to be consistent across clients. When it is used to set up inter-communicator communication, ompi_dpm_connect_accept() does not maintain a consistent ordering of processes in "procs": the function is called by both MPI_Comm_connect() and MPI_Comm_accept(), and it always puts the processes of the local communicator into "procs" first, followed by the processes of the remote communicator. However, for the callers of MPI_Comm_connect() and MPI_Comm_accept(), the local and remote communicators are swapped. This patch fixes the issue by sorting "procs" before it is passed to PMIx_Connect(), which ensures that "procs" is consistent across processes. Signed-off-by: Wei Zhang <[email protected]> (cherry picked from commit 0294303)
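For illustration, here is a minimal sketch of the idea behind the fix (the comparator and the `connect_sorted()` wrapper are hypothetical, not the actual diff; only the standard `pmix_proc_t` fields and the `PMIx_Connect()` signature from `pmix.h` are assumed):

```c
#include <stdlib.h>
#include <string.h>
#include <pmix.h>

/* Illustrative comparator: order pmix_proc_t entries by namespace
 * string first, then by rank, so every participant builds an
 * identical "procs" array regardless of which side of
 * connect/accept it is on. */
static int compare_pmix_procs(const void *a, const void *b)
{
    const pmix_proc_t *p1 = (const pmix_proc_t *)a;
    const pmix_proc_t *p2 = (const pmix_proc_t *)b;
    int ret = strncmp(p1->nspace, p2->nspace, PMIX_MAX_NSLEN);
    if (0 != ret) {
        return ret;
    }
    if (p1->rank < p2->rank) return -1;
    if (p1->rank > p2->rank) return 1;
    return 0;
}

/* Hypothetical connect path: sort before the collective call so
 * that "procs" is consistent across all clients, as PMIx requires. */
static pmix_status_t connect_sorted(pmix_proc_t *procs, size_t nprocs)
{
    qsort(procs, nprocs, sizeof(pmix_proc_t), compare_pmix_procs);
    return PMIx_Connect(procs, nprocs, NULL, 0);
}
```

Because the (nspace, rank) pair uniquely identifies a process, sorting on it yields a total order that does not depend on which communicator happens to be local on each side of the connect/accept.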
Back port of #10557 to v5.0.x.
Hmm, looks like Azure is unhappy.
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
Looks like Docker issues at MLNX: `Error response from daemon: Get http://rdmz-harbor.rdmz.labs.mlnx/v2/hpcx/ompi_ci/manifests/latest: received unexpected HTTP status: 500 Internal Server Error`
bot:mellanox:retest
The Mellanox CI issue seems to be persistent. I tried a few times, and it always fails.
Mellanox CI was also having trouble last week. @artemry-nv can you take a look?
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
Some issues with the container registry - working on this.
Mellanox CI fixed itself and passed. @awlauria Can you merge this PR?