v4.0.x: Update the PML selection/check logic to avoid direct modex "storms" #8408

Merged
merged 3 commits into from
Jan 26, 2021

@rhc54 rhc54 commented Jan 21, 2021

As currently written in this release branch, the PML selection "check" logic does not guarantee that, when a full modex has been performed, the caller's PML choice is checked only against that of MPI_COMM_WORLD rank 0. As a result, every process can end up calling "dmodex" to obtain the PML selection of every other process in the job, which causes a major wireup delay on the first call to communicate.
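
To get a feel for the scale of the problem described above, here is a back-of-the-envelope count in Python. It is only a toy model, not Open MPI code: the function name `pml_check_lookups` and the assumption that each peer check costs exactly one lookup are illustrative.

```python
def pml_check_lookups(nprocs: int, check_all_peers: bool) -> int:
    """Count PML-selection lookups across a job of `nprocs` ranks.

    Toy model (assumption): one lookup per (caller, peer) pair checked.
    """
    if check_all_peers:
        # Every rank asks for the selection of every other rank.
        return nprocs * (nprocs - 1)
    # Every rank except rank 0 asks only for rank 0's selection.
    return nprocs - 1


if __name__ == "__main__":
    for n in (64, 1024, 16384):
        print(n, pml_check_lookups(n, True), pml_check_lookups(n, False))
```

Under these assumptions the all-peers case grows quadratically with job size, while the rank-0-only case grows linearly, which is why the unintended fallback is so costly at scale.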

These cherry-picks contain the updates developed/committed to master after the code in this release branch was brought over to it. One additional cherry-pick was required to cleanly port the code.

bosilca and others added 3 commits January 20, 2021 19:43
With this patch, the best PML is selected earlier, before the
other PMLs are finalized. This provides a simpler mechanism to
intercept and hijack the PML (as done in the monitoring PML).

Signed-off-by: George Bosilca <[email protected]>
(cherry picked from commit 668aa15)
(cherry picked from commit 65fbffa)
For direct modex, all procs publish their selected PML module;
then, at add_procs, the PML module of each proc is checked
against that of every other proc in the add_procs call.
For full modex, there is no change in functionality: only rank 0
publishes its selected PML, and all other procs in the add_procs
call check their selected PML against rank 0's.
If the PMLs do not match, throw an error and exit.

Signed-off-by: Dipti Kothari <[email protected]>
(cherry picked from commit 5418cc5)
(cherry picked from commit 5de4423)
Signed-off-by: Ralph Castain <[email protected]>
(cherry picked from commit 56eb572)
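
The check described in the second commit message above can be sketched as follows. This is a simplified Python model, not the actual Open MPI implementation: `check_pml_selection`, `PMLMismatchError`, and the `published` dictionary (rank to published PML name) are hypothetical names introduced for illustration, and the PML names in the usage lines are just example values.

```python
class PMLMismatchError(RuntimeError):
    """Raised when a peer's published PML differs from the caller's."""


def check_pml_selection(my_rank, my_pml, published, ranks_in_call, direct_modex):
    """Compare the caller's PML against published selections.

    published:     dict mapping rank -> PML name, for ranks that published
    ranks_in_call: ranks passed to the (modelled) add_procs call
    Returns the number of peers that were checked.
    """
    if direct_modex:
        # Direct modex: every proc published, so check every other proc.
        targets = [r for r in ranks_in_call if r != my_rank]
    elif my_rank != 0:
        # Full modex: only rank 0 published; everyone else checks against it.
        targets = [0]
    else:
        # Full modex, rank 0: nothing to compare against.
        targets = []

    for peer in targets:
        peer_pml = published[peer]
        if peer_pml != my_pml:
            raise PMLMismatchError(
                f"rank {my_rank} selected {my_pml!r}, "
                f"but rank {peer} published {peer_pml!r}"
            )
    return len(targets)


if __name__ == "__main__":
    # Full modex: rank 2 only needs rank 0's selection.
    print(check_pml_selection(2, "ob1", {0: "ob1"}, [0, 1, 2], direct_modex=False))
```

The point of the model is that, in the full-modex case, the set of peers consulted never depends on the size of the add_procs call, so no per-peer lookups are needed.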
@rhc54 rhc54 added this to the v4.0.6 milestone Jan 21, 2021
@rhc54 rhc54 requested review from bosilca and bwbarrett January 21, 2021 03:45
@rhc54 rhc54 self-assigned this Jan 21, 2021
@rhc54 rhc54 marked this pull request as draft January 21, 2021 03:45

rhc54 commented Jan 21, 2021

I have moved this to "draft" status because I believe we need to revisit the PML selection check scheme. Please see #8404 (comment) for an explanation.

@rhc54 rhc54 marked this pull request as ready for review January 21, 2021 16:33

rhc54 commented Jan 21, 2021

After conversation, this is good to go!


rhc54 commented Jan 21, 2021

bot:aws:retest

@hppritcha hppritcha merged commit c3fe37d into open-mpi:v4.0.x Jan 26, 2021
@hoopoepg hoopoepg mentioned this pull request Feb 2, 2021
@rhc54 rhc54 deleted the cmr40/pml branch March 18, 2021 14:18