
PMIx_Fences - remove unneeded ones during MPI initialization #11305


Merged: 2 commits, Feb 21, 2023

Conversation

hppritcha
Member

This patch removes redundant PMIx Fences in the initialization procedure for MPI when using the World Process Model (WPM). See chapter 11 sections 2 and 3 of the MPI-4 standard for a discussion of the WPM and new Sessions model.

The patch does, however, turn what should have been a local operation supporting initialization of an MPI session into a global one. Note that this does not disable the Sessions feature; it just restricts, for now, the use cases in which it works to those similar to MPI initialization using the WPM.

Refactoring ompi_mpi_instance_init_common to be purely local would require changes that are too invasive for the current state of the 5.0.0 release cycle. See issue #11239.
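For context, the fences being discussed are PMIx_Fence calls in the init path. The following is a hypothetical, standalone sketch of a data-collecting fence using the standard PMIx client API; it is not Open MPI's actual instance.c code, and it must be launched under a PMIx-enabled runtime (e.g. mpirun or prun) to do anything useful.

```c
#include <stdio.h>
#include <pmix.h>

int main(void)
{
    pmix_proc_t myproc, wildcard;
    pmix_info_t info;
    bool collect = true;
    pmix_status_t rc;

    if (PMIX_SUCCESS != (rc = PMIx_Init(&myproc, NULL, 0))) {
        fprintf(stderr, "PMIx_Init failed: %d\n", rc);
        return 1;
    }

    /* Fence across all processes in our namespace: use the rank wildcard */
    PMIX_PROC_CONSTRUCT(&wildcard);
    PMIX_LOAD_PROCID(&wildcard, myproc.nspace, PMIX_RANK_WILDCARD);

    /* Ask the fence to also collect posted (committed) data; folding the
     * data exchange into one fence is what makes a separate fence redundant */
    PMIX_INFO_LOAD(&info, PMIX_COLLECT_DATA, &collect, PMIX_BOOL);

    rc = PMIx_Fence(&wildcard, 1, &info, 1);
    PMIX_INFO_DESTRUCT(&info);

    PMIx_Finalize(NULL, 0);
    return (PMIX_SUCCESS == rc) ? 0 : 1;
}
```

Each such fence is a global synchronization, which is why eliminating redundant ones measurably reduces init time.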

This patch also fixes the timings reported when Open MPI is built with the timing infrastructure enabled:

mpirun -np 8 ./ring_c
------------------ ompi_mpi_init ------------------
-- [opal_init_core.c:opal_init_util:opal_malloc_init]: 0.000031 / 0.000023 / 0.000043
-- [opal_init_core.c:opal_init_util:opal_show_help_init]: 0.000094 / 0.000085 / 0.000108
-- [opal_init_core.c:opal_init_util:opal_var_init]: 0.000002 / 0.000001 / 0.000003
-- [opal_init_core.c:opal_init_util:opal_var_cache]: 0.000399 / 0.000345 / 0.000442
-- [opal_init_core.c:opal_init_util:opal_arch_init]: 0.000057 / 0.000054 / 0.000065
-- [opal_init_core.c:opal_init_util:mca_base_open]: 0.000201 / 0.000178 / 0.000243
!! [opal_init_core.c:opal_init_util:total]: 0.000784 / 0.000686 / 0.000904
-- [opal_init.c:opal_init:opal_if_init]: 0.000074 / 0.000062 / 0.000084
-- [opal_init.c:opal_init:opal_init_psm]: 0.000010 / 0.000009 / 0.000011
-- [opal_init.c:opal_init:opal_net_init]: 0.000010 / 0.000008 / 0.000012
-- [opal_init.c:opal_init:opal_datatype_init]: 0.003596 / 0.000519 / 0.012865
!! [opal_init.c:opal_init:total]: 0.003689 / 0.000598 / 0.012972
-- [instance.c:ompi_mpi_instance_init_common:initialization]: 0.000991 / 0.000924 / 0.001064
-- [instance.c:ompi_mpi_instance_init_common:ompi_rte_init]: 0.007519 / 0.004406 / 0.016369
-- [instance.c:ompi_mpi_instance_init_common:PMIx_Commit]: 0.003164 / 0.002496 / 0.003640
-- [instance.c:ompi_mpi_instance_init_common:pmix-barrier-1]: 0.007725 / 0.000072 / 0.010423
-- [instance.c:ompi_mpi_instance_init_common:pmix-barrier-2]: 0.000138 / 0.000068 / 0.000159
-- [instance.c:ompi_mpi_instance_init_common:modex]: 0.000181 / 0.000115 / 0.000333
-- [instance.c:ompi_mpi_instance_init_common:modex-barrier]: 0.003143 / 0.002944 / 0.003308
-- [instance.c:ompi_mpi_instance_init_common:barrier]: 0.000373 / 0.000161 / 0.000618
!! [instance.c:ompi_mpi_instance_init_common:total]: 0.023234 / 0.011186 / 0.035914
[ompi_mpi_init.c:ompi_mpi_init:barrier-finish]: 0.023557 / 0.023051 / 0.024240
[ompi_mpi_init:total] 0.023557 / 0.023051 / 0.024240
[ompi_mpi_init:overhead]: 0.000240

The timing points can be refined by others depending on their needs.
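For reference, the output above comes from Open MPI's developer timing instrumentation. A rough sketch of how to reproduce it follows; the prefix and source paths are placeholders, the build assumes the --enable-timing configure option, and the exact runtime knobs may differ between releases.

```shell
# Hypothetical reproduction steps; adjust paths for your environment.
./configure --prefix=$HOME/ompi-timing --enable-timing
make -j 8 && make install

# Build and run the example program; per-interval timing summaries
# (three values per instrumented interval) are printed during MPI_Init.
export PATH=$HOME/ompi-timing/bin:$PATH
mpicc -o ring_c ring_c.c
mpirun -np 8 ./ring_c
```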

Related to #11166

Signed-off-by: Howard Pritchard [email protected]

@jsquyres
Member

bot:aws-v2:retest

@jsquyres jsquyres requested a review from janjust January 18, 2023 16:17
@awlauria
Contributor

@jsquyres do you know how to retest this PR failure?

@rhc54
Contributor

rhc54 commented Jan 20, 2023

@jsquyres do you know how to retest this PR failure?

It's the new bot command:
bot:aws-v2:retest

@hppritcha
Member Author

bot:aws-v2:retest

@hppritcha
Member Author

not sure this bot:aws-v2:retest is actually working

@bwbarrett
Member

@hppritcha can you rebase this PR on the head of main? That will fix the CI issue.

The long story on why it will fix it is that when we initially started rolling out the PR tester, we had it configured to rebase the PR on the target branch and test that. Because the target branch (main in this case) had the Jenkinsfile committed, Jenkins tried to run the test. But the downside is that every time there is a commit to the target branch, all the target PRs rebuild, which is ugh.

So we changed the config to build the PR branch, which for PRs opened after the Jenkinsfile was committed, worked fine. But for PRs like yours that were last rebased before the Jenkinsfile was committed, Jenkins is now in a state where it doesn't know how to build this PR. Rebasing will pull in the Jenkinsfile and the test will run again.

@hppritcha
Member Author

bot:aws:retest

@jjhursey
Member

bot:ibm:retest

@hppritcha hppritcha force-pushed the pmix_fence_removal_v3 branch from 5b93b3b to 712f0e4 Compare January 27, 2023 14:56
@hppritcha hppritcha removed the request for review from jsquyres February 21, 2023 16:18
@hppritcha
Member Author

@jjhursey could you review this? I'd like to get it in before the 5.0.0 release.

7 participants