Description
Seen in HEAD of OMPI main branch
I'm honestly not sure when this started as I haven't been tracking it. It was only detected when we were working on the new shared memory implementation for PMIx and couldn't figure out why we were seeing so many fence operations. We finally traced it down to OMPI calling PMIx_Fence on five separate occasions during a simple hello application.
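The test case here is nothing more elaborate than a standard hello-world program; the exact source isn't shown above, but assume something equivalent to the sketch below. Nothing in the application itself ever asks for a fence, so every PMIx_Fence comes out of the OMPI runtime.

```c
/* hello.c - assumed stand-in for the simple hello application; the actual
 * reproducer isn't attached to this report. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
```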
A quick grep of your MPI layer shows the following places where PMIx_Fence is getting called:
```
ompi/runtime/ompi_mpi_finalize.c
284: if (PMIX_SUCCESS != (rc = PMIx_Fence_nb(NULL, 0, NULL, 0, fence_cbfunc, (void*)&active))) {

ompi/runtime/ompi_mpi_init.c
398: if (PMIX_SUCCESS != (rc = PMIx_Fence(NULL, 0, NULL, 0))) {
404: if (PMIX_SUCCESS != (rc = PMIx_Fence(NULL, 0, NULL, 0))) {
429: if( PMIX_SUCCESS != (rc = PMIx_Fence_nb(NULL, 0, NULL, 0,
433: error = "PMIx_Fence_nb() failed";
445: rc = PMIx_Fence_nb(NULL, 0, info, 1, fence_release, (void*)&active);
448: error = "PMIx_Fence() failed";
511: if (PMIX_SUCCESS != (rc = PMIx_Fence_nb(NULL, 0, info, 1,
514: error = "PMIx_Fence_nb() failed";

ompi/instance/instance.c
527: if (PMIX_SUCCESS != (rc = PMIx_Fence(NULL, 0, NULL, 0))) {
532: if (PMIX_SUCCESS != (rc = PMIx_Fence(NULL, 0, NULL, 0))) {
555: if( PMIX_SUCCESS != (rc = PMIx_Fence_nb(NULL, 0, NULL, 0,
559: return ompi_instance_print_error ("PMIx_Fence_nb() failed", ret);
570: rc = PMIx_Fence_nb(NULL, 0, info, 1, fence_release, (void*)&active);
573: return ompi_instance_print_error ("PMIx_Fence() failed", ret);
732: if (PMIX_SUCCESS != (rc = PMIx_Fence_nb(NULL, 0, info, 1,
735: return ompi_instance_print_error ("PMIx_Fence_nb() failed", ret);

ompi/dpm/dpm.c
650: if (PMIX_SUCCESS != (rc = PMIx_Fence(procs, nprocs, NULL, 0))) {
```
We can discount the dpm one as that wasn't involved in this simple app. We know that MPI_Finalize has to call it because some of your transports need it, so that is one. We know MPI_Init calls it twice - once to do the modex and once to provide a barrier at the end of MPI_Init to ensure that all transports are ready. So we should see it three times, roughly as sketched below.
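For anyone counting along, here is a standalone PMIx client - not OMPI code, just an illustration - that issues three job-wide fences in that same order (modex, end-of-init barrier, finalize barrier). Passing NULL/0 for the proc array is what makes each call span the entire namespace, exactly as in the grep output above.

```c
/* three_fences.c - illustrative PMIx client, NOT the OMPI code path.
 * Shows the three job-wide fences a simple hello app is expected to need.
 * Must be launched by a PMIx-enabled runtime for the fences to complete. */
#include <stdio.h>
#include <pmix.h>

int main(void)
{
    pmix_proc_t myproc;
    pmix_status_t rc;

    if (PMIX_SUCCESS != (rc = PMIx_Init(&myproc, NULL, 0))) {
        fprintf(stderr, "PMIx_Init failed: %s\n", PMIx_Error_string(rc));
        return 1;
    }

    /* fence #1: exchange the modex data */
    rc = PMIx_Fence(NULL, 0, NULL, 0);
    /* fence #2: barrier so all transports are ready before MPI_Init returns */
    rc = PMIx_Fence(NULL, 0, NULL, 0);
    /* fence #3: barrier during MPI_Finalize for transports that need it */
    rc = PMIx_Fence(NULL, 0, NULL, 0);

    PMIx_Finalize(NULL, 0);
    return (PMIX_SUCCESS == rc) ? 0 : 1;
}
```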
I added some print statements and found that this "instance.c" code seems to be adding one call of its own, apparently prior to MPI_Init calling "fence", and another during "finalize". I have no idea why it is calling "fence" there. Perhaps someone can look and see if those calls are actually necessary? Can't they be bundled into the other "fence" calls?
Timing tests show this is a pretty significant regression from prior behavior, so it might be worth a little investigation prior to release.
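For reference, the timing comparison needs nothing more than wall-clocking init and finalize in the test program; a harness along these lines (an assumed setup, not the exact benchmark used here) run against the two builds makes the regression visible:

```c
/* init_timing.c - assumed timing harness, not the actual benchmark.
 * Wall-clocks MPI_Init and MPI_Finalize; clock_gettime is used because
 * MPI_Wtime is not available outside the init/finalize window. */
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <time.h>
#include <mpi.h>

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

int main(int argc, char **argv)
{
    int rank;
    double t0, t1, t2, t3;

    t0 = now();
    MPI_Init(&argc, &argv);
    t1 = now();

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    t2 = now();
    MPI_Finalize();
    t3 = now();

    if (0 == rank) {
        printf("MPI_Init: %.6f s   MPI_Finalize: %.6f s\n", t1 - t0, t3 - t2);
    }
    return 0;
}
```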