btl/sm-opal/shmem finalize order & shmem unlink error #11123

gkatev (Contributor) opened this issue Nov 29, 2022 · 0 comments

Hi, I'd like to report two low-impact issues related to the finalization of mmap shmem segments.

OpenMPI main (#4b39d07)
$ git submodule status
 250004266bc046c6303c8531ababdff4e1237525 ../../../../3rd-party/openpmix (v1.1.3-3661-g25000426)
 ca2bf3aeab38261ae7c88cea64bc782c949bd76e ../../../../3rd-party/prrte (psrvr-v2.0.0rc1-4517-gca2bf3aeab)
 5c8de3d97b763bf8981fb49cbedd36e201b8fc0a ../../../../config/oac (5c8de3d)

Issue 1: btl/sm and opal/shmem finalization order

It looks like btl/sm is finalized after the components in opal/shmem, so sm_finalize() is called after the shmem component has already been closed, and the calls to opal_shmem_unlink and opal_shmem_segment_detach therefore have no effect.
I'm not sure how straightforward it might be to resolve this, e.g. by delaying the finalization of opal/shmem.
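For reference, the base wrappers guard on opal_shmem_base_selected, so once the framework is closed the calls quietly fail instead of unlinking anything. Roughly (a paraphrase of opal/mca/shmem/base/shmem_base_wrappers.c; the exact name of the module hook is from memory):

int opal_shmem_unlink(opal_shmem_ds_t *ds_buf)
{
    if (!opal_shmem_base_selected) {
        /* framework already closed: nothing is unlinked, the backing file stays */
        return OPAL_ERROR;
    }

    /* dispatch to the selected module (mmap by default) */
    return opal_shmem_base_module->unlink(ds_buf);
}

opal_shmem_segment_detach behaves the same way, which is what the "NOT SELECTED" lines in the output below demonstrate.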

To reproduce:

diff --git a/opal/mca/btl/sm/btl_sm_module.c b/opal/mca/btl/sm/btl_sm_module.c
index 7835742e4f..eb83ff9af4 100644
--- a/opal/mca/btl/sm/btl_sm_module.c
+++ b/opal/mca/btl/sm/btl_sm_module.c
@@ -345,6 +345,8 @@ static int sm_finalize(struct mca_btl_base_module_t *btl)
     free(component->fbox_in_endpoints);
     component->fbox_in_endpoints = NULL;
 
+    printf("sm_finalize() do unlink/detach\n");
+
     opal_shmem_unlink(&mca_btl_sm_component.seg_ds);
     opal_shmem_segment_detach(&mca_btl_sm_component.seg_ds);
 
diff --git a/opal/mca/shmem/base/shmem_base_close.c b/opal/mca/shmem/base/shmem_base_close.c
index 415ea6c22e..0e69ee721c 100644
--- a/opal/mca/shmem/base/shmem_base_close.c
+++ b/opal/mca/shmem/base/shmem_base_close.c
@@ -31,6 +31,10 @@
 /* ////////////////////////////////////////////////////////////////////////// */
 int opal_shmem_base_close(void)
 {
+    printf("opal_shmem_base_close()\n");
+
     /* if there is a selected shmem module, finalize it */
     if (NULL != opal_shmem_base_module && NULL != opal_shmem_base_module->module_finalize) {
         opal_shmem_base_module->module_finalize();
diff --git a/opal/mca/shmem/base/shmem_base_wrappers.c b/opal/mca/shmem/base/shmem_base_wrappers.c
index b1b0c02f6e..5e8827151e 100644
--- a/opal/mca/shmem/base/shmem_base_wrappers.c
+++ b/opal/mca/shmem/base/shmem_base_wrappers.c
@@ -59,6 +59,8 @@ void *opal_shmem_segment_attach(opal_shmem_ds_t *ds_buf)
 int opal_shmem_segment_detach(opal_shmem_ds_t *ds_buf)
 {
     if (!opal_shmem_base_selected) {
+        printf("NOT SELECTED\n");
+
         return OPAL_ERROR;
     }
 
@@ -69,6 +71,8 @@ int opal_shmem_segment_detach(opal_shmem_ds_t *ds_buf)
 int opal_shmem_unlink(opal_shmem_ds_t *ds_buf)
 {
     if (!opal_shmem_base_selected) {
+        printf("NOT SELECTED\n");
+
         return OPAL_ERROR;
     }

$ mpirun -n 2 --mca btl sm,self --output tag osu_bcast -m 4:4 2>&1 | grep 0]
[mpirun-gkpc-701889@1,0]<stdout>: 
[mpirun-gkpc-701889@1,0]<stdout>: # OSU MPI_Bcast (data-varying) v7.0
[mpirun-gkpc-701889@1,0]<stdout>: # Size       Avg Latency(us)
[mpirun-gkpc-701889@1,0]<stdout>: root = 0
[mpirun-gkpc-701889@1,0]<stdout>: 4                       0.40
[mpirun-gkpc-701889@1,0]<stdout>: opal_shmem_base_close()
[mpirun-gkpc-701889@1,0]<stdout>: sm_finalize() do unlink/detach
[mpirun-gkpc-701889@1,0]<stdout>: NOT SELECTED
[mpirun-gkpc-701889@1,0]<stdout>: NOT SELECTED

However, there appears to be a second issue lurking:

Issue 2: error unlinking mmap backing file

If we further do something like this to temporarily work around issue 1:

diff --git a/opal/mca/shmem/base/shmem_base_close.c b/opal/mca/shmem/base/shmem_base_close.c
index 415ea6c22e..32cae9d021 100644
--- a/opal/mca/shmem/base/shmem_base_close.c
+++ b/opal/mca/shmem/base/shmem_base_close.c
@@ -31,6 +31,9 @@
 /* ////////////////////////////////////////////////////////////////////////// */
 int opal_shmem_base_close(void)
 {
+    printf("opal_shmem_base_close() (fake)\n");
+    return OPAL_SUCCESS;
+
     /* if there is a selected shmem module, finalize it */
     if (NULL != opal_shmem_base_module && NULL != opal_shmem_base_module->module_finalize) {
         opal_shmem_base_module->module_finalize();

$ mpirun -n 2 --mca btl sm,self --output tag osu_bcast_dv -m 4:4 2>&1 | grep 0]
[mpirun-gkpc-712148@1,0]<stdout>: 
[mpirun-gkpc-712148@1,0]<stdout>: # OSU MPI_Bcast (data-varying) v7.0
[mpirun-gkpc-712148@1,0]<stdout>: # Size       Avg Latency(us)
[mpirun-gkpc-712148@1,0]<stdout>: root = 0
[mpirun-gkpc-712148@1,0]<stdout>: 4                       0.36
[mpirun-gkpc-712148@1,0]<stdout>: opal_shmem_base_close() (fake)
[mpirun-gkpc-712148@1,0]<stdout>: sm_finalize() do unlink/detach
[mpirun-gkpc-712148@1,0]<stderr>: --------------------------------------------------------------------------
[mpirun-gkpc-712148@1,0]<stderr>: A system call failed during shared memory initialization that should
[mpirun-gkpc-712148@1,0]<stderr>: not have.  It is likely that your MPI job will now either abort or
[mpirun-gkpc-712148@1,0]<stderr>: experience performance degradation.
[mpirun-gkpc-712148@1,0]<stderr>: 
[mpirun-gkpc-712148@1,0]<stderr>:   Local host:  gkpc
[mpirun-gkpc-712148@1,0]<stderr>:   System call: unlink(2) /dev/shm/sm_segment.gkpc.1000.16b00001.0
[mpirun-gkpc-712148@1,0]<stderr>:   Error:       No such file or directory (errno 2)
[mpirun-gkpc-712148@1,0]<stderr>: --------------------------------------------------------------------------

This is not specific to btl/sm; I initially stumbled upon it in my own coll component. It looks as if the /dev/shm backing files somehow get deleted before the point where they are supposed to be unlinked (this won't normally trigger because of issue 1). Now, if I put on my mad-debugger glasses (TM) and take a shot at finding out where something like this might happen:

If I place a while(1) {} before the call to PMIx_Finalize here:

int ompi_rte_finalize(void)
{
    /* shutdown pmix */
    PMIx_Finalize(NULL, 0);

and inspect the contents of /dev/shm while it's hanging, I see sm's backing files as expected. If I instead move the hang-loop to after the call to PMIx_Finalize and do the same thing, the files are gone. Might PMIx be removing these files somehow?
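To spell the experiment out (a debugging sketch; everything else in ompi_rte_finalize is omitted):

int ompi_rte_finalize(void)
{
    /* while (1) {}      <-- hang-loop HERE: /dev/shm/sm_segment.* are still present */

    /* shutdown pmix */
    PMIx_Finalize(NULL, 0);

    /* while (1) {}      <-- hang-loop HERE: the sm_segment.* files are already gone */

    /* ... rest of the function ... */
}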

If I follow the rabbit hole a bit, it leads to this call: https://github.com/openpmix/openpmix/blob/250004266bc046c6303c8531ababdff4e1237525/src/client/pmix_client.c#L1075
Placing the hang-loop before/after it triggers/untriggers the behavior. (It doesn't really look like this call unlinks backing files, I know; perhaps it triggers the actual code that does?)

Edit: It appears to be happening here: https://github.com/openpmix/openpmix/blob/250004266bc046c6303c8531ababdff4e1237525/src/include/pmix_globals.c#L521

diff --git a/src/include/pmix_globals.c b/src/include/pmix_globals.c
index 5ad717be..c047dbe5 100644
--- a/src/include/pmix_globals.c
+++ b/src/include/pmix_globals.c
@@ -518,6 +518,7 @@ void pmix_execute_epilog(pmix_epilog_t *epi)
                                     (unsigned long) epi->gid);
                 continue;
             }
+            printf("PMIX UNLINK %s\n", tmp[n]);
             rc = unlink(tmp[n]);
             if (0 != rc) {
                 pmix_output_verbose(10, pmix_globals.debug_output, "File %s failed to unlink: %d",

With this patch, the run prints:

PMIX UNLINK /dev/shm/sm_segment.gkpc.1000.89e00001.0
PMIX UNLINK /dev/shm/sm_segment.gkpc.1000.89e00001.1
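
For what it's worth (an educated guess, not verified): pmix_execute_epilog only unlinks paths that were previously registered for cleanup with PMIx, so presumably the sm backing file gets registered somewhere, e.g. via PMIx_Job_control_nb with the PMIX_REGISTER_CLEANUP attribute, and PMIx then removes it during finalize before btl/sm's own unlink runs. A minimal sketch of such a registration (the path and the place where this would be called are hypothetical):

#include <pmix.h>

/* Callback for the non-blocking job-control request; real code would wait on it
 * before releasing the info loaded below. */
static void cleanup_cbfunc(pmix_status_t status, pmix_info_t *info, size_t ninfo,
                           void *cbdata, pmix_release_cbfunc_t release_fn,
                           void *release_cbdata)
{
    if (NULL != release_fn) {
        release_fn(release_cbdata);
    }
}

static void register_file_cleanup(const char *path)
{
    pmix_info_t info;

    /* Ask PMIx to remove this file when the process terminates; anything
     * registered this way ends up on the epilog list walked above. */
    PMIX_INFO_LOAD(&info, PMIX_REGISTER_CLEANUP, path, PMIX_STRING);

    /* NULL targets -> the request applies to the caller */
    PMIx_Job_control_nb(NULL, 0, &info, 1, cleanup_cbfunc, NULL);
}

/* e.g. register_file_cleanup("/dev/shm/sm_segment.gkpc.1000.89e00001.0"); */

If that is indeed what happens, the errno 2 above is simply the second unlink of a file PMIx already removed.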