
A problem with the btl/sm shared_mem_btl_rndv.HOSTNAME file #1230

@sekifjikkatsu


I think there is a problem in the btl/sm component: a synchronization problem between the node master process and the other (child) processes on the same node.
The child processes can read 0 bytes from the shared_mem_btl_rndv.HOSTNAME file before the node master
process has written sizeof(opal_shmem_ds_t) bytes to it.
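To make the failure window concrete, here is a minimal sketch that replays the problematic interleaving inside one process so that it is deterministic: the "master" has created the rendezvous file but not yet written it when the "child" reads. The struct `fake_shmem_ds_t` and the function `demo_short_read` are hypothetical stand-ins, not Open MPI code. In the real failure the master and the children are separate processes, so whether a child lands in this window depends on scheduling.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for opal_shmem_ds_t; the real struct's layout differs. */
typedef struct {
    int  seg_id;
    char seg_name[64];
} fake_shmem_ds_t;

/*
 * Replays "child reads before master writes" in one process.
 * *early: bytes read before the master's write; *late: bytes read after it.
 * Returns 0 if every system call succeeded, -1 otherwise.
 */
static int demo_short_read(ssize_t *early, ssize_t *late)
{
    assert(early != NULL && late != NULL);
    char path[] = "/tmp/rndv_demoXXXXXX";
    fake_shmem_ds_t in, out = { 42, "demo_segment" };
    int rc = -1;

    int wfd = mkstemp(path);              /* master: file exists, size 0 */
    if (wfd < 0) {
        return -1;
    }
    int rfd = open(path, O_RDONLY);       /* child: opens it right away */
    if (rfd < 0) {
        goto cleanup_w;
    }

    *early = pread(rfd, &in, sizeof in, 0);                    /* child reads too early: EOF */
    if (write(wfd, &out, sizeof out) == (ssize_t)sizeof out) { /* master writes the descriptor */
        *late = pread(rfd, &in, sizeof in, 0);                 /* same read, after the write */
        rc = 0;
    }
    close(rfd);
cleanup_w:
    close(wfd);
    unlink(path);
    return rc;
}
```

With this ordering the early read returns 0 bytes, which is the short read reported above, while the identical read issued after the write returns the whole struct.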
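I think this problem can be fixed with the following patch, which makes sm_segment_attach() poll fstat() until the file has reached sizeof(opal_shmem_ds_t) bytes before reading it: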

Index: ompi/mca/btl/sm/btl_sm.c
===================================================================
--- ompi/mca/btl/sm/btl_sm.c    (revision 4013)
+++ ompi/mca/btl/sm/btl_sm.c    (working copy)
@@ -180,6 +180,7 @@
 static int
 sm_segment_attach(mca_btl_sm_component_t *comp_ptr)
 {
+    struct stat buf;
     int rc = OMPI_SUCCESS;
     int fd = -1;
     ssize_t bread = 0;
@@ -195,6 +196,14 @@
         rc = OMPI_ERR_IN_ERRNO;
         goto out;
     }
+    do {
+        if (0 != fstat(fd,&buf)) {
+            opal_output(0, "sm_segment_attach: "
+                           "fstat errno=%d\n",errno);
+            rc = OMPI_ERROR;
+            goto out;
+        }
+    } while (sizeof(opal_shmem_ds_t) != buf.st_size);
     if ((ssize_t)sizeof(opal_shmem_ds_t) != (bread =
         read(fd, tmp_shmem_ds, sizeof(opal_shmem_ds_t)))) {
         opal_output(0, "sm_segment_attach: "
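The loop in the patch above spins on fstat() without sleeping and without an upper bound, so a child would hang forever if the master never finished writing, and the `!=` test would also never succeed for a file larger than expected. A bounded variant of the same reader-side wait might look like the sketch below. `wait_and_read_rndv`, its retry count and its sleep interval are hypothetical choices for illustration, not Open MPI APIs or tuned values.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper, not Open MPI code: open the rendezvous file at `path`,
 * wait (at most max_tries polls, sleep_ns apart) until it holds at least `len`
 * bytes, then read those bytes into `out`.
 * Returns 0 on success; -1 with errno set on failure
 * (ETIMEDOUT if the file never reached `len` bytes).
 */
static int wait_and_read_rndv(const char *path, void *out, size_t len,
                              int max_tries, long sleep_ns)
{
    assert(max_tries > 0 && sleep_ns >= 0 && sleep_ns < 1000000000L);
    const struct timespec delay = { 0, sleep_ns };
    struct stat buf;
    int rc = -1;

    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        return -1;                       /* errno set by open */
    }
    for (int i = 0; i < max_tries; ++i) {
        if (0 != fstat(fd, &buf)) {
            goto out;                    /* errno set by fstat */
        }
        /* '>=' rather than '!=' so an oversized file cannot stall the loop. */
        if (buf.st_size >= (off_t)len) {
            ssize_t n = pread(fd, out, len, 0);
            if (n == (ssize_t)len) {
                rc = 0;
            } else if (n >= 0) {
                errno = EIO;             /* unexpected short read */
            }
            goto out;
        }
        nanosleep(&delay, NULL);         /* back off instead of busy-waiting */
    }
    errno = ETIMEDOUT;
out:
    {
        int saved = errno;               /* keep errno across close() */
        close(fd);
        errno = saved;
    }
    return rc;
}
```

Sleeping between polls trades a little latency for not burning a core per child while the master is still writing; the caller decides how long to wait before reporting an error.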

This problem occurs in Open MPI 1.8.
It may not occur in Open MPI master.
