How to initialize smsc when btl/sm is not used? #10342

Closed
@gkatev

Hi, I'm using opal/smsc in a collectives component. With pml/ob1 and btl/sm, everything works correctly. However, when ucx is configured as the PML, smsc remains uninitialized. I noticed that btl/sm calls mca_smsc_base_select() and tried calling it from my own code, but later calls to get_endpoint() still fail.

So, how should I go about initializing smsc in my code when it's not initialized elsewhere?


If the mca_smsc_base_select() call is all that should be needed, consider this a bug report :-)

I'm on v5.0.0rc6

To reproduce:

diff --git a/ompi/mca/coll/sm/coll_sm_module.c b/ompi/mca/coll/sm/coll_sm_module.c
index ba3c62ce1c..b89f048f51 100644
--- a/ompi/mca/coll/sm/coll_sm_module.c
+++ b/ompi/mca/coll/sm/coll_sm_module.c
@@ -220,6 +220,7 @@ mca_coll_sm_comm_query(struct ompi_communicator_t *comm, int *priority)
     return &(sm_module->super);
 }
 
+#include "opal/mca/smsc/base/base.h"
 
 /*
  * Init module on the communicator
@@ -234,7 +235,21 @@ static int sm_module_enable(mca_coll_base_module_t *module,
                            ompi_comm_print_cid (comm), comm->c_name);
         return OMPI_ERROR;
     }
-
+    
+    if(mca_smsc == NULL) {
+        mca_smsc_base_select();
+        printf("smsc base init %s\n", (mca_smsc ? "success" : "fail"));
+    } else
+        printf("smsc already initialized\n");
+    
+    int rank = ompi_comm_rank(comm);
+    int comm_size = ompi_comm_size(comm);
+    
+    ompi_proc_t *peer = ompi_comm_peer_lookup(comm, (rank + 1) % comm_size);
+    mca_smsc_endpoint_t *smsc_ep = MCA_SMSC_CALL(get_endpoint, &peer->super);
+    
+    printf("smsc_ep = %p\n", smsc_ep);
+    
     /* We do everything lazily in ompi_coll_sm_enable() */
     return OMPI_SUCCESS;
 }
diff --git a/opal/mca/smsc/xpmem/smsc_xpmem_module.c b/opal/mca/smsc/xpmem/smsc_xpmem_module.c
index 6a3444a35d..4bb688f66c 100644
--- a/opal/mca/smsc/xpmem/smsc_xpmem_module.c
+++ b/opal/mca/smsc/xpmem/smsc_xpmem_module.c
@@ -42,6 +42,8 @@ mca_smsc_endpoint_t *mca_smsc_xpmem_get_endpoint(opal_proc_t *peer_proc)
     OPAL_MODEX_RECV_IMMEDIATE(rc, &mca_smsc_xpmem_component.super.smsc_version,
                               &peer_proc->proc_name, (void **) &modex, &modex_size);
     if (OPAL_UNLIKELY(OPAL_SUCCESS != rc)) {
+        printf("OPAL_MODEX_RECV_IMMEDIATE() failed @ smsc/xpmem get_endpoint\n");
+        
         OBJ_RELEASE(endpoint);
         return NULL;
     }

Running with pml=ucx:

$ mpirun -n 2 --mca coll basic,libnbc,sm --mca coll_sm_priority 100 --mca smsc xpmem --mca pml ucx osu_bcast
smsc base init success
smsc base init success
OPAL_MODEX_RECV_IMMEDIATE() failed @ smsc/xpmem get_endpoint
smsc_ep = (nil)
OPAL_MODEX_RECV_IMMEDIATE() failed @ smsc/xpmem get_endpoint
smsc_ep = (nil)

With pml=ob1, all works ok:

smsc already initialized
smsc_ep = 0x2a4f8270
smsc already initialized
smsc_ep = 0x215581e0
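
For illustration, the control flow of the reproducer (select only if nothing is selected yet, look up the next rank's endpoint, check the result for NULL) can be sketched in standalone C. This is a toy model, not Open MPI code: every `demo_*` name is a hypothetical stand-in for `mca_smsc`, `mca_smsc_base_select()` and `get_endpoint`, and the lookup is hard-wired to fail the way it does under pml=ucx. It does not explain or fix the modex failure; it only shows how a caller might detect the NULL endpoint and fall back instead of using it.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the real types; this is NOT the opal/smsc API. */
typedef struct {
    int peer_rank;
} demo_endpoint_t;

typedef struct {
    /* Returns NULL when the peer's connection data is unavailable. */
    demo_endpoint_t *(*get_endpoint)(int peer_rank);
} demo_smsc_module_t;

/* Plays the role of the global module pointer: NULL until a module is selected. */
static demo_smsc_module_t *demo_smsc = NULL;

/* Assumption for the demo: the lookup always fails, as in the pml=ucx run. */
static demo_endpoint_t *demo_failing_get_endpoint(int peer_rank)
{
    (void) peer_rank;
    return NULL;
}

static demo_smsc_module_t demo_module = { demo_failing_get_endpoint };

/* Plays the role of the selection call; returns 0 on success. */
static int demo_select(void)
{
    demo_smsc = &demo_module;
    return 0;
}

/*
 * Returns 1 and stores the endpoint in *ep_out if the single-copy path is
 * usable; returns 0 (with *ep_out set to NULL) if the caller should fall back.
 */
static int demo_enable_single_copy(int rank, int comm_size, demo_endpoint_t **ep_out)
{
    *ep_out = NULL;

    /* Select only when no module has been selected elsewhere. */
    if (NULL == demo_smsc && 0 != demo_select()) {
        return 0;
    }
    if (NULL == demo_smsc) {
        return 0;
    }

    /* Same peer choice as the reproducer: the next rank, wrapping around. */
    int peer = (rank + 1) % comm_size;
    demo_endpoint_t *ep = demo_smsc->get_endpoint(peer);
    if (NULL == ep) {
        return 0; /* endpoint unavailable: do not dereference it */
    }

    *ep_out = ep;
    return 1;
}
```

A caller would invoke `demo_enable_single_copy(rank, comm_size, &ep)` once per communicator and switch to a copy-in/copy-out path whenever it returns 0.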
