Hi, I'm using opal/smsc in a collectives component. When pml/ob1 + btl/sm are used, everything works correctly. However, if ucx is configured as the PML instead, smsc remains uninitialized. I noticed that btl/sm calls mca_smsc_base_select(), so I tried calling that in my code, but later calls to get_endpoint() still fail.
So, how should I go about initializing smsc in my code when it's not initialized elsewhere?
In case the mca_smsc_base_select() call is all that is needed, consider this a bug report :-)
I'm on v5.0.0rc6
To reproduce:
diff --git a/ompi/mca/coll/sm/coll_sm_module.c b/ompi/mca/coll/sm/coll_sm_module.c
index ba3c62ce1c..b89f048f51 100644
--- a/ompi/mca/coll/sm/coll_sm_module.c
+++ b/ompi/mca/coll/sm/coll_sm_module.c
@@ -220,6 +220,7 @@ mca_coll_sm_comm_query(struct ompi_communicator_t *comm, int *priority)
     return &(sm_module->super);
 }
+#include "opal/mca/smsc/base/base.h"
 /*
  * Init module on the communicator
@@ -234,7 +235,21 @@ static int sm_module_enable(mca_coll_base_module_t *module,
                     ompi_comm_print_cid (comm), comm->c_name);
         return OMPI_ERROR;
     }
-
+
+    if(mca_smsc == NULL) {
+        mca_smsc_base_select();
+        printf("smsc base init %s\n", (mca_smsc ? "success" : "fail"));
+    } else
+        printf("smsc already initialized\n");
+
+    int rank = ompi_comm_rank(comm);
+    int comm_size = ompi_comm_size(comm);
+
+    ompi_proc_t *peer = ompi_comm_peer_lookup(comm, (rank + 1) % comm_size);
+    mca_smsc_endpoint_t *smsc_ep = MCA_SMSC_CALL(get_endpoint, &peer->super);
+
+    printf("smsc_ep = %p\n", smsc_ep);
+
     /* We do everything lazily in ompi_coll_sm_enable() */
     return OMPI_SUCCESS;
 }
diff --git a/opal/mca/smsc/xpmem/smsc_xpmem_module.c b/opal/mca/smsc/xpmem/smsc_xpmem_module.c
index 6a3444a35d..4bb688f66c 100644
--- a/opal/mca/smsc/xpmem/smsc_xpmem_module.c
+++ b/opal/mca/smsc/xpmem/smsc_xpmem_module.c
@@ -42,6 +42,8 @@ mca_smsc_endpoint_t *mca_smsc_xpmem_get_endpoint(opal_proc_t *peer_proc)
     OPAL_MODEX_RECV_IMMEDIATE(rc, &mca_smsc_xpmem_component.super.smsc_version,
                               &peer_proc->proc_name, (void **) &modex, &modex_size);
     if (OPAL_UNLIKELY(OPAL_SUCCESS != rc)) {
+        printf("OPAL_MODEX_RECV_IMMEDIATE() failed @ smsc/xpmem get_endpoint\n");
+
         OBJ_RELEASE(endpoint);
         return NULL;
     }
$ mpirun -n 2 --mca coll basic,libnbc,sm --mca coll_sm_priority 100 --mca smsc xpmem --mca pml ucx osu_bcast
smsc base init success
smsc base init success
OPAL_MODEX_RECV_IMMEDIATE() failed @ smsc/xpmem get_endpoint
smsc_ep = (nil)
OPAL_MODEX_RECV_IMMEDIATE() failed @ smsc/xpmem get_endpoint
smsc_ep = (nil)
With pml=ob1, everything works:
smsc already initialized
smsc_ep = 0x2a4f8270
smsc already initialized
smsc_ep = 0x215581e0