Does ompi_comm_split_type actually pass the "info" to the newly created "newcomm"? #11181
Comments
Hi @jywangx, indeed it's broken, and it essentially renders HAN submodule customization ineffective. See also #10335 (it is the same issue, correct?). I've been using the patch at the end of my OP in that issue until the problem is fixed, and it's been working; I'm not sure whether it's a hack or the proper solution.
Thanks very much for your help❤ @gkatev! I see your comment in #10335 (comment). Does it work to subscribe
I'm not 100% sure about these comments either :-), it's been some time too. As far as I know, the info is currently gone before the coll component's comm_query or module_enable functions are called, so there's no way to get the keys (or subscribe to them?) there. There's also a second, slightly related detail that would matter if the above problem weren't present: unreferenced info keys are deleted at the end of the split/dup function, so you have to get them in comm_query or module_enable, or they will be gone by the time you lazy-initialize. Does this make more sense? I might take another look into this in the coming days, re-read the issues/comments, and see if I arrive at an epiphany about a proper fix (either practical or fully fledged). However, if you need to fix this to get going, the patch at the end of this comment: #10335 (comment) has been working for me.
Okay I will try it. Thanks again!
Should be fixed by #12498 (?)
@gkatev Yes. I tried it on my end and it fixed my problem. Please also verify and close the issue if it's resolved.
Yes, it also worked for me (fyi the original author here is @jywangx)
Thanks @gkatev. Closing this issue.
Description
In file `ompi/communicator/comm.c`, the declaration of function `ompi_comm_split_type` is:
In the end it may call function `ompi_comm_split_type_core`, which performs the common processing for it. The `info` seems to be passed to `ompi_communicator_t **newcomm` by the following code near line 981:
It looks like `opal_infosubscribe_change_info` is needed to read all the information stored in `info` and write it to `newcomm`. But `info` and `newcomm` here are both newly created. In the function `opal_infosubscribe_change_info`, `&newcomp->super->s_info` is set by `opal_info_set` only if the return value of `opal_infosubscribe_inform_subscribers` is not `NULL` and equals a value in `info`.
In function `opal_infosubscribe_inform_subscribers`, the variable `list` is initialized to `NULL` and then assigned by the call `opal_hash_table_get_value_ptr(table, key, strlen(key), (void **) &list);`. As mentioned earlier, `newcomm` and `info` are both newly created, and some keys may not yet exist in `&object->s_subscriber_table`, which leaves `list` still `NULL` after `opal_hash_table_get_value_ptr`, so the return value of this function will be `NULL` too. In this case `info` will not be set on `newcomm`.
In an old version of `ompi_comm_split_type`, the 'info' is set by the following code, and the change occurred in #9097. I'm not sure whether I'm understanding the code correctly, so I'd like to confirm if this is a bug.
Background information
I'm working on collective communication algorithm optimization and need to split the communicator to achieve hierarchical communication. My implementation is similar to `mca_coll_han_comm_create_new` in HAN. I encountered the above problem when passing `info` to a subcomm newly created by `ompi_comm_split_type`.
I was previously developing on ompi version 4.1, and the error occurred after I migrated my code to the main branch. After debugging, the problem seems to be caused by changes to the way `info` is set on `newcomm`.
I also tried to verify whether the same problem exists in the HAN component by inserting some prints into HAN's code:
It seems that the subcomm also cannot get the value of `ompi_comm_coll_han_topo_level` set by `ompi_comm_split_type`. Some of the output is as follows, in order:
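To make the mechanism suspected in the Description concrete, here is a minimal, self-contained toy model in C. It is not Open MPI code: every name prefixed with `toy_` is hypothetical, the subscriber table is a plain array rather than an `opal_hash_table_t`, and the info value is a placeholder. It only mirrors the logic as I understand it from the Description, namely that a key is copied to the new object only when a subscriber list already exists for that key.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_KEYS 8

/* One info entry (key/value pair). Hypothetical stand-in for opal_info_t content. */
typedef struct { const char *key; const char *value; } toy_kv_t;

/* Hypothetical stand-in for an object with a subscriber table and stored info. */
typedef struct {
    const char *subscribed[TOY_MAX_KEYS]; /* keys that have a subscriber list */
    int n_subscribed;
    toy_kv_t stored[TOY_MAX_KEYS];        /* keys actually set on the object */
    int n_stored;
} toy_object_t;

/* Models the lookup described above: returns NULL when no subscriber
 * list exists for the key, otherwise returns the accepted value. */
static const char *toy_inform_subscribers(const toy_object_t *obj,
                                          const char *key, const char *value)
{
    for (int i = 0; i < obj->n_subscribed; i++) {
        if (strcmp(obj->subscribed[i], key) == 0) {
            return value;
        }
    }
    return NULL;
}

/* Models the copy step: a key is stored only if the subscribers returned
 * a non-NULL value equal to the requested one. Returns the number copied. */
static int toy_change_info(toy_object_t *obj, const toy_kv_t *info, int n_info)
{
    assert(obj != NULL && info != NULL);
    int copied = 0;
    for (int i = 0; i < n_info; i++) {
        const char *accepted =
            toy_inform_subscribers(obj, info[i].key, info[i].value);
        if (accepted != NULL && strcmp(accepted, info[i].value) == 0
            && obj->n_stored < TOY_MAX_KEYS) {
            obj->stored[obj->n_stored++] = info[i];
            copied++;
        }
    }
    return copied;
}

/* Usage: apply one info key to a freshly created object, optionally
 * registering a subscriber for that key first. */
static int toy_copy_count(int subscribe_first)
{
    toy_object_t newcomm = {0};
    const toy_kv_t info[] = {
        { "ompi_comm_coll_han_topo_level", "placeholder-value" }
    };
    if (subscribe_first) {
        newcomm.subscribed[newcomm.n_subscribed++] = info[0].key;
    }
    return toy_change_info(&newcomm, info, 1);
}
```

In this model, `toy_copy_count(0)` corresponds to the situation in the Description (a newly created `newcomm` with no subscribers, so the key is dropped), while `toy_copy_count(1)` shows the key surviving once something has subscribed to it beforehand.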