coll/han: dynamic selection does not work for simple algorithms #9883
Comments
@mkurnosov Thanks for the report! I'm not quite sure I can follow the problem here. Do you see an error occurring with the command you posted? I ran it locally and the command succeeds for me. The call to |
@devreal the program terminates without an error. In the example above, we expect HAN to call Bcast from the tuned component, but it uses the basic component. This can be caught with a debug printf in basic/bcast. |
This case should be handled in the current code. Check the code at the end of the I don't have your test and I could not find it in ompi-tests. So, let's try the following patch. Please run with
diff --git a/ompi/mca/coll/base/coll_base_comm_select.c b/ompi/mca/coll/base/coll_base_comm_select.c
index fcdb8649eb..db039c2328 100644
--- a/ompi/mca/coll/base/coll_base_comm_select.c
+++ b/ompi/mca/coll/base/coll_base_comm_select.c
@@ -461,6 +461,9 @@ static opal_list_t *check_components(opal_list_t * components,
                 if (0 == strcmp(item->ac_component_name, coll_include[idx])) {
                     opal_list_remove_item(selectable, &item->super);
                     opal_list_append(selectable, &item->super);
+                    opal_output_verbose(10, ompi_coll_base_framework.framework_output,
+                                        "coll:base:comm_select: component %s reordered based on info key (these messages appear in the reverse order)",
+                                        component->mca_component_name);
                     break;
                 }
             } |
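For readers following the patch, the remove-then-append step it instruments can be modeled outside Open MPI. The sketch below is a simplified stand-alone C model, not Open MPI code: component_t, move_to_tail and apply_preference are hypothetical names. It assumes that the entry at the tail of the selectable list takes precedence and that the include list is walked from last to first, which would match the "reverse order" remark in the log message. Exclusions such as ^han are not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one available collective component. */
typedef struct {
    const char *name;
    int priority;
} component_t;

/*
 * Move the entry called `name` to the end of list[0..n-1], keeping the
 * relative order of the remaining entries (remove, then append).
 * Returns 1 if the entry was found and moved, 0 otherwise.
 */
int move_to_tail(component_t *list, size_t n, const char *name)
{
    assert(list != NULL && name != NULL);
    for (size_t i = 0; i < n; i++) {
        if (0 == strcmp(list[i].name, name)) {
            component_t hit = list[i];
            /* Close the gap by shifting the following entries left. */
            memmove(&list[i], &list[i + 1], (n - i - 1) * sizeof(list[0]));
            list[n - 1] = hit;
            return 1;
        }
    }
    return 0;
}

/*
 * Apply an include list to the component list.
 * Assumption: walking the include list backwards leaves the first
 * listed name at the very end, i.e. in the position that wins.
 */
void apply_preference(component_t *list, size_t n,
                      const char *const *include, size_t n_include)
{
    for (size_t k = n_include; k > 0; k--) {
        (void) move_to_tail(list, n, include[k - 1]);
    }
}
```

With the two components discussed in this thread, tuned (priority 30) and basic (priority 90), and an include list containing only tuned, the model leaves tuned at the tail, which is the outcome the info key is meant to produce.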
@bosilca
|
Then it is not clear to me what issue you are trying to solve. The tuned priority is 30 by default and you manually set the priority of basic to 90. The |
I just want to point out that HAN's MCA parameters for module selection are ignored:
|
That's a different issue; it has nothing to do with priorities or with the ompi_comm_coll_preference info key. The dynamic selection of the module should be done by the get_module function in coll_han_dynamic.c, which picks, among all available modules, the one indicated by the user rule. I will try to take a look tomorrow to see why it does not do the right thing. What test are you using? |
I am using a simple Bcast test and a patched ompi with debug printfs in coll/{basic,tuned}:
mpiexec:
|
I've also been looking into HAN's code these days, and was a bit confused by the component selection, especially with the From what I've gathered:
Where it gets strange is that these are the only occurrences of I see how the selection code in Furthermore, setting something like |
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
9704f0f (master)
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
git clone
If you are building/installing from a git clone, please copy-n-paste the output from
git submodule status
.4d07260d9f79bb7f328b1fc9107b45e683cf2c4e ../../../../3rd-party/openpmix (v1.1.3-3319-g4d07260d)
 9ac0b7ecee2c97c357bf6751fdaab7a10e62df14 ../../../../3rd-party/prrte (psrvr-v2.0.0rc1-4133-g9ac0b7ec)
Please describe the system on which you are running
Details of the problem
Dynamic selection provided via MCA parameters does not work for simple algorithms. A simple algorithm (coll_han_use_simple_<op>) splits the global communicator into intra- and inter-node sub-communicators with the HAN component disabled (mca_coll_han_comm_create()):
opal_info_set(&comm_info, "ompi_comm_coll_preference", "tuned,^han");
For this reason, on the sub-communicators the simple algorithm uses the collective operation from the component with the highest priority.
In the following example we want to choose Bcast from the tuned component for intra- and inter-node communication, but the simple algorithm calls Bcast from the basic component (the component with the highest priority).