Improvements and fixes for coll/HAN #10438
Comments
@bosilca Any thoughts on this?
For now, we have two similar implementations in the han component.
Implementation 1 is the historical one. We extended han's topology-aware capabilities on our side: we added support for n levels with implementation 2, and we are considering pushing it to the GitHub repo, but we do not know when yet.
Thanks for the responses here and in the PR, I now better understand what goes on in HAN!

I presume that the idea in all our minds is that at some point the dynamic path will supersede the historical one. The dynamic path is not currently a match for some of the historical implementations in HAN, i.e. those that use non-blocking collectives, and this is the reason that it has not yet replaced the old path. From what I understand, there's not a great technical obstacle behind making han-dynamic handle non-blocking collectives; one "speedbump" that @FlorentGermain-Bull mentioned in our call is that base's

My suggestions (1) and (2) from the original post still stand, and I suppose they should be integrated into the dynamic path rather than the old one. Regarding (1), you mention that you have implemented selection by name instead of ID in the configuration file; does that also mean that any component can be arbitrarily chosen only via its name, or does the fixed range of selections remain (enum

At the moment the dynamic path is effectively disabled, to work around a bug that existed in it. I plan to take a look into re-enabling it and fixing the bug that results in #8248, along with some other small ones I have noticed, and submitting PRs. Or maybe it is preferable that it remains disabled until it completely replaces the historical one, to avoid confusing users about which path is used for each collective?
To extend the dynamic part of han to non-blocking collectives, extending the COLLTYPE_T enum to the non-blocking collectives and writing dynamic functions for them should be enough. To support them by name,

The fix we talked about yesterday for #8248: it was probably a copy-paste or merging error, sorry for that one: eeb479e

(1) I think we can manage components without the COMPONENT_T enum: components are identified when stored in the module_list (

(2) For now, we use the MCA parameters' default values as the default behaviour. I do not know how to detect whether the user has explicitly set an MCA parameter to its default value.
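The enum extension described above could look roughly like the following. This is only a sketch under assumptions: the sketch_* identifiers, the three collectives shown, and the name table are hypothetical and are not taken from the HAN sources; the real COLLTYPE_T has its own entries and ordering.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a collective-type enum: the blocking entries
 * come first, the non-blocking ones are appended, and the last entry is
 * a sentinel that doubles as the "unknown" value. */
typedef enum {
    SKETCH_BCAST = 0,
    SKETCH_REDUCE,
    SKETCH_ALLREDUCE,
    SKETCH_IBCAST,      /* first non-blocking entry */
    SKETCH_IREDUCE,
    SKETCH_IALLREDUCE,
    SKETCH_COLLCOUNT    /* sentinel, also returned for unknown names */
} sketch_colltype_t;

/* Name table indexed by the enum, so that a configuration file can refer
 * to collectives by name rather than by numeric ID. */
static const char *const sketch_coll_names[SKETCH_COLLCOUNT] = {
    [SKETCH_BCAST]      = "bcast",
    [SKETCH_REDUCE]     = "reduce",
    [SKETCH_ALLREDUCE]  = "allreduce",
    [SKETCH_IBCAST]     = "ibcast",
    [SKETCH_IREDUCE]    = "ireduce",
    [SKETCH_IALLREDUCE] = "iallreduce",
};

/* Map a collective name to its enum value; SKETCH_COLLCOUNT if unknown. */
sketch_colltype_t sketch_coll_name_to_id(const char *name)
{
    if (NULL == name) {
        return SKETCH_COLLCOUNT;
    }
    for (int i = 0; i < SKETCH_COLLCOUNT; i++) {
        if (0 == strcmp(name, sketch_coll_names[i])) {
            return (sketch_colltype_t) i;
        }
    }
    return SKETCH_COLLCOUNT;
}

/* True for the entries appended after the blocking collectives. */
int sketch_coll_is_nonblocking(sketch_colltype_t id)
{
    return id >= SKETCH_IBCAST && id < SKETCH_COLLCOUNT;
}
```

Appending the non-blocking entries after the blocking ones keeps the existing numeric IDs stable, which matters if configuration files still refer to collectives by number.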
I see the fix, yes. Will you make a PR?

(1) Yes, I believe that since the component names are available in

(2) I think we can achieve this by having NULL by default for the primitive-specific parameters. So in the HAN component, set default values
PR #10456 created (I know a Python script has rejected it, I'm checking why).

(1) Yes, exactly.

(2) Another issue with having a global MCA parameter is that a single component can be used for all the collectives only if it provides an implementation for every one of them. In particular, if we extend han to non-blocking collectives, there is no component that provides an implementation for both blocking and non-blocking collectives.
Perfect (the signed-off-by thing is missing).

(2) Yes, it could get a bit confusing. Perhaps another approach would be for all MCA parameters (including the non-coll-specific ones) to remain unset (NULL) by default. In this approach, the first non-han component that implements each collective, i.e. the one with the highest priority, would be chosen for the subcommunicator. And the parameters would be there to override and fine-tune, with the responsibility for correct choice lying with the tuner.

(In any case, (2) is a less important item.)
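The selection order discussed for (2) can be sketched as follows. This is a minimal illustration, not HAN code: sketch_pick_module and its arguments are hypothetical, and the priority-sorted list of component names stands in for whatever the framework actually provides for a given collective.

```c
#include <stddef.h>
#include <string.h>

/* Pick the component name to use for one collective on a subcommunicator.
 *
 * coll_specific: value of the primitive-specific parameter, NULL if unset
 * global:        value of the non-primitive-specific parameter, NULL if unset
 * by_priority:   names of the components that implement this collective,
 *                sorted by descending priority (assumed input)
 * n:             number of entries in by_priority
 *
 * Returns the chosen name, or NULL if nothing is usable. */
const char *sketch_pick_module(const char *coll_specific,
                               const char *global,
                               const char *const *by_priority,
                               size_t n)
{
    /* 1. The primitive-specific parameter overrides everything else. */
    if (NULL != coll_specific) {
        return coll_specific;
    }
    /* 2. Otherwise fall back to the global parameter, if the user set it. */
    if (NULL != global) {
        return global;
    }
    /* 3. Both unset: take the highest-priority component that is not han. */
    for (size_t i = 0; i < n; i++) {
        if (0 != strcmp(by_priority[i], "han")) {
            return by_priority[i];
        }
    }
    return NULL;
}
```

With NULL as the default, "unset" and "explicitly set to the default value" no longer need to be told apart, which is the detection problem raised earlier in the thread.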
Improve the module selection for the up and low collective modules to allow the, more user-friendly, use of the module name in addition to module number. This is a partial fix for open-mpi#10438 Signed-off-by: George Bosilca <[email protected]>
Improve the module selection for the up and low collective modules to allow the, more user-friendly, use of the module name in addition to module number. This is a partial fix for open-mpi#10438 Signed-off-by: George Bosilca <[email protected]> (cherry picked from commit e572aee)
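Accepting a module name in addition to a module number, as these commits describe, could be done along the following lines. This is a hedged sketch rather than the code from the commit: sketch_parse_module and the component table (including its order) are assumptions made for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical table of selectable components; the index is the numeric ID. */
static const char *const sketch_components[] = {
    "self", "basic", "libnbc", "tuned", "sm", "adapt"
};
#define SKETCH_NUM_COMPONENTS \
    (sizeof(sketch_components) / sizeof(sketch_components[0]))

/* Parse a parameter value that is either a numeric ID ("4") or a
 * component name ("sm"). Returns the ID, or -1 if it is not recognized. */
int sketch_parse_module(const char *spec)
{
    if (NULL == spec || '\0' == *spec) {
        return -1;
    }

    /* First try to read the whole string as a decimal number. */
    char *end = NULL;
    errno = 0;
    long id = strtol(spec, &end, 10);
    if (0 == errno && end != spec && '\0' == *end) {
        return (id >= 0 && id < (long) SKETCH_NUM_COMPONENTS) ? (int) id : -1;
    }

    /* Not a pure number: look the string up as a component name. */
    for (size_t i = 0; i < SKETCH_NUM_COMPONENTS; i++) {
        if (0 == strcmp(spec, sketch_components[i])) {
            return (int) i;
        }
    }
    return -1;
}
```

Trying the numeric form first keeps existing configurations that use IDs working unchanged, while names become an additional, more readable option.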
Hi, I have been looking into the HAN collective component, and would like to suggest some usability improvements and some fixes. I was planning on implementing these improvements (or some/most of them) and submitting PRs myself. So, in this issue, I'm looking for the "green light" that these suggestions are desirable, or any ideas/comments regarding them, or to know if someone else is already working on them or something similar.
(1) I suggest adjusting the component choice to be based on the name (string) of the collective component to utilize, and removing the fixed selections. This will allow easier tuning (strings instead of IDs), and the possibility to use any component for each comm, without code modification. Example:
--mca coll_han_bcast_up_module adapt --mca coll_han_bcast_low_module sm

(2) Currently, parameters are in the form of coll_han_<coll>_up_module, coll_han_<coll>_down_module, coll_han_<coll>_segsize, coll_han_use_simple_<coll>. While keeping these, example of addition: coll_han_up_module, coll_han_down_module. The primitive-specific parameters would override the new non-primitive-specific parameter, if set.

(3) In the context of (1) and (2), I would also seek to unify mca_coll_han_comm_create() and mca_coll_han_comm_create_new() (?).

FYI, for anyone working on HAN, I believe that #10335 also affects (?) the ompi_comm_coll_preference info key that is used to influence the component selection for each subcomm.