Skip to content

coll/HAN: Don't DQ HAN dynamic @ intra-node subcomm + typo fixes #10458

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Sep 29, 2022

Conversation

gkatev
Copy link
Contributor

@gkatev gkatev commented Jun 8, 2022

HAN disables itself when running in a single node, but that shouldn't include the subcommunicator created by HAN-dynamic.
See also #10438.

Tested on v5.0.x
Signed-off-by: George Katevenis [email protected]

@ompiteam-bot
Copy link

Can one of the admins verify this patch?

@gkatev
Copy link
Contributor Author

gkatev commented Jun 8, 2022

@FlorentGermain-Bull

@awlauria
Copy link
Contributor

awlauria commented Jun 8, 2022

ok to test

@bosilca
Copy link
Member

bosilca commented Jun 8, 2022

Fixing the typos is good, but enabling HAN at the node level needs to be backed by some evidence. Why is this necessary ? In which case this leads to improved performance ?

@gkatev
Copy link
Contributor Author

gkatev commented Jun 8, 2022

This doesn't actually allow HAN to run at the node-level, but rather HAN's intra-node subcommunicator. (note the added INTRA_NODE check in the diff)

When mca_coll_han_comm_create_new creates the inter/intra comms, it sets the INTRA_NODE/INTER_NODE info keys. These sub-comms also use HAN (#10456). HAN's dynamic functions detect if the current communicator is a sub-communicator, via the info key (topo_lvl), and they delegate to the respective component.

@bosilca
Copy link
Member

bosilca commented Jun 8, 2022

The first part of the check will only succeed if all processes are local (aka spawned by the same RTE daemon). I don't think that can happen for an INTRA comm.

@gkatev
Copy link
Contributor Author

gkatev commented Jun 8, 2022

This is the path that I have in mind:

  1. App calls Barrier
  2. Enter mca_coll_han_barrier_intra_dynamic, take 3rd if case (topo_lvl = GLOBAL_COMMUNICATOR)
  3. Enter mca_coll_han_barrier_intra_simple
  4. Enter mca_coll_han_comm_create_new
  5. Two communicators are created, both preferring HAN (assuming Coll han: fix allreduce dynamic calling internal han algo on sub_comm #10456 is merged). One communicator has INTRA_NODE set via ompi_comm_coll_han_topo_level, the other INTER_NODE
  • During low comm's creation, han is considered. Without this change, it is disqualified because the communicator spans only one node.
    • With the change, the INTRA_NODE info key is detected, and HAN is chosen for the intra-node sub-comm
  1. Control returns to mca_coll_han_barrier_intra_simple, low_comm's coll_barrier gets called
  2. Enter mca_coll_han_barrier_intra_dynamic, take 4th case (topo_lvl = INTRA_NODE)
  3. Enter the actual submodule that handles the intra-node barrier

@bosilca
Copy link
Member

bosilca commented Jun 8, 2022

In the current code the step 7 will be skipped, and the control is going directly from your step 6 to your step 8. This brings me back to my original question, is there a realistic need to give the control back to HAN on our own communicators ? The only reason I can see, is if we want to support multiple levels, but I might have missed something.

@gkatev
Copy link
Contributor Author

gkatev commented Jun 8, 2022

Oh I understand, I believe step 7 is necessary in order for HAN to decide which component/module should be called for the intra-node level. Unlike the non-dynamic path where this is decided at split-time, in the dynamic it is decided when the collective is called. Without this change, the component that gets used for the intra-node level is the one with the next higer priority, and the coll_han_<coll>_dynamic_intra_node_module parameter has no effect. I will perform some additional tests to make double sure things are indeed the way I describe them.

@gkatev
Copy link
Contributor Author

gkatev commented Jun 9, 2022

I did some extra tests for completeness (2 nodes, 2 ranks each)

Before change:

coll_han_barrier_dynamic_intra_node_module=3 (tuned)
rank 0: Base barrier intra rec.doubling @ INTRA_NODE
rank 0: Base barrier intra rec.doubling @ INTER_NODE
rank 0: Base barrier intra rec.doubling @ INTRA_NODE

coll_han_barrier_dynamic_intra_node_module=1 (basic)
rank 0: Base barrier intra rec.doubling @ INTRA_NODE
rank 0: Base barrier intra rec.doubling @ INTER_NODE
rank 0: Base barrier intra rec.doubling @ INTRA_NODE

After change:

coll_han_barrier_dynamic_intra_node_module=3 (tuned)
rank 0: Base barrier intra rec.doubling @ INTRA_NODE
rank 0: Base barrier intra rec.doubling @ INTER_NODE
rank 0: Base barrier intra rec.doubling @ INTRA_NODE

coll_han_barrier_dynamic_intra_node_module=1 (basic)
rank 0: Base barrier intra linear @ INTRA_NODE
rank 0: Base barrier intra rec.doubling @ INTER_NODE
rank 0: Base barrier intra linear @ INTRA_NODE

The thought occurs, whether HAN's dynamic function should be invoked each time a collective is called, or if the underlying module's function should be called directly instead. I'm not certain what kind of overhead the dynamic functions impose, some things do get cached, so they will only add overhead the first time. I believe we could in HAN-dynamic, instead of just call another function according to the rules, also update the communicator to not go through HAN in future calls.

@FlorentGermain-Bull
Copy link
Contributor

There are two conigurations type in han.

If we only use configuration through MCA parameters, yes we can update function pointers of c_coll in communicator structure at first dynamic call.

If we use a configuration file, module called can vary regarding message size. In that case, updating c_coll function pointers is not possible.

To add more value to dynamic choice of han, there is an example where it is usefull:
Let A and B be two components providing bcast and reduce implementation.
Has its granularity is on module scale, component choice by priority do not allow me to use bcast from A and reduce from B.

@gkatev
Copy link
Contributor Author

gkatev commented Jun 9, 2022

Ah okay make sense, I wasn't aware of the possibility for message-size-based selection

@jsquyres
Copy link
Member

Did the controversial parts of this PR become #10456? Should this PR be reduced to just the typo fixes, or closed if it is fully replaced by #10456?

@gkatev
Copy link
Contributor Author

gkatev commented Jul 17, 2022

No the two PRs don't currently overlap

@jsquyres
Copy link
Member

No the two PRs don't currently overlap

Cool. Where are we on this PR, then?

@gkatev
Copy link
Contributor Author

gkatev commented Jul 19, 2022

IMO, it's ready and good to go. @bosilca, did my explanations above address the concerns?

Edit: To clarify the relation of this PR with #10456:

@awlauria
Copy link
Contributor

@bosilca / @devreal ping - please review.

@awlauria awlauria requested a review from devreal August 19, 2022 15:13
@jsquyres
Copy link
Member

@gkatev Can you rebase this PR so that it picks up the new CI? Thanks!

@jsquyres jsquyres requested review from bosilca and devreal and removed request for bosilca and devreal September 29, 2022 12:24
@awlauria awlauria merged commit 31953c1 into open-mpi:main Sep 29, 2022
@gkatev
Copy link
Contributor Author

gkatev commented Oct 17, 2022

Should we also apply this to 5.0.x? (let me know and I will make a PR)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants