coll/han: call fallback function when HAN module is disabled #11454

Merged 2 commits on Mar 8, 2023

wzamazon
Contributor

This patch is to address:
#11448

When Open MPI is compiled with CUDA support,
comm->c_coll->coll_xxx_module is coll_cuda_module and HAN_LOAD_FALLBACK_COLLECTIVE is a no-op.

As a result, HAN's collective functions can be called even if HAN has been disabled, which results in an infinitely recursive call loop.

To address this issue, this patch makes HAN's collective functions call the fallback function when the HAN module has been disabled.
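
The failure mode described above can be sketched with a small toy model. This is not Open MPI code: every type and function name below is hypothetical, and the sketch assumes the restore step only swaps the communicator's table entry when that entry still points at the HAN-like module, which is why it becomes a no-op once a wrapper (in the style of the CUDA layer) sits on top. A depth guard stands in for the unbounded recursion so the example terminates.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: every type and function name below is hypothetical and
 * does not mirror Open MPI's real data structures. */
typedef struct module module_t;
typedef struct comm comm_t;
typedef int (*bcast_fn_t)(comm_t *comm, module_t *self, int depth);

struct module {
    int disabled;          /* set once the module has withdrawn itself */
    int use_fix;           /* 1: call the saved fallback directly */
    bcast_fn_t next_fn;    /* collective installed before this module */
    module_t *next_module;
};

struct comm {
    bcast_fn_t bcast;      /* current top-of-stack collective */
    module_t *bcast_module;
};

enum { MAX_DEPTH = 16, LOOP_DETECTED = -1 };

/* Bottom of the stack: reports the call depth at which it was reached. */
static int basic_bcast(comm_t *comm, module_t *self, int depth)
{
    (void)comm;
    (void)self;
    return depth;
}

/* Wrapper layer: always forwards to the collective it saved. */
static int wrapper_bcast(comm_t *comm, module_t *self, int depth)
{
    if (depth > MAX_DEPTH) return LOOP_DETECTED;
    return self->next_fn(comm, self->next_module, depth + 1);
}

/* Assumed restore step: swaps the table entry only if it still points at us. */
static void load_fallback(comm_t *comm, module_t *self)
{
    if (comm->bcast_module == self) {
        comm->bcast = self->next_fn;
        comm->bcast_module = self->next_module;
    } /* otherwise a no-op, as when the wrapper sits on top */
}

static int han_like_bcast(comm_t *comm, module_t *self, int depth)
{
    if (depth > MAX_DEPTH) return LOOP_DETECTED;
    if (self->disabled) {
        if (self->use_fix) {
            /* Patched behaviour: go straight to the saved fallback. */
            return self->next_fn(comm, self->next_module, depth + 1);
        }
        /* Unpatched behaviour: re-dispatch through the communicator table. */
        load_fallback(comm, self);
        return comm->bcast(comm, comm->bcast_module, depth + 1);
    }
    return depth; /* normal hierarchical path elided */
}

/* Builds wrapper -> HAN-like -> basic with the middle module disabled and
 * returns the depth at which the basic collective ran, or LOOP_DETECTED. */
int run_disabled_bcast(int use_fix)
{
    module_t basic = {0, 0, NULL, NULL};
    module_t han = {1, use_fix, basic_bcast, &basic};
    module_t wrapper = {0, 0, han_like_bcast, &han};
    comm_t comm = {wrapper_bcast, &wrapper};
    return comm.bcast(&comm, comm.bcast_module, 0);
}
```

In this model, `run_disabled_bcast(0)` bounces between the wrapper and the disabled module until the guard trips, while `run_disabled_bcast(1)` reaches the basic collective after two hops.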

Signed-off-by: Wei Zhang <[email protected]>
Contributor Author

wzamazon commented Mar 2, 2023

@bosilca Can you review?

bosilca force-pushed the coll_han_fix_infinite_loop branch from ad645fc to 8345dae on March 8, 2023 21:16
Instead of calling the communicator collective module after
disqualifying HAN, call the fallback collective directly. This avoids
circular dependencies between modules in some cases. However, while this
solution works, it delivers suboptimal performance, because the withdrawn
module remains in the call chain and behaves as a passthrough.

Signed-off-by: George Bosilca <[email protected]>
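
The passthrough cost mentioned in this commit message can be illustrated with a minimal sketch. The names here (`layer_t`, `withdrawn_reduce`, `passthrough_hops`) are hypothetical and are not part of Open MPI; the sketch only assumes that a withdrawn module keeps its slot in the call chain and forwards every call to its saved fallback.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy types; not Open MPI's real structures. */
typedef struct layer layer_t;
typedef int (*reduce_fn_t)(layer_t *self, int value);

struct layer {
    int disabled;           /* 1 once the layer has withdrawn itself */
    long hops;              /* how many times this layer was entered */
    reduce_fn_t fallback_fn;
    layer_t *fallback_layer;
};

/* Stand-in for the collective that actually does the work. */
static int fallback_reduce(layer_t *self, int value)
{
    self->hops++;
    return value * 2;
}

/* A withdrawn layer still gets entered, then forwards to its fallback. */
static int withdrawn_reduce(layer_t *self, int value)
{
    self->hops++;
    if (self->disabled) {
        return self->fallback_fn(self->fallback_layer, value); /* passthrough */
    }
    return value; /* optimized path elided */
}

/* Issues n collectives through a disabled layer and returns how many times
 * that layer was entered on the way to the fallback. */
long passthrough_hops(int n)
{
    layer_t base = {0, 0, NULL, NULL};
    layer_t top = {1, 0, fallback_reduce, &base};
    for (int i = 0; i < n; i++) {
        (void)withdrawn_reduce(&top, i);
    }
    return top.hops;
}
```

Every collective pays one extra function call in the withdrawn layer, which is correct but not free; removing the layer from the chain entirely would avoid that hop.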
bosilca force-pushed the coll_han_fix_infinite_loop branch from 8345dae to 00de8b7 on March 8, 2023 21:17
Contributor Author

wzamazon commented Mar 8, 2023

I can confirm it fixed the issue.

wckzhang merged commit e0bf553 into open-mpi:main on Mar 8, 2023
4 participants