OMPI 5.0.x branch coll HAN introduces a circular dependency when disqualifying itself #11448
Comments
I will be working on this.
What happened was a recursive calling loop, which occurs when Open MPI is built with CUDA support.
When there is only 1 process per node, HAN will disqualify itself and use the fallback. In this case, the fallback is the collective module that was installed before HAN.
The issue is the HAN_LOAD_FALLBACK_COLLECTIVE macro: for a CUDA build, comm->c_coll->coll_xxx_module is coll_cuda_module rather than the HAN module, so the macro did not load the fallback (and did not report an error either).
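For context, a minimal C sketch of the disqualification logic described above, under stated assumptions: the type and function names (module_t, load_fallback_collective, previous_module) are simplified paraphrases invented for illustration, not the actual Open MPI coll/han code. The real macro compares comm->c_coll->coll_xxx_module against the HAN module, and on a CUDA build that comparison fails because coll/cuda has wrapped the collective.

```c
/* Simplified model of the no-op case; names are illustrative only. */
#include <stdio.h>

typedef struct module { const char *name; } module_t;

static module_t han_module      = { "coll_han" };
static module_t cuda_module     = { "coll_cuda" };   /* installed on CUDA builds */
static module_t previous_module = { "previous (fallback) component" };

/* Paraphrase of HAN_LOAD_FALLBACK_COLLECTIVE: the fallback is restored
 * only if the currently installed module is the HAN module itself. */
static void load_fallback_collective(module_t **installed)
{
    if (*installed == &han_module) {
        *installed = &previous_module;   /* normal (non-CUDA) path */
    }
    /* otherwise: silently does nothing -- the no-op case reported here */
}

int main(void)
{
    module_t *installed = &cuda_module;  /* CUDA build: coll/cuda wraps HAN */
    load_fallback_collective(&installed);
    printf("installed module after disqualification: %s\n", installed->name);
    return 0;
}
```

Running this prints the CUDA module's name: the fallback is never restored, so HAN's collective function stays in the call chain and keeps re-entering itself.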
This patch is to address: open-mpi#11448. When Open MPI is compiled with CUDA support, comm->c_coll->coll_xxx_module is coll_cuda_module and HAN_LOAD_FALLBACK_COLLECTIVE is a no-op. As a result, HAN's collective functions can be called even if HAN has been disabled, which resulted in an infinite recursive calling loop. To address this issue, this patch makes HAN's collective functions call the fallback function when the HAN module is disabled. Signed-off-by: Wei Zhang <[email protected]>
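Below is a hedged sketch of the idea described in the commit message, not the actual diff: HAN's collective entry point checks whether the module has been disabled and, if so, forwards directly to the stored fallback function. All names here (struct han_module, han_allreduce, fallback_allreduce, enabled) are invented for illustration.

```c
/* Illustrative sketch of the fix's approach; not the real patch. */
#include <stdio.h>

typedef int (*allreduce_fn)(void);

struct han_module {
    int          enabled;              /* cleared when HAN disqualifies itself */
    allreduce_fn fallback_allreduce;   /* previously installed component's function */
};

static int fallback_allreduce(void)
{
    printf("fallback allreduce runs\n");
    return 0;
}

static int han_allreduce(struct han_module *module)
{
    if (!module->enabled) {
        /* Fix's idea: a disabled HAN module forwards straight to the
         * fallback instead of running (and re-entering) its own path. */
        return module->fallback_allreduce();
    }
    printf("HAN allreduce runs\n");
    return 0;
}

int main(void)
{
    struct han_module disabled = { .enabled = 0,
                                   .fallback_allreduce = fallback_allreduce };
    return han_allreduce(&disabled);   /* forwards to the fallback */
}
```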
Opened #11454 as a fix.
This patch is to address: open-mpi#11448. When Open MPI is compiled with CUDA support, comm->c_coll->coll_xxx_module is coll_cuda_module and HAN_LOAD_FALLBACK_COLLECTIVE is a no-op. As a result, HAN's collective functions can be called even if HAN has been disabled, which resulted in an infinite recursive calling loop. To address this issue, this patch makes HAN's collective functions call the fallback function when the HAN module is disabled. Signed-off-by: Wei Zhang <[email protected]> (cherry picked from commit ffab0a4)
Fixed and backported.
Thank you for taking the time to submit an issue!
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
v5.0.x branch
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Compiled from source with the following configure options:
If you are building/installing from a git clone, please copy-n-paste the output from
git submodule status
Please describe the system on which you are running
Details of the problem
Running MPI_Allreduce() with a CUDA build of Open MPI and 1 process per node leads to a segfault.
To reproduce, compile the OSU Micro-Benchmarks with CUDA support enabled, then run osu_allreduce using 1 process per node.
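A hypothetical reproduction command, assuming the OSU Micro-Benchmarks were built with CUDA support and the binary is ./osu_allreduce; the rank count, mapping, and device flag are placeholders that may differ from the reporter's setup:

```sh
# Illustrative only: --map-by ppr:1:node places one rank per node,
# and "-d cuda" asks the benchmark to use CUDA device buffers.
mpirun -np 2 --map-by ppr:1:node ./osu_allreduce -d cuda
```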