OMPI 5.0.x branch coll HAN introduces a circular dependency when disqualifying itself #11448

Closed
wzamazon opened this issue Feb 27, 2023 · 5 comments

Comments

@wzamazon
Contributor

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v5.0.x branch

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

compiled from source with the following configure options:

./configure --prefix=/xxx/openmpi/v5.0.x/install --with-sge --without-verbs --with-libfabric=/opt/amazon/efa --disable-man-pages --with-libevent=external --with-hwloc=external --enable-cuda --with-cuda=/usr/local/cuda --with-cuda-libdir=/usr/local/cuda/lib64/stubs --disable-builtin-atomics

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

7f6f8db13b42916b27b690b8a3f9e2757ec1417f 3rd-party/openpmix (v4.2.3-8-g7f6f8db1)
 c7b2c715f92495637c298249deb5493e86864ac8 3rd-party/prrte (v3.0.1rc1-36-gc7b2c715f9)
 237ceff1a8ed996d855d69f372be9aaea44919ea config/oac (237ceff)

Please describe the system on which you are running

  • Operating system/version: Amazon Linux 2
  • Computer hardware: AMD EPYC 7R13
  • Network type: EFA

Details of the problem

Running MPI_Allreduce() with a CUDA build of Open MPI and 1 process per node leads to a segfault.

To reproduce, compile the OSU Micro-Benchmarks with CUDA support enabled:

./configure --prefix=/openmpi-v5.0.0rc10/install CC=/openmpi/v5.0.x/install/bin/mpicc CXX=/openmpi/v5.0.x/install/bin/mpicxx --with-cuda=/usr/local/cuda --enable-cuda

then run osu_allreduce using 1 process per node

/openmpi/v5.0.x/install/bin/mpirun \
        -n 2 --hostfile 2instances \
        --map-by ppr:1:node \
        -x FI_HMEM_CUDA_ENABLE_XFER=1 \
        -x PATH \
        /omb/openmpi-v5.0.0rc10/install/libexec/osu-micro-benchmarks/mpi/collective/osu_allreduce
@wzamazon
Contributor Author

I will be working on this.

@wzamazon wzamazon self-assigned this Feb 27, 2023
@wzamazon
Contributor Author

What happened was a recursive call loop between mca_coll_cuda_allreduce and mca_coll_han_allreduce.

When Open MPI is built with CUDA support, MPI_Allreduce calls mca_coll_cuda_allreduce.

mca_coll_cuda_allreduce calls the HAN module's mca_coll_han_allreduce.

When there is only 1 process per node, HAN disqualifies itself and uses its fallback.

In this case, the fallback is mca_coll_cuda_allreduce, which causes the infinite recursion.
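
To make the loop concrete, here is a toy model in C. It is not Open MPI code: the struct, the function names, and the depth guard are all hypothetical and only mirror the call pattern described above. The real code has no guard, so the recursion continues until the stack overflows.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model, not Open MPI code: every name below is hypothetical. */
typedef int (*toy_allreduce_fn)(void *comm, int depth);

struct toy_comm {
    toy_allreduce_fn allreduce;       /* what MPI_Allreduce dispatches to */
    toy_allreduce_fn cuda_underlying; /* function the CUDA wrapper saved */
    int han_enabled;                  /* 0 once HAN has disqualified itself */
};

enum { TOY_MAX_DEPTH = 8, TOY_ERR_RECURSION = -1 };

/* CUDA wrapper: buffer staging is omitted; it forwards to the saved function. */
static int toy_cuda_allreduce(void *c, int depth)
{
    struct toy_comm *comm = c;
    assert(comm->cuda_underlying != NULL);
    if (depth > TOY_MAX_DEPTH) {
        return TOY_ERR_RECURSION; /* demo-only guard so the sketch terminates */
    }
    return comm->cuda_underlying(comm, depth + 1);
}

/* HAN: with one process per node it disqualifies itself and re-dispatches
 * through the communicator, which still points at the CUDA wrapper. */
static int toy_han_allreduce(void *c, int depth)
{
    struct toy_comm *comm = c;
    comm->han_enabled = 0;
    return comm->allreduce(comm, depth + 1);
}

/* Returns TOY_ERR_RECURSION when the two functions keep calling each other. */
static int toy_run_allreduce(void)
{
    struct toy_comm comm = { toy_cuda_allreduce, toy_han_allreduce, 1 };
    return comm.allreduce(&comm, 0);
}
```

Calling toy_run_allreduce() from a main() hits the guard, because each side hands the call straight back to the other.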

@wzamazon
Contributor Author

wzamazon commented Feb 28, 2023

The issue is that the macro HAN_LOAD_FALLBACK_COLLECTIVE does not handle the case where comm->c_coll->coll_xxx_module is not the HAN module, so it did not load the fallback (and it did not report an error either).

For a CUDA build, comm->c_coll->coll_xxx_module is always cuda_coll_module.
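
The silent no-op can be sketched as follows. This is a toy model, not the actual macro: toy_load_fallback and the structs are hypothetical, and the sketch assumes the swap is guarded by a check that HAN currently owns the module slot, as the comment above describes. It also assumes that in a non-CUDA build the slot holds the HAN module.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model, not Open MPI code: all names are hypothetical. */
struct toy_module { const char *name; };

struct toy_coll_table {
    struct toy_module *allreduce_module; /* stands in for coll_xxx_module */
};

/* Installs the fallback only if HAN currently owns the slot.
 * Returns 1 if the swap happened, 0 if it was silently skipped. */
static int toy_load_fallback(struct toy_coll_table *t,
                             struct toy_module *han,
                             struct toy_module *fallback)
{
    assert(t != NULL && han != NULL && fallback != NULL);
    if (t->allreduce_module != han) {
        return 0; /* no swap and no error reported */
    }
    t->allreduce_module = fallback;
    return 1;
}

/* cuda_build != 0 models the CUDA module owning the slot instead of HAN. */
static int toy_swap_happens(int cuda_build)
{
    struct toy_module han = { "han" }, cuda = { "cuda" }, basic = { "basic" };
    struct toy_coll_table table;
    table.allreduce_module = cuda_build ? &cuda : &han;
    return toy_load_fallback(&table, &han, &basic);
}
```

In this model toy_swap_happens(0) performs the swap, while toy_swap_happens(1) leaves the table untouched, so a wrapper that saved HAN's function pointer would keep calling it.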

wzamazon added a commit to wzamazon/ompi that referenced this issue Feb 28, 2023
This patch is to address:
    open-mpi#11448

When Open MPI is compiled with CUDA support,
comm->c_coll->coll_xxx_module is coll_cuda_module and
HAN_LOAD_FALLBACK_COLLECTIVE is a no-op.

As a result, HAN's collective functions can be called
even if HAN has been disabled, which resulted in an infinitely
recursive calling loop.

To address this issue, this patch makes HAN's collective
functions call the fallback function when the HAN module has been
disabled.

Signed-off-by: Wei Zhang <[email protected]>
@wzamazon
Contributor Author

Opened #11454 as a fix.
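
The approach described in the commit message (call the fallback directly when the HAN module is disabled) can be sketched roughly as below. This is a toy model under stated assumptions, not the code from the pull request: the struct layout, the function names, and the call counter are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model, not Open MPI code: all names are hypothetical. */
typedef int (*toy_allreduce_fn)(void *module, int value);

struct toy_han {
    int enabled;                  /* cleared when HAN disqualifies itself */
    toy_allreduce_fn fallback_fn; /* collective saved when HAN was selected */
    void *fallback_module;        /* module that owns fallback_fn */
    int fallback_calls;           /* bookkeeping for the demo only */
};

/* Stand-in for a non-HAN allreduce; it just hands the value back. */
static int toy_basic_allreduce(void *module, int value)
{
    (void)module;
    return value;
}

/* HAN entry point with the guard: a disabled module forwards straight to the
 * saved fallback instead of re-dispatching through the communicator. */
static int toy_han_allreduce_fixed(void *module, int value)
{
    struct toy_han *han = module;
    assert(han->fallback_fn != NULL);
    if (!han->enabled) {
        han->fallback_calls++;
        return han->fallback_fn(han->fallback_module, value);
    }
    /* The hierarchical algorithm would run here; omitted in this sketch. */
    return value;
}

/* CUDA-style wrapper that still holds a pointer to HAN's function. */
static int toy_cuda_allreduce(struct toy_han *han, int value)
{
    return toy_han_allreduce_fixed(han, value);
}

/* Runs one allreduce on a disabled HAN module and reports fallback calls. */
static int toy_fixed_fallback_calls(void)
{
    struct toy_han han = { 0, toy_basic_allreduce, NULL, 0 };
    int result = toy_cuda_allreduce(&han, 42);
    assert(result == 42);
    return han.fallback_calls;
}
```

Because the disabled path never goes back through the communicator's function table, the call chain ends after a single fallback call even though the wrapper still points at HAN.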

wzamazon added a commit to wzamazon/ompi that referenced this issue Mar 2, 2023
@wckzhang wckzhang changed the title from "OMPI 5.0.x branch CUDA build MPI_Allreduce segfault with 1 process per rank" to "OMPI 5.0.x branch coll HAN introduces a circular dependency when disqualifying itself" Mar 7, 2023
wzamazon added a commit to wzamazon/ompi that referenced this issue Mar 8, 2023
(cherry picked from commit ffab0a4)
@wzamazon
Contributor Author

fixed and backported

boi4 pushed a commit to boi4/ompi that referenced this issue Mar 23, 2023
yli137 pushed a commit to yli137/ompi that referenced this issue Jan 10, 2024