Main and v5.0.x hang - prrte #1839 #12064

Closed
wenduwan opened this issue Nov 13, 2023 · 0 comments · Fixed by #12101
wenduwan (Contributor) commented Nov 13, 2023

Background information

We have observed mpirun hanging after the application completes since at least August 2023 (and likely earlier). The hang occurs roughly 5-10% of the time and is reliably reproducible.

The issue does not happen with the Open MPI 4.1.x branch.

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

Main and v5.0.x

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Built from source:

./configure --enable-prte-prefix-by-default --enable-debug

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Tried both the built-in submodule pointers and the PRRTE master branch.
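
For reference, the build followed the usual Open MPI from-git steps; a rough sketch (the install prefix and -j value below are illustrative, not the exact values used):

git clone --recursive https://github.com/open-mpi/ompi.git
cd ompi
git submodule status    # record the 3rd-party pointers (prrte, openpmix, ...)
./autogen.pl
./configure --enable-prte-prefix-by-default --enable-debug --prefix=$HOME/ompi-install
make -j 8 install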

Please describe the system on which you are running

  • Operating system/version: Ubuntu 22.04 (and many other operating systems, e.g. Debian 11, Amazon Linux 2, RHEL 7, etc.)
  • Computer hardware: AWS c5n.18xlarge (x86_64) and c6gn.16xlarge (aarch64)
  • Network type: Single-node shared memory (--mca pml ob1)

Details of the problem

We can reproduce the issue with several OSU micro-benchmark collective benchmarks. It is easier to reproduce on platforms with a higher core count; we typically test on 64 cores.

mpirun --map-by ppr:64:node --tag-output \
	--mca state_base_verbose 10 \
	--mca odls_base_verbose 10 \
	--mca plm_base_verbose 10 \
	--mca pml ob1 \
	osu-micro-benchmarks/mpi/collective/osu_reduce

We observe that the application, i.e. osu_reduce, completes normally, and we have verified that MPI_Finalize returns successfully in all participating processes. However, the mpirun command gets stuck at the end of the benchmark.
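
Because the hang only shows up in roughly 5-10% of runs, a retry loop with a timeout is a convenient way to flag a hung run. A sketch, assuming GNU coreutils timeout; the 120-second limit and 50-run count are arbitrary:

# Exit code 124 from timeout means mpirun did not finish in time, i.e. it hung.
for i in $(seq 1 50); do
	timeout 120 mpirun --map-by ppr:64:node --mca pml ob1 \
		osu-micro-benchmarks/mpi/collective/osu_reduce
	if [ $? -eq 124 ]; then
		echo "run $i: mpirun did not exit within 120 s (hang)"
		break
	fi
done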

References

For more details, see openpmix/prrte#1839.
