Collective hanging ibm/allgather on main branch #10318
Comments
What happens if you increase the list of allowed BTLs to "self,sm,tcp"?
Also, the test name implies an allgather on an intercomm; I would not be surprised if the only algorithm is provided by basic.
I wouldn't expect --mca btl tcp to work, since the TCP BTL doesn't support send-to-self. But I would have expected an error if you tried to send to self. Are you sure it's using the OB1 PML at all? Since you're exporting FI_LOG_LEVEL, I assume there's a Libfabric build available.
You're correct, @bwbarrett. I ran with --mca mtl ^ofi and it failed with a different error, which means the hang in my original report was with the OFI MTL. However, I also ran with the "self,sm,tcp" BTLs and it hung as well, so I don't think this is related to the type of communication (MTL/BTL); it is probably a collective or some other issue:
Note that the test also finishes successfully if I change it to -np 4 -n 2.
I think this is the same as #8958 - the problem seems to be with PMIx connect/accept in v5.0.x and main.
Adding a 5.0 tag, since Wei reports this happens in 5.0 as well.
This seems to be a dup of #8958 - should we close this one or the other one?
Doesn't matter; we can close this one if that's convenient for tracking.
I am positive this is the same as #8958, and I have a root cause. Closing. |
Open MPI main branch (918fe01)
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Installed from git clone
If you are building/installing from a git clone, please copy-n-paste the output from
git submodule status
.e4c20e2 3rd-party/openpmix (v1.1.3-3506-ge4c20e22)
9ae73d4d97f843fac994103f2232f6570baaba26 3rd-party/prrte (psrvr-v2.0.0rc1-4350-g9ae73d4d97)
Please describe the system on which you are running
Amazon Linux 2
Details of the problem
Open MPI main branch hangs when running the following ompi-tests test:
I'm pretty sure it's also using the basic collective component, which doesn't seem to be intended; maybe we have a bug in component selection.