Collective hanging ibm/allgather on main branch #10318


Closed
wckzhang opened this issue Apr 25, 2022 · 10 comments

Comments

@wckzhang
Contributor

Open MPI main branch (918fe01)

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Installed from git clone

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

e4c20e2 3rd-party/openpmix (v1.1.3-3506-ge4c20e22)
9ae73d4d97f843fac994103f2232f6570baaba26 3rd-party/prrte (psrvr-v2.0.0rc1-4350-g9ae73d4d97)

Please describe the system on which you are running

Amazon Linux 2


Details of the problem

Open MPI main branch hangs when running the following ompi-tests test:

mpirun -np 2 -N 1 --mca btl tcp  --mca coll_base_verbose 100  --hostfile ~/hostfile -x FI_LOG_LEVEL=warn ~/ompi-tests/ibm/collective/intercomm/allgather_inter

I'm fairly sure it is also using the basic collective component, which does not seem intended; we may have a bug in component selection.
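
For reference, a minimal standalone sketch of an allgather over an intercommunicator, purely illustrative of the pattern the test name suggests (this is not the actual ompi-tests/ibm/collective/intercomm/allgather_inter source; the file name and structure are assumptions):

/* allgather_inter_sketch.c - hypothetical reproducer, not the ibm test.
 * Splits MPI_COMM_WORLD into two groups, builds an intercommunicator,
 * and performs MPI_Allgather across it. Run with at least 2 ranks. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int wrank, wsize;
    MPI_Comm_rank(MPI_COMM_WORLD, &wrank);
    MPI_Comm_size(MPI_COMM_WORLD, &wsize);
    if (wsize < 2) {
        fprintf(stderr, "needs at least 2 ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* Split the world into two groups by rank parity. */
    int color = wrank % 2;
    MPI_Comm local;
    MPI_Comm_split(MPI_COMM_WORLD, color, wrank, &local);

    /* Build an intercommunicator; the remote leader is the lowest
     * world rank of the other group (world rank 0 or 1). */
    MPI_Comm inter;
    MPI_Intercomm_create(local, 0, MPI_COMM_WORLD,
                         (color == 0) ? 1 : 0, /* tag */ 42, &inter);

    /* On an intercommunicator, each rank gathers one element from
     * every rank of the remote group. */
    int remote_size;
    MPI_Comm_remote_size(inter, &remote_size);

    int sendval = wrank;
    int *recvbuf = malloc(remote_size * sizeof(int));
    MPI_Allgather(&sendval, 1, MPI_INT, recvbuf, 1, MPI_INT, inter);

    printf("world rank %d gathered %d values from the remote group\n",
           wrank, remote_size);

    free(recvbuf);
    MPI_Comm_free(&inter);
    MPI_Comm_free(&local);
    MPI_Finalize();
    return 0;
}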

@bosilca
Member

bosilca commented Apr 26, 2022

What happens if you increase the list of allowed BTLs to "self,sm,tcp"?

@bosilca
Member

bosilca commented Apr 26, 2022

Also, the test name implies an allgather on an intercommunicator; I would not be surprised if the only algorithm for that case is the one provided by basic.

@bwbarrett
Member

I wouldn't expect --mca btl tcp to work, since the TCP BTL doesn't support send-to-self, but I would have expected an error when it tried to send to self. Are you sure it's using the OB1 PML at all? Since you're exporting FI_LOG_LEVEL, I assume there's a Libfabric build available.

@wckzhang
Contributor Author

You're correct, @bwbarrett. I ran with --mca mtl ^ofi and it failed with a different error, which means my original run was hanging with the ofi MTL.

However, I also ran with the "self,sm,tcp" BTLs and it hung as well, so I don't think this is related to the transport (MTL/BTL); it is probably a collective or some other issue:

[ec2-user@ip-10-0-0-73 ~]$ ~/colltest/bin/mpirun -np 2 -N 1 --mca btl "tcp,self,sm" --mca pml_base_verbose 100 --mca pml ob1 --mca mtl ^ofi  --hostfile ~/hostfile ~/ompi-tests/ibm/collective/intercomm/allgather_inter
Warning: Permanently added 'compute-st-c5n18xlarge-1,10.0.1.54' (ECDSA) to the list of known hosts.
Warning: Permanently added 'compute-st-c5n18xlarge-2,10.0.1.83' (ECDSA) to the list of known hosts.
[compute-st-c5n18xlarge-2:15256] mca: base: components_register: registering framework pml components
[compute-st-c5n18xlarge-2:15256] mca: base: components_register: found loaded component ob1
[compute-st-c5n18xlarge-2:15256] mca: base: components_register: component ob1 register function successful
[compute-st-c5n18xlarge-2:15256] mca: base: components_open: opening pml components
[compute-st-c5n18xlarge-2:15256] mca: base: components_open: found loaded component ob1
[compute-st-c5n18xlarge-2:15256] mca: base: components_open: component ob1 open function successful
[compute-st-c5n18xlarge-2:15256] select: initializing pml component ob1
[compute-st-c5n18xlarge-2:15256] select: init returned priority 20
[compute-st-c5n18xlarge-2:15256] selected ob1 best priority 20
[compute-st-c5n18xlarge-2:15256] select: component ob1 selected
[compute-st-c5n18xlarge-1:06182] mca: base: components_register: registering framework pml components
[compute-st-c5n18xlarge-1:06182] mca: base: components_register: found loaded component ob1
[compute-st-c5n18xlarge-1:06182] mca: base: components_register: component ob1 register function successful
[compute-st-c5n18xlarge-1:06182] mca: base: components_open: opening pml components
[compute-st-c5n18xlarge-1:06182] mca: base: components_open: found loaded component ob1
[compute-st-c5n18xlarge-1:06182] mca: base: components_open: component ob1 open function successful
[compute-st-c5n18xlarge-1:06182] select: initializing pml component ob1
[compute-st-c5n18xlarge-1:06182] select: init returned priority 20
[compute-st-c5n18xlarge-1:06182] selected ob1 best priority 20
[compute-st-c5n18xlarge-1:06182] select: component ob1 selected
[compute-st-c5n18xlarge-1:06182] check:select: PML check not necessary on self
[compute-st-c5n18xlarge-2:15256] check:select: checking my pml ob1 against process [[34634,1],0] pml ob1
--- hangs here ---

Note: the test also finishes successfully if I change it to -np 4 -n 2.
I'll try to take a look at the code and also check whether this issue exists in 4.1.x when I can.

@jjhursey
Member

I think this is the same as #8958 - the problem seems to be with PMIx connect/accept in v5.0.x and main.
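
For context, the PMIx connect/accept path mentioned above is the one driven at the MPI level by the dynamic-process calls MPI_Comm_accept / MPI_Comm_connect. A heavily simplified sketch of that pattern, for illustration only (not code from #8958 or from the ibm tests; the file name is made up):

/* connect_accept_sketch.c - illustrative only. Launch the server with
 * no arguments, note the port name it prints, then launch the client
 * with that port name as argv[1]. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    MPI_Comm inter;
    char port[MPI_MAX_PORT_NAME];

    if (argc > 1) {
        /* Client side: connect to the port name given on the command line. */
        strncpy(port, argv[1], MPI_MAX_PORT_NAME - 1);
        port[MPI_MAX_PORT_NAME - 1] = '\0';
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
        printf("client: connected\n");
    } else {
        /* Server side: open a port and wait for the client. */
        MPI_Open_port(MPI_INFO_NULL, port);
        printf("server: port is %s\n", port);
        fflush(stdout);
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
        printf("server: accepted\n");
        MPI_Close_port(port);
    }

    MPI_Comm_disconnect(&inter);
    MPI_Finalize();
    return 0;
}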

@bwbarrett
Member

Adding a 5.0 tag, since Wei reports this happens in 5.0 as well.

@awlauria
Contributor

This seems to be a duplicate of #8958. Should we close this one or the other one?

@wckzhang
Contributor Author

Doesn't matter; we can close this one if it's convenient for tracking.

@wzamazon
Contributor

I am positive this is the same as #8958, and I have a root cause.

Closing.
