
Disabling tcp btl causes issues for rdma osc when running IMB-RMA #9630


Open

dshrader opened this issue Nov 5, 2021 · 8 comments


dshrader commented Nov 5, 2021

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

4.1.1
Master nightly tarball from Nov. 4th

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

From release tarballs; only the --prefix=... configure parameter was used.

Please describe the system on which you are running

  • Operating system/version: based on RHEL 7
  • Computer hardware: Intel
  • Network type: Mellanox (ConnectX-5) with UCX 1.9.0

Details of the problem

I normally disable the tcp btl so that jobs fail with an obvious error if there is a problem with the high-speed interconnect's communication library; that way I know when optimal performance isn't happening. I have found, however, that disabling the tcp btl leads to problems with the rdma osc when trying to run IMB-RMA (the Intel MPI Benchmarks RMA suite). When I do that, a message like this appears:

--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[9281,1],4]) is on host: ko002
  Process 2 ([[9281,1],0]) is on host: ko001
  BTLs attempted: self sm

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------

This happens in both 4.1.1 and the nightly tarball from Nov 4th for master. If I enable the tcp btl, or disable the rdma osc, IMB-RMA runs just fine. I can also force the use of the ucx osc, and that seems to get IMB-RMA to work as well.

I am currently running IMB-RMA as follows:

# Fails
$> mpirun --map-by ppr:1:node --mca btl '^tcp' ./IMB-RMA
# Works
$> mpirun --map-by ppr:1:node --mca btl '^tcp' --mca osc '^rdma' ./IMB-RMA
$> mpirun --map-by ppr:1:node --mca btl '^tcp' --mca osc ucx ./IMB-RMA

In all cases, the ucx osc component is chosen, at least according to the output when I use --mca osc_base_verbose 100. So, it seems that the initialization of the rdma osc component has issues when the tcp btl is missing. Is this expected?
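For reference, a minimal sketch of one way to make this the default instead of passing it on every mpirun command line, assuming a standard install layout (the path below is the usual system-wide MCA parameter file under the install prefix):

# <install prefix>/etc/openmpi-mca-params.conf
btl = ^tcp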

Thanks,
David


dshrader commented Nov 8, 2021

After talking with @hppritcha, this may be the osc rdma component complaining that no inter-node btl is available. ompi_info for the master nightly tarball does list a uct btl, but that component is not referenced in the error message. I'll have to see whether the uct btl is being tried at all at runtime.
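A quick way to check both, assuming a stock install (ompi_info and btl_base_verbose are the standard tools here, nothing specific to this issue):

# Is the uct btl present in this build?
$> ompi_info | grep btl
# Is it attempted at runtime, and why does it fail?
$> mpirun --map-by ppr:1:node --mca btl '^tcp' --mca btl_base_verbose 100 ./IMB-RMA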


dshrader commented Nov 8, 2021

According to the verbose btl output, the uct btl is attempted, but its init fails:

select: init of component uct returned failure

So, if I turn off tcp, there really are no btls available for inter-node transport. I have not been able to figure out why the uct btl fails to init, however, so I don't know how else to test the theory that the lack of an inter-node btl is the root cause of the error when the osc rdma component is allowed to init.
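One way to dig into that, assuming ompi_info is available: dump the uct btl's MCA parameters and their defaults, which should include the btl_uct_memory_domains parameter mentioned below:

$> ompi_info --param btl uct --level 9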


hjelmn commented Nov 8, 2021

Hey, UCT fails to enable because we don't set a default memory domain (this is changing). I have been out of the loop on LANL systems for some time. Is this a new system with Mellanox HCAs? mlx5?
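If it helps, a quick sketch for listing the local device names that the memory domains are named after, assuming the usual verbs/UCX utilities are installed (these commands are generic, not Open MPI specific):

$> ibv_devices
$> ucx_info -d | grep Device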


dshrader commented Nov 8, 2021

@hjelmn, this machine has been around for three or four years... It has Mellanox ConnectX-5 cards. lspci says the card is in the MT27800 family.


hjelmn commented Nov 8, 2021

Try with --mca btl_uct_memory_domains mlx5_0,mlx4_0.
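If that works, a sketch of making it stick without editing every command line, using the standard OMPI_MCA_<param> environment-variable convention:

$> export OMPI_MCA_btl_uct_memory_domains=mlx5_0,mlx4_0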

dshrader commented:

Using the master nightly install, I ran the following:

$> mpirun --map-by ppr:1:node --mca btl '^tcp' --mca btl_uct_memory_domains mlx5_0,mlx4_0 ./IMB-RMA

This successfully started IMB-RMA, whereas leaving off the btl_uct_memory_domains option makes it quit right at the start with the message about the two ranks not being able to communicate. However, it later seg-faulted, which it does not do when I use --mca osc '^rdma'. That may be a separate issue... This does suggest that the original error in this issue is simply the rdma osc complaining when it doesn't have a btl that can do inter-node communication.

Would it be possible to have the rdma osc component not kill the MPI job when it isn't being used? In all of my test cases the ucx osc component gets chosen, so there seems to be no need for osc rdma to abort the job.
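For what it's worth, a sketch of how I have been confirming which osc component wins, plus a generic way to list the components' priority parameters (standard ompi_info usage, nothing specific to this build):

$> mpirun --map-by ppr:1:node --mca btl '^tcp' --mca osc_base_verbose 100 ./IMB-RMA
$> ompi_info --param osc all --level 9 | grep priority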

dshrader commented:

It looks like letting the rdma osc component init is somehow behind the segmentation fault. The following invocation leads to a seg fault in IMB-RMA:

$> mpirun --map-by ppr:1:node --mca btl '^tcp' --mca btl_uct_memory_domains mlx5_0,mlx4_0 ./IMB-RMA

The following invocation has IMB-RMA run fine:

$> mpirun --map-by ppr:1:node --mca btl '^tcp' --mca btl_uct_memory_domains mlx5_0,mlx4_0 --mca osc '^rdma' ./IMB-RMA

I thought that perhaps having the uct btl successfully init might be what leads to the seg fault, but the above implies it is not to blame, at least not on its own. The rdma osc seems to be the easy toggle for triggering the seg fault.
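In case it's useful, a generic sketch for getting a backtrace out of the seg fault (standard core-dump handling, nothing Open MPI specific; the core file name depends on the system's core_pattern setting, and the commands should be run on the node where the crashing rank lives):

$> ulimit -c unlimited
$> mpirun --map-by ppr:1:node --mca btl '^tcp' --mca btl_uct_memory_domains mlx5_0,mlx4_0 ./IMB-RMA
$> gdb -ex bt ./IMB-RMA core.<pid>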


dshrader commented Dec 2, 2021

Is this related to #7830? I may have misunderstood what #7830 is dealing with, but it seems quite similar: the rdma osc having issues when the btls cannot reach all ranks.
