Weird alltoallw segfault when libcuda and btl smcuda are present #7460

Open
leofang opened this issue Feb 24, 2020 · 6 comments

leofang commented Feb 24, 2020

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v4.0.2

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Tested both with a build from source and with the package installed from the conda-forge channel. In both cases, the build-time flag --with-cuda was set to turn on CUDA awareness.

Please describe the system on which you are running

  • Operating system/version:
    Linux (native) / Linux docker (with CUDA Toolkit and driver installed)
  • Computer hardware:
    N/A
  • Network type:
    (single node)

Details of the problem

This is a summary of the original bug report on mpi4py-fft's issue tracker.

I was running the test suite of mpi4py-fft and noticed an AssertionError when testing with 2 processes:

# in mpi4py-fft root
$ mpirun -n 2 python tests/test_mpifft.py

and with 4 processes things fell apart entirely, ending in a segfault:

$ mpirun -n 4 python tests/test_mpifft.py 
[xf03id-srv2:33127] *** Process received signal ***
[xf03id-srv2:33127] Signal: Segmentation fault (11)
[xf03id-srv2:33127] Signal code:  (128)
[xf03id-srv2:33127] Failing at address: (nil)
[xf03id-srv2:33129] *** Process received signal ***
[xf03id-srv2:33129] Signal: Segmentation fault (11)
[xf03id-srv2:33129] Signal code:  (128)
[xf03id-srv2:33129] Failing at address: (nil)
[xf03id-srv2:33127] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xf890)[0x7f53379fe890]
[xf03id-srv2:33127] [ 1] [xf03id-srv2:33129] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xf890)[0x7f2829250890]
[xf03id-srv2:33129] [ 1] /lib/x86_64-linux-gnu/libpthread.so.0(+0xb4d3)[0x7f282924c4d3]
[xf03id-srv2:33129] /lib/x86_64-linux-gnu/libpthread.so.0(+0xb4d3)[0x7f53379fa4d3]
[xf03id-srv2:33127] [ 2] [ 2] /home/leofang/.openmpi-4.0.2_cuda_9.2/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_request_progress_rndv+0x38f)[0x7f27f00cc8ef]
[xf03id-srv2:33129] [ 3] /home/leofang/.openmpi-4.0.2_cuda_9.2/lib/openmpi/mca_pml_ob1.so(+0x10a66)[0x7f27f00c3a66]
[xf03id-srv2:33129] [ 4] /home/leofang/.openmpi-4.0.2_cuda_9.2/lib/openmpi/mca_pml_ob1.so(+0x10c2d)[0x7f27f00c3c2d]
[xf03id-srv2:33129] [ 5] /home/leofang/.openmpi-4.0.2_cuda_9.2/lib/openmpi/mca_btl_smcuda.so(mca_btl_smcuda_component_progress+0x4ea)[0x7f27f1fb5c6a]
[xf03id-srv2:33129] [ 6] /home/leofang/.openmpi-4.0.2_cuda_9.2/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_request_progress_rndv+0x38f)[0x7f52f28208ef]
[xf03id-srv2:33127] [ 3] /home/leofang/.openmpi-4.0.2_cuda_9.2/lib/openmpi/mca_pml_ob1.so(+0x10a66)[0x7f52f2817a66]
[xf03id-srv2:33127] [ 4] /home/leofang/.openmpi-4.0.2_cuda_9.2/lib/openmpi/mca_pml_ob1.so(+0x10c2d)[0x7f52f2817c2d]
[xf03id-srv2:33127] [ 5] /home/leofang/.openmpi-4.0.2_cuda_9.2/lib/openmpi/mca_btl_smcuda.so(mca_btl_smcuda_component_progress+0x4ea)[0x7f530079ac6a]
[xf03id-srv2:33127] [ 6] /home/leofang/.openmpi-4.0.2_cuda_9.2/lib/libopen-pal.so.40(opal_progress+0x2c)[0x7f27fc95688c]
[xf03id-srv2:33129] [ 7] /home/leofang/.openmpi-4.0.2_cuda_9.2/lib/libopen-pal.so.40(opal_progress+0x2c)[0x7f530b10488c]
[xf03id-srv2:33127] [ 7] /home/leofang/.openmpi-4.0.2_cuda_9.2/lib/libopen-pal.so.40(ompi_sync_wait_mt+0xb5)[0x7f530b10af65]
[xf03id-srv2:33127] [ 8] /home/leofang/.openmpi-4.0.2_cuda_9.2/lib/libopen-pal.so.40(ompi_sync_wait_mt+0xb5)[0x7f27fc95cf65]
[xf03id-srv2:33129] [ 8] /home/leofang/.openmpi-4.0.2_cuda_9.2/lib/libmpi.so.40(ompi_request_default_wait_all+0x3bc)[0x7f530b6e251c]
[xf03id-srv2:33127] [ 9] /home/leofang/.openmpi-4.0.2_cuda_9.2/lib/libmpi.so.40(ompi_request_default_wait_all+0x3bc)[0x7f27fcf3451c]
[xf03id-srv2:33129] [ 9] /home/leofang/.openmpi-4.0.2_cuda_9.2/lib/openmpi/mca_coll_basic.so(mca_coll_basic_alltoallw_intra+0x232)[0x7f27eafa8632]
/home/leofang/.openmpi-4.0.2_cuda_9.2/lib/openmpi/mca_coll_basic.so(mca_coll_basic_alltoallw_intra+0x232)[0x7f52f17af632]
[xf03id-srv2:33127] [10] [xf03id-srv2:33129] [10] /home/leofang/.openmpi-4.0.2_cuda_9.2/lib/libmpi.so.40(PMPI_Alltoallw+0x23d)[0x7f530b6f87fd]
[xf03id-srv2:33127] [11] /home/leofang/conda_envs/mpi4py-fft_dev2/lib/python3.6/site-packages/mpi4py/MPI.cpython-36m-x86_64-linux-gnu.so(+0x10c771)[0x7f530babd771]
[xf03id-srv2:33127] [12] python(_PyCFunction_FastCallDict+0x154)[0x55559d645b44]
[xf03id-srv2:33127] [13] /home/leofang/.openmpi-4.0.2_cuda_9.2/lib/libmpi.so.40(PMPI_Alltoallw+0x23d)[0x7f27fcf4a7fd]
[xf03id-srv2:33129] [11] /home/leofang/conda_envs/mpi4py-fft_dev2/lib/python3.6/site-packages/mpi4py/MPI.cpython-36m-x86_64-linux-gnu.so(+0x10c771)[0x7f27fd30f771]
[xf03id-srv2:33129] [12] python(_PyCFunction_FastCallDict+0x154)[0x55b6b887bb44]
[xf03id-srv2:33129] [13] python(+0x1a155c)[0x55559d6d355c]
[xf03id-srv2:33127] [14] python(+0x1a155c)[0x55b6b890955c]
[xf03id-srv2:33129] [14] python(_PyEval_EvalFrameDefault+0x30a)[0x55559d6f87aa]
[xf03id-srv2:33127] [15] python(+0x171a5b)[0x55559d6a3a5b]
[xf03id-srv2:33127] [16] python(_PyEval_EvalFrameDefault+0x30a)[0x55b6b892e7aa]
[xf03id-srv2:33129] [15] python(+0x171a5b)[0x55b6b88d9a5b]
[xf03id-srv2:33129] [16] python(+0x1a1635)[0x55559d6d3635]
[xf03id-srv2:33127] [17] python(+0x1a1635)[0x55b6b8909635]
[xf03id-srv2:33129] [17] python(_PyEval_EvalFrameDefault+0x30a)[0x55559d6f87aa]
[xf03id-srv2:33127] [18] python(_PyEval_EvalFrameDefault+0x30a)[0x55b6b892e7aa]
[xf03id-srv2:33129] [18] python(+0x170cf6)[0x55559d6a2cf6]
[xf03id-srv2:33127] [19] python(+0x170cf6)[0x55b6b88d8cf6]
[xf03id-srv2:33129] [19] python(_PyFunction_FastCallDict+0x1bc)[0x55559d6a416c]
[xf03id-srv2:33127] [20] python(_PyFunction_FastCallDict+0x1bc)[0x55b6b88da16c]
[xf03id-srv2:33129] [20] python(_PyObject_FastCallDict+0x26f)[0x55559d645f0f]
[xf03id-srv2:33127] [21] python(_PyObject_FastCallDict+0x26f)[0x55b6b887bf0f]
[xf03id-srv2:33129] [21] python(_PyObject_Call_Prepend+0x63)[0x55559d64ab33]
[xf03id-srv2:33127] [22] python(_PyObject_Call_Prepend+0x63)[0x55b6b8880b33]
[xf03id-srv2:33129] [22] python(PyObject_Call+0x3e)[0x55559d64594e]
[xf03id-srv2:33127] [23] python(PyObject_Call+0x3e)[0x55b6b887b94e]
[xf03id-srv2:33129] [23] python(+0x15cde7)[0x55559d68ede7]
[xf03id-srv2:33127] [24] python(+0x15cde7)[0x55b6b88c4de7]
[xf03id-srv2:33129] [24] python(_PyObject_FastCallDict+0x8b)[0x55559d645d2b]
[xf03id-srv2:33127] [25] python(_PyObject_FastCallDict+0x8b)[0x55b6b887bd2b]
[xf03id-srv2:33129] [25] python(+0x1a16ae)[0x55559d6d36ae]
[xf03id-srv2:33127] [26] python(+0x1a16ae)[0x55b6b89096ae]
[xf03id-srv2:33129] [26] python(_PyEval_EvalFrameDefault+0x30a)[0x55559d6f87aa]
[xf03id-srv2:33127] [27] python(_PyEval_EvalFrameDefault+0x30a)[0x55b6b892e7aa]
[xf03id-srv2:33129] [27] python(+0x171a5b)[0x55559d6a3a5b]
[xf03id-srv2:33127] [28] python(+0x171a5b)[0x55b6b88d9a5b]
[xf03id-srv2:33129] [28] python(+0x1a1635)[0x55559d6d3635]
[xf03id-srv2:33127] [29] python(+0x1a1635)[0x55b6b8909635]
[xf03id-srv2:33129] [29] python(_PyEval_EvalFrameDefault+0x30a)[0x55559d6f87aa]
[xf03id-srv2:33127] *** End of error message ***
python(_PyEval_EvalFrameDefault+0x30a)[0x55b6b892e7aa]
[xf03id-srv2:33129] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node xf03id-srv2 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

We realized it's due to the presence of the smcuda btl, which got activated because we had the CUDA driver (libcuda) installed in our test environments, even though none of the code in mpi4py-fft uses the GPU. After ejecting smcuda, everything runs just fine:

# tested N = 1, 2, 4 
$ mpirun -n N --mca btl ^smcuda python tests/test_mpifft.py

My questions:

  1. Is this a known problem with Open MPI's CUDA support?
  2. Based on the segfault trace, it seems smcuda was invoked during the alltoallw() calls (likely from mpi4py-fft's Pencil code). Why does alltoallw() need smcuda even when we don't use the GPU?
  3. Is there a better fix than ejecting smcuda? Could we apply a patch or set some env vars when building Open MPI?

The 3rd question is the most urgent one: from conda-forge's maintenance viewpoint, it means we probably shouldn't turn on CUDA awareness by default in our Open MPI package, otherwise all non-GPU users and downstream packages (like mpi4py-fft) would be affected.

ps. I should add that, oddly, mpi4py's own test suite runs just fine with any number of processes and without ejecting smcuda. We were unable to reproduce the alltoallw segfault on the mpi4py side.
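
For context, a plain host-buffer Alltoallw on the mpi4py side looks roughly like the sketch below (a minimal, hypothetical example with equal-sized int blocks; it is not the actual mpi4py or mpi4py-fft test code):

# Hypothetical minimal host-buffer Alltoallw: every rank sends one MPI.INT
# to every peer; only CPU memory is involved.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()

itemsize = MPI.INT.Get_size()
sendbuf = np.full(size, rank, dtype='i')       # host buffer, no GPU involved
recvbuf = np.empty(size, dtype='i')

counts = [1] * size                            # one element per peer
displs = [i * itemsize for i in range(size)]   # byte displacements
types  = [MPI.INT] * size

comm.Alltoallw([sendbuf, counts, displs, types],
               [recvbuf, counts, displs, types])

assert recvbuf.tolist() == list(range(size))   # rank i contributed the value i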

ps2. For why and how CUDA-awareness was turned on in conda-forge's package, see conda-forge/openmpi-feedstock#42 and conda-forge/openmpi-feedstock#54; @jsquyres kindly offered help when we did that.

cc: @dalcinl @mikaem


leofang commented Feb 24, 2020

  1. Is this a known problem with Open MPI's CUDA support?

@Akshay-Venkatesh Are we seeing the same issue as the one reported in your #4650 (comment)? In our case it also involves mca_pml_ob1 (no idea what that is for), but we saw a segfault instead of a hang.

@jsquyres
Member

FYI @Akshay-Venkatesh


Akshay-Venkatesh commented Feb 24, 2020

@leofang Seeing the error in detail now.

For applications that don't use CUDA buffers with MPI, you could try an Open MPI build configured with --with-cuda=no; this should prevent smcuda from interfering. That is the build-time option. Alternatively, as a runtime option (as you've figured out), you can pass --mca btl ^smcuda to blacklist smcuda from being picked up at run time. I believe exporting OMPI_MCA_opal_cuda_support=false should have the same effect, but I've not checked this in a while. (answer to 3.)

As for why smcuda is being picked up for host transfers: based on a recent discussion with @jsquyres and @bwbarrett, it seems that when smcuda is available it is used for all intra-node transfers (be they host or GPU buffers), possibly because it has a higher btl priority than the other btls. (answer to 2.)

All that said, I'm not sure why smcuda is causing segfaults. smcuda handles host-to-host transfers in other settings (i.e., plain send/recv as opposed to alltoallw) where things work fine. This probably needs some investigation. (non-answer to 1.)

Edit: just verified that setting the env var OMPI_MCA_opal_cuda_support=false does have the desired effect of turning off CUDA support.
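
For comparison, the plain host-to-host send/recv pattern mentioned above, which is reported to work fine through smcuda, looks roughly like this minimal mpi4py sketch (illustrative only; the buffer size and tag are arbitrary, and it is not taken from any of the projects involved):

# Hypothetical minimal host-to-host Send/Recv between two ranks on one node;
# this kind of plain CPU-buffer traffic is the case reported to work fine.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = np.arange(1024, dtype='d')             # host (CPU) buffer
    comm.Send([data, MPI.DOUBLE], dest=1, tag=7)
elif rank == 1:
    data = np.empty(1024, dtype='d')
    comm.Recv([data, MPI.DOUBLE], source=0, tag=7)
    assert data[-1] == 1023.0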


leofang commented Feb 25, 2020

Thanks a lot, @Akshay-Venkatesh, for your thorough replies!

For applications that don't use CUDA buffers with MPI, you could try an Open MPI build configured with --with-cuda=no.

Ah, that's a real bummer for us then. The idea behind turning on CUDA awareness by default on conda-forge is to make Open MPI support both pure-CPU programs, as usual, and GPU programs when CUDA is present and used. One of the reasons that convinced us this was OK was Jeff mentioning that this strategy was a design decision made by the Open MPI devs and is adopted in many heterogeneous environments. I wonder why no one else has reported errors so far...

All that said, I'm not sure why smcuda is causing segfaults. smcuda handles host-to-host transfers in other settings (i.e., plain send/recv as opposed to alltoallw) where things work fine. This probably needs some investigation. (non-answer to 1.)

It'd be great if you could help us investigate a bit, Akshay. This is puzzling, since the mpi4py test suite wasn't able to catch the error. Something is going on that depends on whether smcuda is on the call path...

On our side, I'll try to look into the use case in mpi4py-fft and see if I can give you a minimal reproducer (and then fix the mpi4py tests...).

@ziotom78

Just to report that I am experiencing similar problems on my Manjaro Linux 23.0.0 system (64-bit Intel Core i7-4702MQ CPU @ 2.20GHz). Any code that uses allgather crashes with a segmentation fault, be it in Python or C/C++. The problem disappears once I set OMPI_MCA_opal_cuda_support=false.

Here is the minimal code that triggers the problem:

/**
 * @author RookieHPC
 * @brief Original source code at https://rookiehpc.org/mpi/docs/mpi_allgather/index.html
 **/

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);

    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if(size != 3)
    {
        printf("This application is meant to be run with 3 MPI processes.\n");
        MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
    }

    int my_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    int my_value = my_rank * 100;
    printf("Process %d, my value = %d.\n", my_rank, my_value);

    int buffer[3];
    MPI_Allgather(&my_value, 1, MPI_INT, buffer, 1, MPI_INT, MPI_COMM_WORLD);
    printf("Values collected on process %d: %d, %d, %d.\n", my_rank, buffer[0], buffer[1], buffer[2]);

    MPI_Finalize();

    return EXIT_SUCCESS;
}

Here is the output if I run mpiexec:

$ mpiexec --mca btl_base_verbose 5 -np 3 ./test
[maurizio-tombook:05509] mca: bml: Using smcuda btl for send to [[37520,1],0] on node maurizio-tombook
[maurizio-tombook:05509] mca: bml: Using smcuda btl for send to [[37520,1],1] on node maurizio-tombook
[maurizio-tombook:05509] mca: bml: Using self btl for send to [[37520,1],2] on node maurizio-tombook
[maurizio-tombook:05508] mca: bml: Using smcuda btl for send to [[37520,1],0] on node maurizio-tombook
[maurizio-tombook:05508] mca: bml: Using smcuda btl for send to [[37520,1],2] on node maurizio-tombook
[maurizio-tombook:05508] mca: bml: Using self btl for send to [[37520,1],1] on node maurizio-tombook
[maurizio-tombook:05507] mca: bml: Using smcuda btl for send to [[37520,1],1] on node maurizio-tombook
[maurizio-tombook:05507] mca: bml: Using smcuda btl for send to [[37520,1],2] on node maurizio-tombook
[maurizio-tombook:05507] mca: bml: Using self btl for send to [[37520,1],0] on node maurizio-tombook
Process 1, my value = 100.
Process 2, my value = 200.
Process 0, my value = 0.
[maurizio-tombook:05507] *** Process received signal ***
[maurizio-tombook:05507] Signal: Segmentation fault (11)
[maurizio-tombook:05507] Signal code: Address not mapped (1)
[maurizio-tombook:05507] Failing at address: (nil)
[maurizio-tombook:05509] *** Process received signal ***
[maurizio-tombook:05509] Signal: Segmentation fault (11)
[maurizio-tombook:05509] Signal code: Address not mapped (1)
[maurizio-tombook:05509] Failing at address: (nil)
[maurizio-tombook:05508] *** Process received signal ***
[maurizio-tombook:05508] Signal: Segmentation fault (11)
[maurizio-tombook:05508] Signal code: Address not mapped (1)
[maurizio-tombook:05508] Failing at address: (nil)
[maurizio-tombook:05508] [ 0] [maurizio-tombook:05507] [ 0] [maurizio-tombook:05509] [ 0] /usr/lib/libc.so.6(+0x39ab0)[0x7f63f5d54ab0]
[maurizio-tombook:05508] *** End of error message ***
/usr/lib/libc.so.6(+0x39ab0)[0x7f101a2f7ab0]
[maurizio-tombook:05509] *** End of error message ***
/usr/lib/libc.so.6(+0x39ab0)[0x7f6e305b9ab0]
[maurizio-tombook:05507] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 1 with PID 0 on node maurizio-tombook exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

And here is the output if I use OMPI_MCA_opal_cuda_support=false:

$ OMPI_MCA_opal_cuda_support=false mpiexec --mca btl_base_verbose 5 -np 3 ./test
[maurizio-tombook:05577] mca: bml: Using self btl for send to [[37419,1],1] on node maurizio-tombook
[maurizio-tombook:05578] mca: bml: Using self btl for send to [[37419,1],2] on node maurizio-tombook
[maurizio-tombook:05576] mca: bml: Using self btl for send to [[37419,1],0] on node maurizio-tombook
[maurizio-tombook:05577] mca: bml: Using vader btl for send to [[37419,1],0] on node maurizio-tombook
[maurizio-tombook:05577] mca: bml: Using vader btl for send to [[37419,1],2] on node maurizio-tombook
[maurizio-tombook:05578] mca: bml: Using vader btl for send to [[37419,1],0] on node maurizio-tombook
[maurizio-tombook:05578] mca: bml: Using vader btl for send to [[37419,1],1] on node maurizio-tombook
[maurizio-tombook:05576] mca: bml: Using vader btl for send to [[37419,1],1] on node maurizio-tombook
[maurizio-tombook:05576] mca: bml: Using vader btl for send to [[37419,1],2] on node maurizio-tombook
Process 2, my value = 200.
Process 1, my value = 100.
Process 0, my value = 0.
Values collected on process 2: 0, 100, 200.
Values collected on process 0: 0, 100, 200.
Values collected on process 1: 0, 100, 200.

Sorry if this adds nothing new to the discussion; it was just something I wanted to report.


bosilca commented Jul 13, 2023

I cannot replicate this with OMPI 4.1.x and OMPI 5.x-rc* (from git). Even if I force smcuda, the result is correct.

$ mpirun --mca btl_base_verbose 5  -np 3 --mca pml ob1 --mca btl self,vader,smcuda ./allgather
[XXX:3342089] mca: bml: Using smcuda btl for send to [[23095,1],1] on node XXX
[XXX:3342089] mca: bml: Using smcuda btl for send to [[23095,1],2] on node XXX
[XXX:3342089] mca: bml: Using self btl for send to [[23095,1],0] on node XXX
[XXX:3342090] mca: bml: Using smcuda btl for send to [[23095,1],0] on node XXX
[XXX:3342090] mca: bml: Using smcuda btl for send to [[23095,1],2] on node XXX
[XXX:3342090] mca: bml: Using self btl for send to [[23095,1],1] on node XXX
[XXX:3342091] mca: bml: Using smcuda btl for send to [[23095,1],0] on node XXX
[XXX:3342091] mca: bml: Using smcuda btl for send to [[23095,1],1] on node XXX
[XXX:3342091] mca: bml: Using self btl for send to [[23095,1],2] on node XXX
Process 0, my value = 0.
Process 1, my value = 100.
Process 2, my value = 200.
Values collected on process 0: 0, 100, 200.
Values collected on process 1: 0, 100, 200.
Values collected on process 2: 0, 100, 200.
