Weird alltoallw segfault when libcuda and btl smcuda are present #7460

Open
leofang opened this issue Feb 24, 2020 · 6 comments

leofang commented Feb 24, 2020

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v4.0.2

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Tested both with a build from source and with the package installed from the conda-forge channel. In both cases, the build-time flag --with-cuda was set to turn on CUDA awareness.

Please describe the system on which you are running

  • Operating system/version:
    Linux (native) / Linux docker (with CUDA Toolkit and driver installed)
  • Computer hardware:
    N/A
  • Network type:
    (single node)

Details of the problem

This is a summary of the original bug report on mpi4py-fft's issue tracker.

I was running the test suite of mpi4py-fft and noticed an AssertionError when testing with 2 processes:

# in mpi4py-fft root
$ mpirun -n 2 python tests/test_mpifft.py

and with 4 processes things fell apart entirely, ending in a segfault:

$ mpirun -n 4 python tests/test_mpifft.py 
[xf03id-srv2:33127] *** Process received signal ***
[xf03id-srv2:33127] Signal: Segmentation fault (11)
[xf03id-srv2:33127] Signal code:  (128)
[xf03id-srv2:33127] Failing at address: (nil)
[xf03id-srv2:33129] *** Process received signal ***
[xf03id-srv2:33129] Signal: Segmentation fault (11)
[xf03id-srv2:33129] Signal code:  (128)
[xf03id-srv2:33129] Failing at address: (nil)
[xf03id-srv2:33127] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xf890)[0x7f53379fe890]
[xf03id-srv2:33127] [ 1] [xf03id-srv2:33129] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xf890)[0x7f2829250890]
[xf03id-srv2:33129] [ 1] /lib/x86_64-linux-gnu/libpthread.so.0(+0xb4d3)[0x7f282924c4d3]
[xf03id-srv2:33129] /lib/x86_64-linux-gnu/libpthread.so.0(+0xb4d3)[0x7f53379fa4d3]
[xf03id-srv2:33127] [ 2] [ 2] /home/leofang/.openmpi-4.0.2_cuda_9.2/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_request_progress_rndv+0x38f)[0x7f27f00cc8ef]
[xf03id-srv2:33129] [ 3] /home/leofang/.openmpi-4.0.2_cuda_9.2/lib/openmpi/mca_pml_ob1.so(+0x10a66)[0x7f27f00c3a66]
[xf03id-srv2:33129] [ 4] /home/leofang/.openmpi-4.0.2_cuda_9.2/lib/openmpi/mca_pml_ob1.so(+0x10c2d)[0x7f27f00c3c2d]
[xf03id-srv2:33129] [ 5] /home/leofang/.openmpi-4.0.2_cuda_9.2/lib/openmpi/mca_btl_smcuda.so(mca_btl_smcuda_component_progress+0x4ea)[0x7f27f1fb5c6a]
[xf03id-srv2:33129] [ 6] /home/leofang/.openmpi-4.0.2_cuda_9.2/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_request_progress_rndv+0x38f)[0x7f52f28208ef]
[xf03id-srv2:33127] [ 3] /home/leofang/.openmpi-4.0.2_cuda_9.2/lib/openmpi/mca_pml_ob1.so(+0x10a66)[0x7f52f2817a66]
[xf03id-srv2:33127] [ 4] /home/leofang/.openmpi-4.0.2_cuda_9.2/lib/openmpi/mca_pml_ob1.so(+0x10c2d)[0x7f52f2817c2d]
[xf03id-srv2:33127] [ 5] /home/leofang/.openmpi-4.0.2_cuda_9.2/lib/openmpi/mca_btl_smcuda.so(mca_btl_smcuda_component_progress+0x4ea)[0x7f530079ac6a]
[xf03id-srv2:33127] [ 6] /home/leofang/.openmpi-4.0.2_cuda_9.2/lib/libopen-pal.so.40(opal_progress+0x2c)[0x7f27fc95688c]
[xf03id-srv2:33129] [ 7] /home/leofang/.openmpi-4.0.2_cuda_9.2/lib/libopen-pal.so.40(opal_progress+0x2c)[0x7f530b10488c]
[xf03id-srv2:33127] [ 7] /home/leofang/.openmpi-4.0.2_cuda_9.2/lib/libopen-pal.so.40(ompi_sync_wait_mt+0xb5)[0x7f530b10af65]
[xf03id-srv2:33127] [ 8] /home/leofang/.openmpi-4.0.2_cuda_9.2/lib/libopen-pal.so.40(ompi_sync_wait_mt+0xb5)[0x7f27fc95cf65]
[xf03id-srv2:33129] [ 8] /home/leofang/.openmpi-4.0.2_cuda_9.2/lib/libmpi.so.40(ompi_request_default_wait_all+0x3bc)[0x7f530b6e251c]
[xf03id-srv2:33127] [ 9] /home/leofang/.openmpi-4.0.2_cuda_9.2/lib/libmpi.so.40(ompi_request_default_wait_all+0x3bc)[0x7f27fcf3451c]
[xf03id-srv2:33129] [ 9] /home/leofang/.openmpi-4.0.2_cuda_9.2/lib/openmpi/mca_coll_basic.so(mca_coll_basic_alltoallw_intra+0x232)[0x7f27eafa8632]
/home/leofang/.openmpi-4.0.2_cuda_9.2/lib/openmpi/mca_coll_basic.so(mca_coll_basic_alltoallw_intra+0x232)[0x7f52f17af632]
[xf03id-srv2:33127] [10] [xf03id-srv2:33129] [10] /home/leofang/.openmpi-4.0.2_cuda_9.2/lib/libmpi.so.40(PMPI_Alltoallw+0x23d)[0x7f530b6f87fd]
[xf03id-srv2:33127] [11] /home/leofang/conda_envs/mpi4py-fft_dev2/lib/python3.6/site-packages/mpi4py/MPI.cpython-36m-x86_64-linux-gnu.so(+0x10c771)[0x7f530babd771]
[xf03id-srv2:33127] [12] python(_PyCFunction_FastCallDict+0x154)[0x55559d645b44]
[xf03id-srv2:33127] [13] /home/leofang/.openmpi-4.0.2_cuda_9.2/lib/libmpi.so.40(PMPI_Alltoallw+0x23d)[0x7f27fcf4a7fd]
[xf03id-srv2:33129] [11] /home/leofang/conda_envs/mpi4py-fft_dev2/lib/python3.6/site-packages/mpi4py/MPI.cpython-36m-x86_64-linux-gnu.so(+0x10c771)[0x7f27fd30f771]
[xf03id-srv2:33129] [12] python(_PyCFunction_FastCallDict+0x154)[0x55b6b887bb44]
[xf03id-srv2:33129] [13] python(+0x1a155c)[0x55559d6d355c]
[xf03id-srv2:33127] [14] python(+0x1a155c)[0x55b6b890955c]
[xf03id-srv2:33129] [14] python(_PyEval_EvalFrameDefault+0x30a)[0x55559d6f87aa]
[xf03id-srv2:33127] [15] python(+0x171a5b)[0x55559d6a3a5b]
[xf03id-srv2:33127] [16] python(_PyEval_EvalFrameDefault+0x30a)[0x55b6b892e7aa]
[xf03id-srv2:33129] [15] python(+0x171a5b)[0x55b6b88d9a5b]
[xf03id-srv2:33129] [16] python(+0x1a1635)[0x55559d6d3635]
[xf03id-srv2:33127] [17] python(+0x1a1635)[0x55b6b8909635]
[xf03id-srv2:33129] [17] python(_PyEval_EvalFrameDefault+0x30a)[0x55559d6f87aa]
[xf03id-srv2:33127] [18] python(_PyEval_EvalFrameDefault+0x30a)[0x55b6b892e7aa]
[xf03id-srv2:33129] [18] python(+0x170cf6)[0x55559d6a2cf6]
[xf03id-srv2:33127] [19] python(+0x170cf6)[0x55b6b88d8cf6]
[xf03id-srv2:33129] [19] python(_PyFunction_FastCallDict+0x1bc)[0x55559d6a416c]
[xf03id-srv2:33127] [20] python(_PyFunction_FastCallDict+0x1bc)[0x55b6b88da16c]
[xf03id-srv2:33129] [20] python(_PyObject_FastCallDict+0x26f)[0x55559d645f0f]
[xf03id-srv2:33127] [21] python(_PyObject_FastCallDict+0x26f)[0x55b6b887bf0f]
[xf03id-srv2:33129] [21] python(_PyObject_Call_Prepend+0x63)[0x55559d64ab33]
[xf03id-srv2:33127] [22] python(_PyObject_Call_Prepend+0x63)[0x55b6b8880b33]
[xf03id-srv2:33129] [22] python(PyObject_Call+0x3e)[0x55559d64594e]
[xf03id-srv2:33127] [23] python(PyObject_Call+0x3e)[0x55b6b887b94e]
[xf03id-srv2:33129] [23] python(+0x15cde7)[0x55559d68ede7]
[xf03id-srv2:33127] [24] python(+0x15cde7)[0x55b6b88c4de7]
[xf03id-srv2:33129] [24] python(_PyObject_FastCallDict+0x8b)[0x55559d645d2b]
[xf03id-srv2:33127] [25] python(_PyObject_FastCallDict+0x8b)[0x55b6b887bd2b]
[xf03id-srv2:33129] [25] python(+0x1a16ae)[0x55559d6d36ae]
[xf03id-srv2:33127] [26] python(+0x1a16ae)[0x55b6b89096ae]
[xf03id-srv2:33129] [26] python(_PyEval_EvalFrameDefault+0x30a)[0x55559d6f87aa]
[xf03id-srv2:33127] [27] python(_PyEval_EvalFrameDefault+0x30a)[0x55b6b892e7aa]
[xf03id-srv2:33129] [27] python(+0x171a5b)[0x55559d6a3a5b]
[xf03id-srv2:33127] [28] python(+0x171a5b)[0x55b6b88d9a5b]
[xf03id-srv2:33129] [28] python(+0x1a1635)[0x55559d6d3635]
[xf03id-srv2:33127] [29] python(+0x1a1635)[0x55b6b8909635]
[xf03id-srv2:33129] [29] python(_PyEval_EvalFrameDefault+0x30a)[0x55559d6f87aa]
[xf03id-srv2:33127] *** End of error message ***
python(_PyEval_EvalFrameDefault+0x30a)[0x55b6b892e7aa]
[xf03id-srv2:33129] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node xf03id-srv2 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

We realized it's due to the presence of the smcuda btl, which got activated because we had the CUDA driver (libcuda) installed in our test environments, even though none of the code in mpi4py-fft uses the GPU. After ejecting smcuda, everything runs just fine:

# tested N = 1, 2, 4 
$ mpirun -n N --mca btl ^smcuda python tests/test_mpifft.py

My questions:

  1. Is this a known problem with Open MPI's CUDA support?
  2. Based on the segfault trace, it seems smcuda was invoked during the alltoallw() calls (likely from mpi4py-fft's Pencil code). Why does alltoallw() need smcuda even when we don't use the GPU?
  3. Is there a better fix than ejecting smcuda? Could we apply a patch or set some env vars when building Open MPI?

The 3rd question is the most urgent one: from conda-forge's maintenance viewpoint, it means we probably shouldn't turn on CUDA awareness by default in our Open MPI package, otherwise all non-GPU users and downstream packages (like mpi4py-fft) would be affected.

ps. I should add that, oddly, mpi4py's own test suite runs just fine with any number of processes and without ejecting smcuda. We were unable to reproduce the alltoallw segfault on the mpi4py side.
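
For context, a plain host-buffer Alltoallw on the mpi4py side looks roughly like the sketch below (a minimal, hypothetical example with equal-sized int blocks; it is not the actual mpi4py or mpi4py-fft test code):

# Hypothetical minimal host-buffer Alltoallw: every rank sends one MPI.INT
# to every peer; only CPU memory is involved.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()

itemsize = MPI.INT.Get_size()
sendbuf = np.full(size, rank, dtype='i')       # host buffer, no GPU involved
recvbuf = np.empty(size, dtype='i')

counts = [1] * size                            # one element per peer
displs = [i * itemsize for i in range(size)]   # byte displacements
types  = [MPI.INT] * size

comm.Alltoallw([sendbuf, counts, displs, types],
               [recvbuf, counts, displs, types])

assert recvbuf.tolist() == list(range(size))   # rank i contributed the value i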

ps2. For why and how CUDA-awareness was turned on in conda-forge's package, see conda-forge/openmpi-feedstock#42 and conda-forge/openmpi-feedstock#54; @jsquyres kindly offered help when we did that.

cc: @dalcinl @mikaem


leofang commented Feb 24, 2020

  1. Is this a known problem with Open MPI's CUDA support?

@Akshay-Venkatesh Are we seeing the same issue as the one reported in your #4650 (comment)? In our case it also involves mca_pml_ob1 (no idea what that is for), but we saw a segfault instead of a hang.

@jsquyres
Member

FYI @Akshay-Venkatesh


Akshay-Venkatesh commented Feb 24, 2020

@leofang Seeing the error in detail now.

For applications that don't use CUDA buffers with MPI, you could try an Open MPI build configured with --with-cuda=no; this should prevent smcuda from interfering. That is the build-time option. Alternatively, as a runtime option (as you've figured out), you can pass --mca btl ^smcuda to blacklist smcuda from being picked up at run time. I believe exporting OMPI_MCA_opal_cuda_support=false should have the same effect, but I've not checked this in a while. (answer to 3.)

As for why smcuda is being picked up for host transfers: based on a recent discussion with @jsquyres and @bwbarrett, it seems that when smcuda is available it is used for all intra-node transfers (be they host or GPU buffers), possibly because it has a higher btl priority than the other btls. (answer to 2.)

All that said, I'm not sure why smcuda is causing segfaults. smcuda handles host-to-host transfers in other settings (i.e., plain send/recv as opposed to alltoallw) where things work fine. This probably needs some investigation. (non-answer to 1.)

Edit: just verified that setting the env var OMPI_MCA_opal_cuda_support=false does have the desired effect of turning off CUDA support.
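
For comparison, the plain host-to-host send/recv pattern mentioned above, which is reported to work fine through smcuda, looks roughly like this minimal mpi4py sketch (illustrative only; the buffer size and tag are arbitrary, and it is not taken from any of the projects involved):

# Hypothetical minimal host-to-host Send/Recv between two ranks on one node;
# this kind of plain CPU-buffer traffic is the case reported to work fine.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = np.arange(1024, dtype='d')             # host (CPU) buffer
    comm.Send([data, MPI.DOUBLE], dest=1, tag=7)
elif rank == 1:
    data = np.empty(1024, dtype='d')
    comm.Recv([data, MPI.DOUBLE], source=0, tag=7)
    assert data[-1] == 1023.0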


leofang commented Feb 25, 2020

Thanks a lot, @Akshay-Venkatesh, for your thorough replies!

For applications that don't use CUDA buffers with MPI, you could try an Open MPI build configured with --with-cuda=no.

Ah, that's a real bummer for us then. The idea behind turning on CUDA awareness by default on conda-forge is to make Open MPI support both pure-CPU programs, as usual, and GPU programs when CUDA is present and used. One of the reasons that convinced us this was OK was Jeff mentioning that this strategy was a design decision made by the Open MPI devs and is adopted in many heterogeneous environments. I wonder why no one else has reported errors so far...

All that said, I'm not sure why smcuda is causing segfaults. smcuda handles host-to-host transfers in other settings (i.e., plain send/recv as opposed to alltoallw) where things work fine. This probably needs some investigation. (non-answer to 1.)

It'd be great if you could help us investigate a bit, Akshay. This is puzzling, since the mpi4py test suite wasn't able to catch the error. Something is going on that depends on whether smcuda is on the call path...

On our side, I'll try to look into the use case in mpi4py-fft and see if I can give you a minimal reproducer (and then fix the mpi4py tests...).

@ziotom78

Just to report that I am experiencing similar problems on my Manjaro Linux 23.0.0 system (64-bit Intel Core i7-4702MQ CPU @ 2.20GHz). Any code that uses allgather crashes with a segmentation fault, be it in Python or C/C++. The problem disappears once I set OMPI_MCA_opal_cuda_support=false.

Here is the minimal code that triggers the problem:

/**
 * @author RookieHPC
 * @brief Original source code at https://rookiehpc.org/mpi/docs/mpi_allgather/index.html
 **/

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);

    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if(size != 3)
    {
        printf("This application is meant to be run with 3 MPI processes.\n");
        MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
    }

    int my_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    int my_value = my_rank * 100;
    printf("Process %d, my value = %d.\n", my_rank, my_value);

    int buffer[3];
    MPI_Allgather(&my_value, 1, MPI_INT, buffer, 1, MPI_INT, MPI_COMM_WORLD);
    printf("Values collected on process %d: %d, %d, %d.\n", my_rank, buffer[0], buffer[1], buffer[2]);

    MPI_Finalize();

    return EXIT_SUCCESS;
}

Here is the output if I run mpiexec:

$ mpiexec --mca btl_base_verbose 5 -np 3 ./test
[maurizio-tombook:05509] mca: bml: Using smcuda btl for send to [[37520,1],0] on node maurizio-tombook
[maurizio-tombook:05509] mca: bml: Using smcuda btl for send to [[37520,1],1] on node maurizio-tombook
[maurizio-tombook:05509] mca: bml: Using self btl for send to [[37520,1],2] on node maurizio-tombook
[maurizio-tombook:05508] mca: bml: Using smcuda btl for send to [[37520,1],0] on node maurizio-tombook
[maurizio-tombook:05508] mca: bml: Using smcuda btl for send to [[37520,1],2] on node maurizio-tombook
[maurizio-tombook:05508] mca: bml: Using self btl for send to [[37520,1],1] on node maurizio-tombook
[maurizio-tombook:05507] mca: bml: Using smcuda btl for send to [[37520,1],1] on node maurizio-tombook
[maurizio-tombook:05507] mca: bml: Using smcuda btl for send to [[37520,1],2] on node maurizio-tombook
[maurizio-tombook:05507] mca: bml: Using self btl for send to [[37520,1],0] on node maurizio-tombook
Process 1, my value = 100.
Process 2, my value = 200.
Process 0, my value = 0.
[maurizio-tombook:05507] *** Process received signal ***
[maurizio-tombook:05507] Signal: Segmentation fault (11)
[maurizio-tombook:05507] Signal code: Address not mapped (1)
[maurizio-tombook:05507] Failing at address: (nil)
[maurizio-tombook:05509] *** Process received signal ***
[maurizio-tombook:05509] Signal: Segmentation fault (11)
[maurizio-tombook:05509] Signal code: Address not mapped (1)
[maurizio-tombook:05509] Failing at address: (nil)
[maurizio-tombook:05508] *** Process received signal ***
[maurizio-tombook:05508] Signal: Segmentation fault (11)
[maurizio-tombook:05508] Signal code: Address not mapped (1)
[maurizio-tombook:05508] Failing at address: (nil)
[maurizio-tombook:05508] [ 0] [maurizio-tombook:05507] [ 0] [maurizio-tombook:05509] [ 0] /usr/lib/libc.so.6(+0x39ab0)[0x7f63f5d54ab0]
[maurizio-tombook:05508] *** End of error message ***
/usr/lib/libc.so.6(+0x39ab0)[0x7f101a2f7ab0]
[maurizio-tombook:05509] *** End of error message ***
/usr/lib/libc.so.6(+0x39ab0)[0x7f6e305b9ab0]
[maurizio-tombook:05507] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 1 with PID 0 on node maurizio-tombook exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

And here is the output if I use OMPI_MCA_opal_cuda_support=false:

$ OMPI_MCA_opal_cuda_support=false mpiexec --mca btl_base_verbose 5 -np 3 ./test
[maurizio-tombook:05577] mca: bml: Using self btl for send to [[37419,1],1] on node maurizio-tombook
[maurizio-tombook:05578] mca: bml: Using self btl for send to [[37419,1],2] on node maurizio-tombook
[maurizio-tombook:05576] mca: bml: Using self btl for send to [[37419,1],0] on node maurizio-tombook
[maurizio-tombook:05577] mca: bml: Using vader btl for send to [[37419,1],0] on node maurizio-tombook
[maurizio-tombook:05577] mca: bml: Using vader btl for send to [[37419,1],2] on node maurizio-tombook
[maurizio-tombook:05578] mca: bml: Using vader btl for send to [[37419,1],0] on node maurizio-tombook
[maurizio-tombook:05578] mca: bml: Using vader btl for send to [[37419,1],1] on node maurizio-tombook
[maurizio-tombook:05576] mca: bml: Using vader btl for send to [[37419,1],1] on node maurizio-tombook
[maurizio-tombook:05576] mca: bml: Using vader btl for send to [[37419,1],2] on node maurizio-tombook
Process 2, my value = 200.
Process 1, my value = 100.
Process 0, my value = 0.
Values collected on process 2: 0, 100, 200.
Values collected on process 0: 0, 100, 200.
Values collected on process 1: 0, 100, 200.

Sorry if this adds nothing new to the discussion; it was just something I wanted to report.


bosilca commented Jul 13, 2023

I cannot replicate this with OMPI 4.1.x and OMPI 5.x-rc* (from git). Even if I force smcuda, the result is correct.

$ mpirun --mca btl_base_verbose 5  -np 3 --mca pml ob1 --mca btl self,vader,smcuda ./allgather
[XXX:3342089] mca: bml: Using smcuda btl for send to [[23095,1],1] on node XXX
[XXX:3342089] mca: bml: Using smcuda btl for send to [[23095,1],2] on node XXX
[XXX:3342089] mca: bml: Using self btl for send to [[23095,1],0] on node XXX
[XXX:3342090] mca: bml: Using smcuda btl for send to [[23095,1],0] on node XXX
[XXX:3342090] mca: bml: Using smcuda btl for send to [[23095,1],2] on node XXX
[XXX:3342090] mca: bml: Using self btl for send to [[23095,1],1] on node XXX
[XXX:3342091] mca: bml: Using smcuda btl for send to [[23095,1],0] on node XXX
[XXX:3342091] mca: bml: Using smcuda btl for send to [[23095,1],1] on node XXX
[XXX:3342091] mca: bml: Using self btl for send to [[23095,1],2] on node XXX
Process 0, my value = 0.
Process 1, my value = 100.
Process 2, my value = 200.
Values collected on process 0: 0, 100, 200.
Values collected on process 1: 0, 100, 200.
Values collected on process 2: 0, 100, 200.
