Weird alltoallw segfault when libcuda and btl smcuda are present #7460
Comments
@Akshay-Venkatesh Are we seeing the same issue as reported in your #4650 (comment)? In our case it also involves mca_pml_ob1 (no idea what that's for), but we saw a segfault instead of a hang.
@leofang Seeing the error in detail now. For applications that don't use cuda buffers with MPI, you could try using an OpenMPI build that's configured with …

As for why smcuda is being picked up for host transfers -- based on a recent discussion with @jsquyres and @bwbarrett, it seems that when smcuda is available, it's used for all intra-node transfers (be it host or gpu buffers), possibly due to a higher btl priority than the other btls. (answer to 2.)

All that said, I'm not sure why smcuda is causing segfaults. smcuda is used for just host-to-host transfers in other settings (i.e. plain send/recv as opposed to alltoallw) where things work fine. This probably needs some investigation. (non-answer to 1.)

edit: Just verified that setting env var …
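(The exact configure option and environment variable referenced above were cut off in this capture. Assuming they refer to excluding or not building smcuda, here is a sketch of the usual forms: `--mca btl ^smcuda` is the workaround confirmed elsewhere in this thread, while the `--enable-mca-no-build` line is only a guess at the truncated configure suggestion.)

```
# Run-time: exclude the smcuda btl for a single run
$ mpirun -n 4 --mca btl ^smcuda python tests/test_mpifft.py

# Same exclusion via an environment variable (Open MPI reads MCA params from OMPI_MCA_*)
$ export OMPI_MCA_btl='^smcuda'

# Build-time: skip building the smcuda component entirely (guessed intent of the
# truncated configure suggestion above)
$ ./configure --with-cuda --enable-mca-no-build=btl-smcuda ...

# Inspect smcuda's MCA parameters (e.g. its priority)
$ ompi_info --param btl smcuda --level 9
```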
Thanks a lot, @Akshay-Venkatesh, for your thorough replies!
Ah, it's a real bummer for us then. The idea of turning on CUDA awareness by default on conda-forge is to make Open MPI support both pure-CPU programs as usual and GPU programs when CUDA is present and used. One of the reasons that convinced us this was OK was Jeff mentioning that this strategy was a design decision made by the Open MPI devs and is adopted in many heterogeneous environments. I wonder why no one else has reported errors so far...
It'd be great if you could help us investigate a bit, Akshay. This is puzzling, since the mpi4py test suite wasn't able to catch the error; something is going on depending on whether smcuda is along the call path... On our side, I'll try to look into the use case in mpi4py-fft and see if I can give you a minimal reproducer (and then fix the mpi4py tests...).
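(For illustration only: a minimal sketch of what a host-buffer `MPI_Alltoallw` reproducer could look like. This is not the mpi4py-fft code path; it just exercises the same collective with made-up data, every rank sending one `MPI_INT` to every other rank.)

```c
/* Illustrative MPI_Alltoallw sketch: all-to-all exchange of one int per peer,
 * using plain host buffers only. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);

    int size, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int *sendbuf = malloc(size * sizeof(int));
    int *recvbuf = malloc(size * sizeof(int));
    int *counts  = malloc(size * sizeof(int));
    int *displs  = malloc(size * sizeof(int));
    MPI_Datatype *types = malloc(size * sizeof(MPI_Datatype));

    for (int i = 0; i < size; i++)
    {
        sendbuf[i] = rank * 100 + i;        /* distinct payload per destination */
        counts[i]  = 1;
        displs[i]  = i * (int)sizeof(int);  /* MPI_Alltoallw displacements are in bytes */
        types[i]   = MPI_INT;
    }

    MPI_Alltoallw(sendbuf, counts, displs, types,
                  recvbuf, counts, displs, types, MPI_COMM_WORLD);

    printf("Rank %d received:", rank);
    for (int i = 0; i < size; i++)
        printf(" %d", recvbuf[i]);
    printf("\n");

    free(sendbuf); free(recvbuf); free(counts); free(displs); free(types);
    MPI_Finalize();
    return EXIT_SUCCESS;
}
```

Running something like this with and without `--mca btl ^smcuda` on a single node would show whether the plain collective path is enough to trigger the problem.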
Just to report that I am experiencing similar problems on my Manjaro Linux 23.0.0 system (64-bit Intel Core i7-4702MQ CPU @ 2.20GHz). Any code that uses `MPI_Allgather` seems to be affected. Here is the minimal code that triggers the problem:

```c
/**
 * @author RookieHPC
 * @brief Original source code at https://rookiehpc.org/mpi/docs/mpi_allgather/index.html
 **/

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);

    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if(size != 3)
    {
        printf("This application is meant to be run with 3 MPI processes.\n");
        MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
    }

    int my_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    int my_value = my_rank * 100;
    printf("Process %d, my value = %d.\n", my_rank, my_value);

    int buffer[3];
    MPI_Allgather(&my_value, 1, MPI_INT, buffer, 1, MPI_INT, MPI_COMM_WORLD);
    printf("Values collected on process %d: %d, %d, %d.\n", my_rank, buffer[0], buffer[1], buffer[2]);

    MPI_Finalize();

    return EXIT_SUCCESS;
}
```

Here is the output if I run it:
And here is the output if I exclude smcuda:
Sorry if this adds nothing new to the discussion; it was just something I wanted to report.
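(The actual commands and outputs from this report were not captured above. For reference, a sketch of how such a reproducer is typically built and run; the binary name `allgather` is borrowed from the next comment, and the `^smcuda` exclusion is the workaround discussed earlier in the thread.)

```
$ mpicc allgather.c -o allgather
$ mpirun -np 3 ./allgather                    # default btl selection (smcuda picked up)
$ mpirun -np 3 --mca btl ^smcuda ./allgather  # smcuda excluded
```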
I cannot replicate this with OMPI 4.1.x and OMPI 5.x-rc* (from git). Even if I force smcuda, the result is correct.

```
$ mpirun --mca btl_base_verbose 5 -np 3 --mca pml ob1 --mca btl self,vader,smcuda ./allgather
[XXX:3342089] mca: bml: Using smcuda btl for send to [[23095,1],1] on node XXX
[XXX:3342089] mca: bml: Using smcuda btl for send to [[23095,1],2] on node XXX
[XXX:3342089] mca: bml: Using self btl for send to [[23095,1],0] on node XXX
[XXX:3342090] mca: bml: Using smcuda btl for send to [[23095,1],0] on node XXX
[XXX:3342090] mca: bml: Using smcuda btl for send to [[23095,1],2] on node XXX
[XXX:3342090] mca: bml: Using self btl for send to [[23095,1],1] on node XXX
[XXX:3342091] mca: bml: Using smcuda btl for send to [[23095,1],0] on node XXX
[XXX:3342091] mca: bml: Using smcuda btl for send to [[23095,1],1] on node XXX
[XXX:3342091] mca: bml: Using self btl for send to [[23095,1],2] on node XXX
Process 0, my value = 0.
Process 1, my value = 100.
Process 2, my value = 200.
Values collected on process 0: 0, 100, 200.
Values collected on process 1: 0, 100, 200.
Values collected on process 2: 0, 100, 200.
```
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
v4.0.2
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Tested with both building from source myself and installing from the conda-forge channel. In both cases, the build-time flag `--with-cuda` was set to turn on CUDA awareness.
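(A sketch of what such a CUDA-aware build typically looks like; the CUDA location and install prefix below are illustrative and not taken from this report.)

```
$ ./configure --with-cuda=/usr/local/cuda --prefix=$HOME/opt/openmpi-4.0.2
$ make -j && make install
```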
Please describe the system on which you are running
Linux (native) / Linux docker (with CUDA Toolkit and driver installed)
N/A
(single node)
Details of the problem
This is a summary for the original bug report on mpi4py-fft's issue tracker.
I was running the test suite of mpi4py-fft, and I noticed there's an `AssertionError` when testing with 2 processes:

```
# in mpi4py-fft root
$ mpirun -n 2 python tests/test_mpifft.py
```
and with 4 processes all kinds of nonsense started appearing, ending in a segfault:
We realized it's due to the presence of the smcuda btl, which got activated because we had the CUDA driver (libcuda) installed in our test environments, even though none of the code in mpi4py-fft uses the GPU. So, by ejecting smcuda, everything runs just fine:
```
# tested N = 1, 2, 4
$ mpirun -n N --mca btl ^smcuda python tests/test_mpifft.py
```
My questions:

1. Why is smcuda causing a segfault in the `alltoallw()` calls (likely from mpi4py-fft's Pencil code)?
2. Why does `alltoallw()` need smcuda even when we don't use the GPU?
3. Should we keep CUDA awareness turned on by default in conda-forge's Open MPI package?

The 3rd question is the most urgent: from conda-forge's maintenance viewpoint, this means we probably shouldn't turn on CUDA awareness by default in our Open MPI package, otherwise all non-GPU users and downstream packages (like mpi4py-fft) are affected.
ps. I should add that, oddly, mpi4py's test suite runs just fine with any `N` processes and without ejecting smcuda. We were unable to reproduce the `alltoallw` segfault on the mpi4py side.

ps2. For why and how CUDA awareness was turned on in conda-forge's package, see conda-forge/openmpi-feedstock#42 and conda-forge/openmpi-feedstock#54; @jsquyres kindly offered help when we did that.
cc: @dalcinl @mikaem