Note that I added coll-cuda to the list of mca-dsos. I'm not sure if it is intentionally missing in the documentation. I also tried without coll-cuda first, but with the same outcome.
CUDA Toolkit version 12.3 was installed in CUDA_ROOT. UCX was built against that CUDA toolkit. On cluster nodes with the drivers installed, ucx_info -d reports the relevant CUDA and gdrcopy transports.
Remark: The host used for compilation has the CUDA toolkit and runtime installed, but not the driver. So using stubs appears to be the way to go in that case (see #12264)
Please describe the system on which you are running
Operating system/version: Rocky Linux 8.8
Computer hardware: Intel Xeon
Network type: InfiniBand
Details of the problem
With Open MPI 4.1.4, I was able to build it such that one could compile and run binaries without having the CUDA toolkit, runtime, and drivers available on the node in use. However, with 5.0.1 configured as shown above, the linker warns about a missing libcudart when building a binary (even a basic MPI_Init/MPI_Finalize program):
$ mpicc -show hw.c -o hw
gcc hw.c -o hw -I/path/to/openmpi/include -pthread -L/path/to/openmpi/lib -Wl,-rpath -Wl,/path/to/openmpi/lib -Wl,--enable-new-dtags -lmpi
$ mpicc hw.c -o hw
/usr/bin/ld: warning: libcudart.so.12, needed by /path/to/openmpi/lib/libmpi.so, not found (try using -rpath or -rpath-link)
$ mpirun -n1 ./hw
./hw: error while loading shared libraries: libcudart.so.12: cannot open shared object file: No such file or directory
$ ldd hw
linux-vdso.so.1 (0x00007ffc747da000)
libmpi.so.40 => /path/to/openmpi/lib/libmpi.so.40 (0x000014ae23df9000)
[...]
libcudart.so.12 => not found
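For anyone who wants to script this check across several binaries or nodes, here is a minimal sketch (not part of the original report) that scans ldd output for unresolved libraries. The function name missing_libraries is hypothetical, and the sample text is an abridged copy of the output quoted above.

```python
def missing_libraries(ldd_output):
    """Return the sonames that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        # Unresolved entries look like: "libcudart.so.12 => not found"
        name, sep, target = line.strip().partition(" => ")
        if sep and target.strip() == "not found":
            missing.append(name)
    return missing


# Sample abridged from the ldd output quoted in this report.
sample = """\
linux-vdso.so.1 (0x00007ffc747da000)
libmpi.so.40 => /path/to/openmpi/lib/libmpi.so.40 (0x000014ae23df9000)
libcudart.so.12 => not found
"""

print(missing_libraries(sample))
```

To run it against a real binary, the text could be obtained with subprocess.run(["ldd", "./hw"], capture_output=True, text=True).stdout, assuming ldd is available on the node.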
With 4.1.4 I am able to compile and launch without those warnings/errors while still having a CUDA-aware MPI. With 4.1.4, libmpi did not depend on libcudart, even though it was also configured using --with-cuda=...
If I understood the SC'23 BoF slides correctly, Open MPI 5.x intends to integrate (link?) plugins directly into libmpi. With the --enable-mca-dso configure option, however, I tried to put all CUDA-related components into DSOs and thus keep them out of libmpi. Nevertheless, libmpi has libcudart as a shared library dependency (see above). I also checked the symbols that libmpi needs, and it does not appear to require anything from libcudart:
$ nm -D /path/to/openmpi/lib/libmpi.so.40 | grep -i cuda
000000000029cdb0 T mca_pml_ob1_rdma_cuda_btls
00000000002c7e20 T MPIX_Query_cuda_support
U opal_built_with_cuda_support
U opal_cuda_support
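The reasoning behind this check can be sketched as follows (an illustration, not part of the original report): list the undefined ('U') symbols in the nm output and see whether any of them look like CUDA runtime API calls. The helper name undefined_symbols is hypothetical, and the "starts with cuda" test is only a heuristic based on runtime API names such as cudaMalloc. The two undefined opal_ symbols are presumably resolved by Open MPI's own OPAL library rather than by libcudart.

```python
def undefined_symbols(nm_output, needle="cuda"):
    """Return undefined ('U') symbols whose name contains `needle` (case-insensitive)."""
    result = []
    for line in nm_output.splitlines():
        fields = line.split()
        # Undefined symbols carry no address ("U name");
        # defined ones have three fields ("address T name").
        if len(fields) == 2 and fields[0] == "U" and needle in fields[1].lower():
            result.append(fields[1])
    return result


# Sample copied from the nm output quoted in this report.
nm_sample = """\
000000000029cdb0 T mca_pml_ob1_rdma_cuda_btls
00000000002c7e20 T MPIX_Query_cuda_support
                 U opal_built_with_cuda_support
                 U opal_cuda_support
"""

unresolved = undefined_symbols(nm_sample)
print(unresolved)

# Heuristic (an assumption): CUDA runtime API functions start with "cuda".
runtime_like = [name for name in unresolved if name.startswith("cuda")]
print(runtime_like)
```

With the sample above, the second list comes out empty, which matches the observation that libmpi itself does not seem to reference libcudart symbols. Note that this only inspects names containing "cuda", exactly like the grep in the report.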
So it appears to me that libmpi unnecessarily depends on libcudart. Is there a bug in the configure/compilation process, or is it no longer possible to build the Open MPI libraries such that one can compile applications without the CUDA runtime libraries being available? Given libmpi's dependency on libcudart, the statement from the documentation
Open MPI supports building with CUDA libraries and running on systems without CUDA libraries or hardware.
does not appear to apply here. Or is there something wrong on my side?
Btw: The test program from the documentation may also need a call to MPI_Init when one follows the DSO approach. Otherwise, it reports that there is no CUDA support (using OMPI v5.0.1 with CUDA toolkit 12.3 available for compilation/execution):
$ ./check # with MPI_Init
Compile time check:
This MPI library has CUDA-aware support.
Run time check:
This MPI library has CUDA-aware support.
$ ./check-no-init # without MPI_Init
Compile time check:
This MPI library has CUDA-aware support.
Run time check:
This MPI library does not have CUDA-aware support.
@janjust Thanks for your input. I can confirm that adding io-romio341 to the list of MCA DSOs removes the dependency on libcudart from libmpi.
I'm not sure how obvious this is to others, so I suggest adding the full list (see above?!) to the documentation.
Besides that, with Open MPI built like that, the check code falsely reports that CUDA support is missing when MPI_Init is not called, even on a node with the CUDA runtime/driver installed. With the initial MPI_Init call added, everything works as expected:
non-gpu-node $ ./check-with-init
Compile time check:
This MPI library has CUDA-aware support.
[non-gpu-node:2426494] mca_base_component_repository_open: unable to open mca_accelerator_cuda: libcuda.so.1: cannot open shared object file: No such file or directory (ignored)
[non-gpu-node:2426494] mca_base_component_repository_open: unable to open mca_rcache_rgpusm: libcuda.so.1: cannot open shared object file: No such file or directory (ignored)
[non-gpu-node:2426494] mca_base_component_repository_open: unable to open mca_rcache_gpusm: libcuda.so.1: cannot open shared object file: No such file or directory (ignored)
[non-gpu-node:2426494] mca_base_component_repository_open: unable to open mca_btl_smcuda: libcuda.so.1: cannot open shared object file: No such file or directory (ignored)
Run time check:
This MPI library does not have CUDA-aware support.
gpu-node $ ./check-with-init
Compile time check:
This MPI library has CUDA-aware support.
Run time check:
This MPI library has CUDA-aware support.
gpu-node $ ./check-no-init
Compile time check:
This MPI library has CUDA-aware support.
Run time check:
This MPI library does not have CUDA-aware support.
I agree. In the meantime, I'll make a feature request issue out of this.
janjust changed the title from "Linking with CUDA-enabled v5.0.1 always warns about libcudart.so.12 needed by libmpi.so / required by binaries" to "Expand CUDA support documentation to account for all cuda dependent components." on Feb 7, 2024
janjust changed the title from "Expand CUDA support documentation to account for all cuda dependent components." to "Expand CUDA support and fix documentation to account for all cuda dependent components." on Feb 7, 2024
Background information
What version of Open MPI are you using?
v5.0.1
Describe how Open MPI was installed
Open MPI was installed from Github release tarball. Configuration was done using this command line: