
Find libcuda.so automatically if --with-cuda-lib is not passed #12378

Merged

@nsarka nsarka commented Feb 26, 2024

In newer OpenMPI versions, it's required to pass both --with-cuda and --with-cuda-libdir in order for CUDA to be recognized by the build system. Ideally, just --with-cuda should be enough and the build system should be able to detect where the CUDA libraries are.

This PR removes the requirement that --with-cuda-libdir be passed and instead makes the build system search for libcuda.so inside the directory passed with --with-cuda.

See issue: #12264
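A minimal sketch of the search described above, not the code from this PR: it walks a CUDA prefix looking for libcuda.so and prints the directory that contains it. The helper name `find_cuda_libdir` and the throwaway directory tree are hypothetical and exist only for illustration; the real logic lives in Open MPI's configure machinery.

```shell
#!/bin/sh
# Hypothetical helper (not the actual Open MPI code): print the directory
# that contains libcuda.so under the given prefix, or nothing if none exists.
find_cuda_libdir() {
    prefix="$1"
    # Take the first match; stderr is silenced so a missing or unreadable
    # prefix does not produce noise.
    lib=$(find "$prefix" -name 'libcuda.so' 2>/dev/null | head -n 1)
    if [ -n "$lib" ]; then
        dirname "$lib"
    fi
}

# Demo against a throwaway tree that mimics the layout of a CUDA install.
demo_prefix=$(mktemp -d)
mkdir -p "$demo_prefix/targets/x86_64-linux/lib/stubs"
: > "$demo_prefix/targets/x86_64-linux/lib/stubs/libcuda.so"

found_libdir=$(find_cuda_libdir "$demo_prefix")
echo "libcuda.so found in: $found_libdir"
```

Checking that the match is non-empty before calling dirname means a prefix without libcuda.so simply yields an empty result.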


Hello! The Git Commit Checker CI bot found a few problems with this PR:

5bcaba0: Find libcuda.so automatically if --with-cuda-lib i...

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@janjust janjust requested a review from jsquyres February 26, 2024 19:56
@nsarka nsarka force-pushed the nsarka/cuda-aware-ompi-find-libs-automatically branch 2 times, most recently from f718881 to c8e885d Compare February 26, 2024 20:33
@janjust janjust requested a review from hppritcha February 26, 2024 20:53
@hppritcha
Member

This works for me on NERSC Perlmutter. I load the cudatoolkit module, which sets CUDA_HOME=/opt/nvidia/hpc_sdk/Linux_x86_64/23.9/cuda/12.2. In my configure line I have

./configure blah blah --with-cuda=$CUDA_HOME

Then, when doing the libtool link, I think it's linking in libcuda.so at /opt/nvidia/hpc_sdk/Linux_x86_64/23.9/cuda/12.2/targets/x86_64-linux/lib/stubs/libcuda.so,
but at runtime, because LD_LIBRARY_PATH is set to prepend another location for libcuda.so.1, I get

hpp@login19:~/ompi/install_pr12378/lib/openmpi> (nsarka/cuda-aware-ompi-find-libs-automatically)ldd mca_accelerator_cuda.so 
	linux-vdso.so.1 (0x00007ffcc2304000)
	libopen-pal.so.0 => /global/homes/h/hpp/ompi/install_pr12378/lib/libopen-pal.so.0 (0x00007fc4aeff6000)
	libfabric.so.1 => /opt/cray/libfabric/1.15.2.0/lib64/libfabric.so.1 (0x00007fc4aec00000)
	libxpmem.so.0 => /opt/cray/xpmem/default/lib64/libxpmem.so.0 (0x00007fc4aebd8000)
	libcuda.so.1 => /usr/local/cuda-12.2/compat/libcuda.so.1 (0x00007fc4acf6b000)

So this patch does what we'd like on Perlmutter, but the cases where just --with-cuda (with no directory) or --without-cuda is passed need to be handled before I can approve this PR.


Hello! The Git Commit Checker CI bot found a few problems with this PR:

cfec4a9: Update to check /usr/local/cuda in case with_cuda ...

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@nsarka nsarka force-pushed the nsarka/cuda-aware-ompi-find-libs-automatically branch 2 times, most recently from af62cc6 to 5204853 Compare February 27, 2024 17:28
@nsarka
Author

nsarka commented Feb 27, 2024

Hi @hppritcha, I have updated the PR to handle the case where $with_cuda is not a directory.
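As a rough illustration of that guard (a sketch under assumptions, not the merged code): Autoconf sets `with_cuda` to `yes` for a bare --with-cuda, to `no` for --without-cuda, and to the given path otherwise. The helper `cuda_search_root` is hypothetical, and the choice to fall back to /usr/local/cuda for the bare `yes` case is an assumption based only on the wording of the commit title quoted earlier in this thread.

```shell
#!/bin/sh
# Hypothetical helper: print the directory that should be searched for
# libcuda.so, or nothing when CUDA is disabled or no usable directory exists.
#   $1: value of with_cuda ("yes", "no", empty, or a path)
#   $2: fallback root, defaulting to /usr/local/cuda (assumption)
cuda_search_root() {
    with_cuda="$1"
    default_root="${2:-/usr/local/cuda}"
    case "$with_cuda" in
        no|"")
            # --without-cuda, or option not given: nothing to search.
            ;;
        yes)
            # Bare --with-cuda: no path was supplied, try the fallback.
            if [ -d "$default_root" ]; then echo "$default_root"; fi
            ;;
        *)
            # A path was supplied: only use it if it really is a directory.
            if [ -d "$with_cuda" ]; then echo "$with_cuda"; fi
            ;;
    esac
    return 0
}

# Demo with a temporary directory standing in for a CUDA install.
real_dir=$(mktemp -d)
root_for_path=$(cuda_search_root "$real_dir")
root_for_no=$(cuda_search_root no)
root_for_yes=$(cuda_search_root yes "$real_dir")
root_for_bogus=$(cuda_search_root "$real_dir/missing")
echo "path='$root_for_path' no='$root_for_no' yes='$root_for_yes' bogus='$root_for_bogus'"
```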

@janjust janjust requested a review from hppritcha February 27, 2024 17:37
Finding CUDA libraries without having to specify both --with-cuda and
--with-cuda-lib was requested in github issue
open-mpi#12264

Signed-off-by: Nick Sarkauskas <[email protected]>
@nsarka nsarka force-pushed the nsarka/cuda-aware-ompi-find-libs-automatically branch from 5204853 to cad3d9a Compare February 27, 2024 18:16
@nsarka
Author

nsarka commented Feb 27, 2024

Ahh, I did not see the approval before force pushing again. The reason I force pushed was to add redirection to /dev/null in case that dirname command couldn't find the directory. It's not helpful output, just "dirname: missing argument".
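The effect of that redirection can be sketched as follows. This is an illustration only; the helper name `quiet_libdir` and the temporary directories are hypothetical. When the search matches nothing, dirname is invoked without an operand and complains on stderr, and `2>/dev/null` hides that message while leaving the result empty.

```shell
#!/bin/sh
# Hypothetical helper: print the directory holding libcuda.so under $1.
# The command substitution is left unquoted on purpose so that an empty
# search result passes no operand to dirname (paths with spaces would
# break this sketch). dirname's error message is sent to /dev/null.
quiet_libdir() {
    dirname $(find "$1" -name 'libcuda.so' 2>/dev/null | head -n 1) 2>/dev/null
}

empty_dir=$(mktemp -d)            # contains no libcuda.so
cuda_dir=$(mktemp -d)             # stands in for a CUDA install
mkdir -p "$cuda_dir/lib64"
: > "$cuda_dir/lib64/libcuda.so"

# dirname exits non-zero when it has no operand; treat that as "not found".
missing_libdir=$(quiet_libdir "$empty_dir") || missing_libdir=""
present_libdir=$(quiet_libdir "$cuda_dir")

echo "missing: '$missing_libdir'"
echo "present: '$present_libdir'"
```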

@hppritcha
Member

> Ahh, I did not see the approval before force pushing again. The reason I force pushed was to add redirection to /dev/null in case that dirname command couldn't find the directory. It's not helpful output, just "dirname: missing argument".

this tweak looks good to me.

@janjust janjust merged commit a166ad7 into open-mpi:main Feb 28, 2024