## Background information
### What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
`main` only
### Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Built from source with CUDA support, with the CUDA-related components built as DSOs:

```shell
./configure ... --with-cuda=/usr/local/cuda --with-cuda-libdir=/usr/local/cuda/lib64/stubs --enable-mca-dso=btl-smcuda,rcache-rgpusm,rcache-gpusm,accelerator-cuda ...
```
### If you are building/installing from a git clone, please copy-n-paste the output from `git submodule status`
```
 9095457 3rd-party/openpmix (v1.1.3-3932-g9095457b)
 4676a3cb8f7eabde919f19bf70b1d211a79c2b6d 3rd-party/prrte (psrvr-v2.0.0rc1-4715-g4676a3cb8f)
 c1cfc910d92af43f8c27807a9a84c9c13f4fbc65 config/oac (heads/main)
```
### Please describe the system on which you are running
- Operating system/version: Amazon Linux 2, RHEL 8/9, Ubuntu
- Computer hardware: EC2 hpc6a.48xlarge
- Network type: EFA
## Details of the problem
When I run an application with a high ranks-per-node count, e.g. `--map-by ppr:96:node`, I get a segfault:
```
[ip-172-31-16-16:73623] *** Process received signal ***
[ip-172-31-16-16:73623] Signal: Segmentation fault (11)
[ip-172-31-16-16:73623] Signal code: Address not mapped (1)
[ip-172-31-16-16:73623] Failing at address: 0x7f7f60fddb3f
[ip-172-31-16-16.us-east-2.compute.internal:73185] PMIX ERROR: PMIX_ERR_UNREACH in file base/ptl_base_connection_hdlr.c at line 396
prterun: pmix_list.c:62: pmix_list_item_destruct: Assertion `0 == item->pmix_list_item_refcount' failed.
```
`dmesg` shows that the segfault comes from CUDA:
```
[79590.726378] cuda00001400006[73833]: segfault at 7f7f60fddb3f ip 00007f7f61fa7407 sp 00007f7f60d23eb0 error 4 in libgcc_s-7-20180712.so.1[7f7f61f99000+15000]
[79590.734804] Code: bb 0c 00 00 00 e9 f2 fe ff ff 40 80 ff 08 75 9d 80 78 01 00 75 97 0f b6 78 02 48 83 c0 02 e9 17 fd ff ff 49 8b 85 98 00 00 00 <80> 38 48 0f 85 67 fe ff ff 48 ba c7 c0 0f 00 00 00 0f 05 48 39 50
```
I did a `git bisect` and identified this change: https://github.com/open-mpi/ompi/pull/11617/files
I added a call after `cuInit`, and when the segfault happens I see only some ranks get past that point, so either:

- `cuInit` panicked, or
- the accelerator component's `dlopen` failed for some reason and execution never reached `cuInit` (I am not sure how this can happen)
Note: I can mitigate the issue by removing `--enable-mca-dso=btl-smcuda,rcache-rgpusm,rcache-gpusm,accelerator-cuda`, so it is likely related to the DSO builds and `dlopen`.