Initialization segfaults with accelerator DSO component #12156
Comments
Limits
|
Tried |
Do you get a corefile(s)? |
I don't always get a core file. Before I enabled debug, I got one coredump.
After turning on debug, I consistently get the assertion error mentioned above.
I turned on verbose and saw this
Then I checked dmesg and invariably saw the CUDA segfault. I think that was the reason for the peer crash, which caused the broken pipe. |
Fixed by #12157 |
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
main only
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Built from source with CUDA support, with the accelerator component built as a DSO
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status
.9095457 3rd-party/openpmix (v1.1.3-3932-g9095457b)
4676a3cb8f7eabde919f19bf70b1d211a79c2b6d 3rd-party/prrte (psrvr-v2.0.0rc1-4715-g4676a3cb8f)
c1cfc910d92af43f8c27807a9a84c9c13f4fbc65 config/oac (heads/main)
Please describe the system on which you are running
Details of the problem
When I run an application with a high rank-per-node count, e.g. --map-by ppr:96:node, I get a segfault.
dmesg shows that it's from CUDA.
I ran git bisect and identified this change: https://github.com/open-mpi/ompi/pull/11617/files
I added a call after cuInit, and when the segfault happens I only see some ranks passed that point, so either cuInit panicked, or cuInit (not sure how this can happen)
Note: I can mitigate the issue by removing --enable-mca-dso=btl-smcuda,rcache-rgpusm,rcache-gpusm,accelerator-cuda. So it is likely related to DSO and dlopen.
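For anyone trying to reproduce this, the "call after cuInit" check described above could look roughly like the following. This is only a sketch, not the code used in this report: format_marker and emit_marker are hypothetical names, and reading the rank from the OMPI_COMM_WORLD_RANK environment variable assumes the launcher exports it to each process.

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: build a one-line marker that identifies the process.
 * Returns what snprintf returns (the length the full line would have). */
int format_marker(char *buf, size_t len, const char *rank, long pid,
                  const char *where)
{
    return snprintf(buf, len, "[rank %s pid %ld] passed %s\n",
                    rank != NULL ? rank : "?", pid, where);
}

/* Hypothetical helper: emit the marker with a single write(2) so the line
 * is not held in a stdio buffer if the process crashes right afterwards. */
void emit_marker(const char *where)
{
    char buf[128];
    /* Assumption: the launcher sets OMPI_COMM_WORLD_RANK for each rank. */
    int n = format_marker(buf, sizeof(buf), getenv("OMPI_COMM_WORLD_RANK"),
                          (long) getpid(), where);
    if (n > 0) {
        /* snprintf truncates to sizeof(buf) - 1 characters if needed. */
        size_t count = (size_t) n < sizeof(buf) ? (size_t) n : sizeof(buf) - 1;
        ssize_t rc = write(STDERR_FILENO, buf, count);
        (void) rc; /* best effort: nothing useful to do on failure */
    }
}
```

Calling emit_marker("cuInit") right after the cuInit call and comparing the ranks that printed a line against the ranks launched per node shows which processes never got past that point.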