Initialization segfaults with accelerator DSO component #12156

Closed
@wenduwan

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

main only

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Built from source with CUDA support, with the accelerator component built as a DSO:

./configure ... --with-cuda=/usr/local/cuda --with-cuda-libdir=/usr/local/cuda/lib64/stubs --enable-mca-dso=btl-smcuda,rcache-rgpusm,rcache-gpusm,accelerator-cuda ...

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

9095457 3rd-party/openpmix (v1.1.3-3932-g9095457b)
4676a3cb8f7eabde919f19bf70b1d211a79c2b6d 3rd-party/prrte (psrvr-v2.0.0rc1-4715-g4676a3cb8f)
c1cfc910d92af43f8c27807a9a84c9c13f4fbc65 config/oac (heads/main)

Please describe the system on which you are running

  • Operating system/version: Amazon Linux 2, RHEL 8/9, Ubuntu
  • Computer hardware: EC2 hpc6a.48xlarge
  • Network type: EFA

Details of the problem

When I run an application with a high rank count per node, e.g. --map-by ppr:96:node, I get a segfault:

[ip-172-31-16-16:73623] *** Process received signal ***
[ip-172-31-16-16:73623] Signal: Segmentation fault (11)
[ip-172-31-16-16:73623] Signal code: Address not mapped (1)
[ip-172-31-16-16:73623] Failing at address: 0x7f7f60fddb3f
[ip-172-31-16-16.us-east-2.compute.internal:73185] PMIX ERROR: PMIX_ERR_UNREACH in file base/ptl_base_connection_hdlr.c at line 396
prterun: pmix_list.c:62: pmix_list_item_destruct: Assertion `0 == item->pmix_list_item_refcount' failed.

dmesg shows that the fault comes from a CUDA thread (cuda00001400006), inside libgcc_s:

[79590.726378] cuda00001400006[73833]: segfault at 7f7f60fddb3f ip 00007f7f61fa7407 sp 00007f7f60d23eb0 error 4 in libgcc_s-7-20180712.so.1[7f7f61f99000+15000]
[79590.734804] Code: bb 0c 00 00 00 e9 f2 fe ff ff 40 80 ff 08 75 9d 80 78 01 00 75 97 0f b6 78 02 48 83 c0 02 e9 17 fd ff ff 49 8b 85 98 00 00 00 <80> 38 48 0f 85 67 fe ff ff 48 ba c7 c0 0f 00 00 00 0f 05 48 39 50
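
For readers who want to map the faulting instruction back to a symbol, the dmesg line above can be decoded with a little arithmetic. The following is a minimal sketch, not part of the original report: the helper names `offset_in_mapping` and `describe_error_code` are hypothetical, the constants are copied from the dmesg line, and the error-code decoding assumes the standard x86 page-fault bit layout (bit 0 = protection violation, bit 1 = write, bit 2 = user mode).

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Values copied from the dmesg line in this report. */
#define FAULT_IP  UINT64_C(0x00007f7f61fa7407) /* "ip" field               */
#define MAP_BASE  UINT64_C(0x7f7f61f99000)     /* libgcc_s mapping base    */
#define MAP_SIZE  UINT64_C(0x15000)            /* libgcc_s mapping size    */
#define FAULT_ERR 4u                           /* "error" field            */

/* Hypothetical helper: returns 1 and stores the offset of ip within the
 * mapping if ip lies inside [base, base + size), otherwise returns 0. */
static int offset_in_mapping(uint64_t ip, uint64_t base, uint64_t size,
                             uint64_t *off)
{
    if (ip < base || ip - base >= size) {
        return 0;
    }
    *off = ip - base;
    return 1;
}

/* Hypothetical helper: renders an x86 page-fault error code as text. */
static void describe_error_code(unsigned err, char *buf, size_t len)
{
    snprintf(buf, len, "%s-mode %s, %s",
             (err & 4u) ? "user" : "kernel",
             (err & 2u) ? "write" : "read",
             (err & 1u) ? "protection violation" : "page not present");
}
```

With the values above, the offset works out to 0xe407 and the error code reads as a user-mode read of a page that is not present, which agrees with the "Address not mapped" signal code. If debug symbols for that exact libgcc_s build are available, the offset could be passed to a tool such as addr2line to name the function.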

I ran git bisect, which pointed to this change: https://github.com/open-mpi/ompi/pull/11617/files

I added a call after cuInit, and when the segfault happens only some ranks get past that point, so either

  1. cuInit panicked, or
  2. The accelerator component dlopen failed for some reason and never reached cuInit (not sure how this can happen)
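
One way to tell these two cases apart outside of Open MPI is a small standalone probe that reports, per process, how far it got: dlopen, symbol lookup, or the init call itself. The sketch below is illustrative only and is not how the accelerator component loads CUDA. The helper `probe_init` and its stage enum are hypothetical, the library name `libcuda.so.1` in the usage note is an assumption about the driver's soname on the test system, and the function-pointer type assumes the driver API prototype `CUresult cuInit(unsigned int Flags)` with success reported as 0.

```c
#include <dlfcn.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical stage codes: which step of the load sequence failed. */
typedef enum {
    PROBE_OK = 0,
    PROBE_DLOPEN_FAILED,
    PROBE_DLSYM_FAILED,
    PROBE_INIT_FAILED
} probe_stage_t;

/* Assumed shape of cuInit: takes flags, returns 0 on success. */
typedef int (*init_fn_t)(unsigned int);

/* Loads lib (NULL means the running program), looks up sym, and, if
 * call_it is nonzero, calls it as init(0) and stores the result in *rc.
 * Each step is logged to stderr with the PID so ranks can be compared. */
static probe_stage_t probe_init(const char *lib, const char *sym,
                                int call_it, int *rc)
{
    const char *name = lib ? lib : "(main program)";
    const char *err;

    void *handle = dlopen(lib, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL) {
        err = dlerror();
        fprintf(stderr, "[%ld] dlopen(%s) failed: %s\n",
                (long)getpid(), name, err ? err : "unknown error");
        return PROBE_DLOPEN_FAILED;
    }

    dlerror(); /* clear any stale error before dlsym */
    void *addr = dlsym(handle, sym);
    if (addr == NULL) {
        err = dlerror();
        fprintf(stderr, "[%ld] dlsym(%s) failed: %s\n",
                (long)getpid(), sym, err ? err : "symbol resolved to NULL");
        dlclose(handle);
        return PROBE_DLSYM_FAILED;
    }

    if (!call_it) {
        dlclose(handle);
        return PROBE_OK;
    }

    /* Logged before the call: if the process dies inside init, this is
     * the last line it prints. */
    fprintf(stderr, "[%ld] calling %s\n", (long)getpid(), sym);
    init_fn_t init = (init_fn_t)addr;
    *rc = init(0);
    if (*rc != 0) {
        fprintf(stderr, "[%ld] %s returned %d\n", (long)getpid(), sym, *rc);
        dlclose(handle);
        return PROBE_INIT_FAILED;
    }

    /* The handle is left open on success: the library stays in use. */
    fprintf(stderr, "[%ld] %s succeeded\n", (long)getpid(), sym);
    return PROBE_OK;
}
```

A test program launched with the same --map-by ppr:96:node mapping could call `probe_init("libcuda.so.1", "cuInit", 1, &rc)` at startup. Ranks that print "calling cuInit" but nothing afterwards died inside the call, whereas a dlopen failure would show up as an explicit error line. Depending on the glibc version, linking may require -ldl.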

Note: I can mitigate the issue by removing --enable-mca-dso=btl-smcuda,rcache-rgpusm,rcache-gpusm,accelerator-cuda. So it is likely related to DSO and dlopen.
