## Background information
### What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
`main` only
### Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Built from source with CUDA support, with the CUDA-related components built as DSOs:

```shell
./configure ... --with-cuda=/usr/local/cuda --with-cuda-libdir=/usr/local/cuda/lib64/stubs --enable-mca-dso=btl-smcuda,rcache-rgpusm,rcache-gpusm,accelerator-cuda ...
```
### If you are building/installing from a git clone, please copy-n-paste the output from `git submodule status`
```
 9095457 3rd-party/openpmix (v1.1.3-3932-g9095457b)
 4676a3cb8f7eabde919f19bf70b1d211a79c2b6d 3rd-party/prrte (psrvr-v2.0.0rc1-4715-g4676a3cb8f)
 c1cfc910d92af43f8c27807a9a84c9c13f4fbc65 config/oac (heads/main)
```
### Please describe the system on which you are running
- Operating system/version: Amazon Linux 2, RHEL 8/9, Ubuntu
- Computer hardware: EC2 hpc6a.48xlarge
- Network type: EFA
## Details of the problem
When I run an application with a high ranks-per-node count, e.g. `--map-by ppr:96:node`, I get a segfault:
```
[ip-172-31-16-16:73623] *** Process received signal ***
[ip-172-31-16-16:73623] Signal: Segmentation fault (11)
[ip-172-31-16-16:73623] Signal code: Address not mapped (1)
[ip-172-31-16-16:73623] Failing at address: 0x7f7f60fddb3f
[ip-172-31-16-16.us-east-2.compute.internal:73185] PMIX ERROR: PMIX_ERR_UNREACH in file base/ptl_base_connection_hdlr.c at line 396
prterun: pmix_list.c:62: pmix_list_item_destruct: Assertion `0 == item->pmix_list_item_refcount' failed.
```
`dmesg` shows that the segfault comes from CUDA:
```
[79590.726378] cuda00001400006[73833]: segfault at 7f7f60fddb3f ip 00007f7f61fa7407 sp 00007f7f60d23eb0 error 4 in libgcc_s-7-20180712.so.1[7f7f61f99000+15000]
[79590.734804] Code: bb 0c 00 00 00 e9 f2 fe ff ff 40 80 ff 08 75 9d 80 78 01 00 75 97 0f b6 78 02 48 83 c0 02 e9 17 fd ff ff 49 8b 85 98 00 00 00 <80> 38 48 0f 85 67 fe ff ff 48 ba c7 c0 0f 00 00 00 0f 05 48 39 50
```
I did a `git bisect` and identified this change: https://github.com/open-mpi/ompi/pull/11617/files
I added a call after `cuInit`, and when the segfault happens I see only some ranks get past that point, so either:

- `cuInit` panicked, or
- the accelerator component's `dlopen` failed for some reason and execution never reached `cuInit` (I am not sure how this can happen)
Note: I can mitigate the issue by removing `--enable-mca-dso=btl-smcuda,rcache-rgpusm,rcache-gpusm,accelerator-cuda`, so it is likely related to the DSO builds and `dlopen`.