Initialization segfaults with accelerator DSO component #12156


Closed
wenduwan opened this issue Dec 8, 2023 · 5 comments
@wenduwan
Contributor

wenduwan commented Dec 8, 2023

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

main branch only

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Built from source with CUDA support, with the listed components built as DSOs:

./configure ... --with-cuda=/usr/local/cuda --with-cuda-libdir=/usr/local/cuda/lib64/stubs --enable-mca-dso=btl-smcuda,rcache-rgpusm,rcache-gpusm,accelerator-cuda ...

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

9095457 3rd-party/openpmix (v1.1.3-3932-g9095457b)
4676a3cb8f7eabde919f19bf70b1d211a79c2b6d 3rd-party/prrte (psrvr-v2.0.0rc1-4715-g4676a3cb8f)
c1cfc910d92af43f8c27807a9a84c9c13f4fbc65 config/oac (heads/main)

Please describe the system on which you are running

  • Operating system/version: Amazon Linux 2, RHEL 8/9, Ubuntu
  • Computer hardware: EC2 hpc6a.48xlarge
  • Network type: EFA

Details of the problem

When I run an application with a high rank-per-node count, e.g. --map-by ppr:96:node, I get a segfault:

[ip-172-31-16-16:73623] *** Process received signal ***
[ip-172-31-16-16:73623] Signal: Segmentation fault (11)
[ip-172-31-16-16:73623] Signal code: Address not mapped (1)
[ip-172-31-16-16:73623] Failing at address: 0x7f7f60fddb3f
[ip-172-31-16-16.us-east-2.compute.internal:73185] PMIX ERROR: PMIX_ERR_UNREACH in file base/ptl_base_connection_hdlr.c at line 396
prterun: pmix_list.c:62: pmix_list_item_destruct: Assertion `0 == item->pmix_list_item_refcount' failed.

dmesg shows that the segfault comes from CUDA:

[79590.726378] cuda00001400006[73833]: segfault at 7f7f60fddb3f ip 00007f7f61fa7407 sp 00007f7f60d23eb0 error 4 in libgcc_s-7-20180712.so.1[7f7f61f99000+15000]
[79590.734804] Code: bb 0c 00 00 00 e9 f2 fe ff ff 40 80 ff 08 75 9d 80 78 01 00 75 97 0f b6 78 02 48 83 c0 02 e9 17 fd ff ff 49 8b 85 98 00 00 00 <80> 38 48 0f 85 67 fe ff ff 48 ba c7 c0 0f 00 00 00 0f 05 48 39 50

I ran git bisect and identified this change: https://github.com/open-mpi/ompi/pull/11617/files

I added a log statement after cuInit, and when the segfault happens only some ranks get past that point, so either

  1. cuInit crashed, or
  2. dlopen of the accelerator component failed for some reason and cuInit was never reached (not sure how this could happen)

Note: I can mitigate the issue by removing --enable-mca-dso=btl-smcuda,rcache-rgpusm,rcache-gpusm,accelerator-cuda, so it is likely related to the DSO build and dlopen.

@wenduwan wenduwan added the bug label Dec 8, 2023
@wenduwan wenduwan self-assigned this Dec 8, 2023
@wenduwan
Contributor Author

wenduwan commented Dec 8, 2023

Limits

(env) -bash-4.2$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 30446
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 8192
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

@wenduwan
Contributor Author

wenduwan commented Dec 8, 2023

Tried ulimit -n unlimited but I can still reproduce the segfault in roughly 1 out of 5 attempts.

@hppritcha
Member

Do you get a corefile(s)?

@wenduwan
Contributor Author

wenduwan commented Dec 8, 2023

Do you get a corefile(s)?

I don't always get a core file.

Before enabling debug, I got one coredump:

#1  0x00007fcf07a91226 in abort () from /lib64/libc.so.6
#2  0x00007fcf07a88a4a in __assert_fail_base () from /lib64/libc.so.6
#3  0x00007fcf07a88ac2 in __assert_fail () from /lib64/libc.so.6
#4  0x00007fcf083c7332 in pmix_list_item_destruct (item=0x7fcf00008060) at pmix_list.c:62
#5  0x00007fcf0850fa65 in pmix_obj_run_destructors (object=0x7fcf00008060)
    at ompi/3rd-party/openpmix/src/class/pmix_object.h:680
#6  0x00007fcf08514280 in pdes (p=0x7fcf00123a10) at pmix_globals.c:290
#7  0x00007fcf0850618d in pmix_obj_run_destructors (object=0x7fcf00123a10)
    at ompi/3rd-party/openpmix/src/class/pmix_object.h:680
#8  0x00007fcf0850939c in pmix_ptl_base_connection_handler (sd=-1, args=4, cbdata=0x7fcf00105420) at base/ptl_base_connection_hdlr.c:456
#9  0x00007fcf080333ad in event_base_loop () from /lib64/libevent_core-2.0.so.5
#10 0x00007fcf08365524 in progress_engine (obj=0x2174950) at runtime/pmix_progress_threads.c:230
#11 0x00007fcf07e1044b in start_thread () from /lib64/libpthread.so.0
#12 0x00007fcf07b4b52f in clone () from /lib64/libc.so.6

After turning on debug, I consistently get the assertion error mentioned above:

[ip-172-31-16-16:73623] *** Process received signal ***
[ip-172-31-16-16:73623] Signal: Segmentation fault (11)
[ip-172-31-16-16:73623] Signal code: Address not mapped (1)
[ip-172-31-16-16:73623] Failing at address: 0x7f7f60fddb3f
[ip-172-31-16-16.us-east-2.compute.internal:73185] PMIX ERROR: PMIX_ERR_UNREACH in file base/ptl_base_connection_hdlr.c at line 396
prterun: pmix_list.c:62: pmix_list_item_destruct: Assertion `0 == item->pmix_list_item_refcount' failed.

I turned on verbose output and saw this:

[ip-172-31-16-27.us-east-2.compute.internal:67231] send blocking of 4 bytes to socket 318
[ip-172-31-16-27.us-east-2.compute.internal:67231] ptl:base:peer_send_blocking: send() to socket 318 failed: Broken pipe (32)      <------- oooops
[ip-172-31-16-27.us-east-2.compute.internal:67231] PMIX ERROR: PMIX_ERR_UNREACH in file base/ptl_base_connection_hdlr.c at line 396
prterun: pmix_list.c:62: pmix_list_item_destruct: Assertion `0 == item->pmix_list_item_refcount' failed.

Then I checked dmesg and invariably saw the CUDA segfault. I think that explains the peer crash, which in turn caused the broken pipe.

@wenduwan
Contributor Author

Fixed by #12157
