v5.0.x OSU microbenchmarks CUDA memory segfault #12825

Closed · wenduwan opened this issue Sep 24, 2024 · 2 comments

@wenduwan (Contributor)

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

v5.0.x head

$ git log --oneline -10
75795c04eb (HEAD -> v5.0.x, origin/v5.0.x) Merge pull request #12821 from Sergei-Lebedev/topic/coll_ucc_fix_buf_size_overflow_v5
a2868acd84 coll/ucc: fix int overflow in coll init
6f08eaf910 Merge pull request #12781 from janjust/v5.0.x
6f91498f59 Merge pull request #12809 from edgargabriel/pr/vulcan-aggr-list-leak-v5.0.x
ff740b4256 fcoll/vulcan: fix memory leak
d380ab6971 Merge pull request #12798 from wenduwan/fix_ipv6
ce3b892360 3rd-party/openpmix: include ipv6 fix
3968cab0fe Merge pull request #12800 from wenduwan/test_mpi4py
b4c98c9487 .github/workflow: set up runtime params right before mpi4py test
3bec944cf0 Merge pull request #12789 from jsquyres/pr/v5.0.x/gcc-14-complier-warning-fixes

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Source build

./configure --with-sge --without-verbs --disable-man-pages --enable-ipv6 LDFLAGS=-Wl,--as-needed --enable-prte-prefix-by-default --enable-mca-dso=all --with-libevent=external --with-hwloc=external --with-cuda=/usr/local/cuda --with-cuda-libdir=/usr/local/cuda/lib64/stubs --enable-debug

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

$ git submodule status
 e62fa4252f0cadda29c4103e01b0e277e8180d3e 3rd-party/openpmix (v5.0.3-17-ge62fa425)
 b68a0acb32cfc0d3c19249e5514820555bcf438b 3rd-party/prrte (v3.0.6)
 dfff67569fb72dbf8d73a1dcf74d091dad93f71b config/oac (dfff675)

Please describe the system on which you are running

  • Operating system/version: Amazon Linux 2
  • Computer hardware: AWS EC2 p4d.24xlarge
$ nvidia-smi
Tue Sep 24 17:56:49 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.12              Driver Version: 550.90.12      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  |   00000000:10:1C.0 Off |                    0 |
| N/A   45C    P0             60W /  400W |       1MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          On  |   00000000:10:1D.0 Off |                    0 |
| N/A   41C    P0             57W /  400W |       1MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100-SXM4-40GB          On  |   00000000:20:1C.0 Off |                    0 |
| N/A   44C    P0             59W /  400W |       1MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A100-SXM4-40GB          On  |   00000000:20:1D.0 Off |                    0 |
| N/A   39C    P0             55W /  400W |       1MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA A100-SXM4-40GB          On  |   00000000:90:1C.0 Off |                    0 |
| N/A   42C    P0             55W /  400W |       1MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA A100-SXM4-40GB          On  |   00000000:90:1D.0 Off |                    0 |
| N/A   41C    P0             58W /  400W |       1MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA A100-SXM4-40GB          On  |   00000000:A0:1C.0 Off |                    0 |
| N/A   46C    P0             62W /  400W |       1MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA A100-SXM4-40GB          On  |   00000000:A0:1D.0 Off |                    0 |
| N/A   40C    P0             63W /  400W |       1MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
  • Network type: BTL/SM

Details of the problem

We are seeing segfaults with this commit: https://github.com/open-mpi/ompi/pull/12781/files#diff-750d0e8be09c5f4ee5f703b8ba2c735a3e1b8b807162936e55530ec721ec5b86

mpirun --wdir . -n 2 --mca pml ob1 openmpi-v5.0.6a1-v5.0.x-debug/install/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bibw  -d cuda D D
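
As a point of reference, a hypothetical minimal reproducer along the same lines (not taken from this report; the file name and build line are assumptions) would send a single byte from a cudaMalloc'd buffer over ob1, which exercises the same opal_convertor_pack() path seen in the backtrace below:

/* Hypothetical reproducer sketch: one byte of CUDA device memory sent over ob1.
 * Build: mpicc cuda_send.c -o cuda_send -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lcudart
 * Run:   mpirun -n 2 --mca pml ob1 ./cuda_send */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    char *dbuf = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* The message buffer lives in CUDA device memory. */
    if (cudaMalloc((void **) &dbuf, 1) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    if (0 == rank) {
        /* Drives opal_convertor_pack() on the device buffer. */
        MPI_Send(dbuf, 1, MPI_CHAR, 1, 100, MPI_COMM_WORLD);
    } else if (1 == rank) {
        MPI_Recv(dbuf, 1, MPI_CHAR, 0, 100, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    cudaFree(dbuf);
    MPI_Finalize();
    return 0;
}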

The backtrace is:

(gdb) bt
#0  0x00007fd46edddbe8 in __memmove_avx_unaligned_erms () from /lib64/libc.so.6
#1  0x00007fd46e5f3653 in opal_convertor_accelerator_memcpy (dest=0x7fd41755ce40, src=0x7fd43b200000, size=1, convertor=0x7ffedb4caf80) at opal_convertor.c:52
#2  0x00007fd46e5f3e93 in opal_convertor_pack (pConv=0x7ffedb4caf80, iov=0x7ffedb4cae70, out_size=0x7ffedb4cae84, max_data=0x7ffedb4cae88) at opal_convertor.c:284
#3  0x00007fd42114bb61 in mca_btl_sm_sendi (btl=0x7fd421350180 <mca_btl_sm>, endpoint=0x400c3830, convertor=0x7ffedb4caf80, header=0x7ffedb4cb0b0, header_size=16, payload_size=1,
    order=255 '\377', flags=3, tag=65 'A', descriptor=0x0) at btl_sm_sendi.c:98
#4  0x00007fd4208e9c2d in mca_bml_base_sendi (bml_btl=0x7fd41c068540, convertor=0x7ffedb4caf80, header=0x7ffedb4cb0b0, header_size=16, payload_size=1, order=255 '\377', flags=3,
    tag=65 'A', descriptor=0x0) at ../../../../ompi/mca/bml/bml.h:301
#5  0x00007fd4208eae09 in mca_pml_ob1_send_inline (buf=0x7fd43b200000, count=1, datatype=0x62ef80 <ompi_mpi_char>, dst=1, tag=100, seqn=2, dst_proc=0x40089a80, ob1_proc=0x3fbb9b40,
    endpoint=0x400c5880, comm=0x62f980 <ompi_mpi_comm_world>) at pml_ob1_isend.c:125
#6  0x00007fd4208eaf62 in mca_pml_ob1_isend (buf=0x7fd43b200000, count=1, datatype=0x62ef80 <ompi_mpi_char>, dst=1, tag=100, sendmode=MCA_PML_BASE_SEND_STANDARD,
    comm=0x62f980 <ompi_mpi_comm_world>, request=0x6310e0 <send_request>) at pml_ob1_isend.c:182
#7  0x00007fd46f550673 in PMPI_Isend (buf=0x7fd43b200000, count=1, type=0x62ef80 <ompi_mpi_char>, dest=1, tag=100, comm=0x62f980 <ompi_mpi_comm_world>, request=0x6310e0 <send_request>)
    at isend.c:101
#8  0x000000000040304f in main (argc=<optimized out>, argv=<optimized out>) at osu_bibw.c:216

We also see segfaults with the EFA network, but so far the issue appears to be within the CUDA memory copy.
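
For context, frames #0 and #1 show the pack path reaching a plain host memcpy (__memmove_avx_unaligned_erms) with a CUDA device pointer as the source. A rough, illustrative sketch of the dispatch one would expect in that copy helper is shown below; the function and helper names are assumptions for illustration, not the actual Open MPI implementation:

/* Illustrative sketch only (assumed names, not Open MPI source): a pack-path
 * copy helper is expected to route device buffers through a CUDA-aware copy
 * and fall back to memcpy() only for host memory. Crashing inside
 * __memmove_avx_unaligned_erms suggests the memcpy() branch was reached with
 * a device pointer. Build: gcc sketch.c -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lcudart */
#include <string.h>
#include <cuda_runtime.h>

static int is_device_ptr(const void *ptr)
{
    struct cudaPointerAttributes attr;
    if (cudaPointerGetAttributes(&attr, ptr) != cudaSuccess) {
        (void) cudaGetLastError();   /* clear the error for unknown pointers */
        return 0;
    }
    return attr.type == cudaMemoryTypeDevice || attr.type == cudaMemoryTypeManaged;
}

static void convertor_memcpy_sketch(void *dest, const void *src, size_t size)
{
    if (is_device_ptr(dest) || is_device_ptr(src)) {
        /* Device (or managed) memory involved: use a CUDA-aware copy. */
        cudaMemcpy(dest, src, size, cudaMemcpyDefault);
    } else {
        /* Host-only buffers: a plain memcpy is safe. */
        memcpy(dest, src, size);
    }
}

int main(void)
{
    char host = 'x', *dev = NULL;
    cudaMalloc((void **) &dev, 1);
    convertor_memcpy_sketch(dev, &host, 1);   /* host -> device: CUDA path */
    convertor_memcpy_sketch(&host, dev, 1);   /* device -> host: CUDA path */
    cudaFree(dev);
    return 0;
}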

@janjust (Contributor) commented Sep 27, 2024

I found the issue: I had a missing symbol in the port, but it's really puzzling that it would work even with UCX.
I'll open the fix after I run it through a few more tests.

@janjust (Contributor) commented Sep 30, 2024

Fixed with #12828.
Reopen if the issue persists.

janjust closed this as completed on Sep 30, 2024