Please describe the system on which you are running
Operating system/version: Amazon Linux2
Computer hardware: p4d.24xlarge
Network type: Elastic Fabric Adapter
Details of the problem
The osu-micro-benchmarks CUDA tests have been failing with a segfault since #13055 was merged.
mpirun --wdir . -n 2 --hostfile hostfile --map-by ppr:2:node --timeout 1800 -x LD_LIBRARY_PATH=/opt/amazon/efa/lib64 -x PATH /home/osu-micro-benchmarks/mpi/pt2pt/osu_latency --buffer-num multiple -d cuda H D
2025-02-12 18:03:27,068 - INFO - utils - mpirun output:
# OSU MPI-CUDA Latency Test
# Send Buffer on HOST (H) and Receive Buffer on DEVICE (D)
# Size Latency (us)
0 0.65
[ip-172-31-17-116:33408] *** Process received signal ***
[ip-172-31-17-116:33408] Signal: Segmentation fault (11)
[ip-172-31-17-116:33408] Signal code: Invalid permissions (2)
[ip-172-31-17-116:33408] Failing at address: 0x7f1303600000
[ip-172-31-17-116:33408] [ 0] /lib64/libpthread.so.0(+0x118e0)[0x7f133b5258e0]
[ip-172-31-17-116:33408] [ 1] /lib64/libc.so.6(+0x14dbeb)[0x7f133b2b4beb]
[ip-172-31-17-116:33408] [ 2] /opt/amazon/efa/lib64/libfabric.so.1(+0x1f672)[0x7f12e78cc672]
[ip-172-31-17-116:33408] [ 3] /opt/amazon/efa/lib64/libfabric.so.1(+0x1f627)[0x7f12e78cc627]
....
[ip-172-31-17-116:33408] *** End of error message ***
--------------------------------------------------------------------------
This help section is empty because PRRTE was built without Sphinx.
--------------------------------------------------------------------------
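The backtrace shows that the segfault comes from a memcpy attempting to copy 1 byte from an inaccessible memory address. The send buffer (src=0x7f9173200000) is copied as system memory (iface=FI_HMEM_SYSTEM, mr=0x0).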
(gdb) bt
#0 0x00007f91a9139be8 in __memmove_avx_unaligned_erms () from /lib64/libc.so.6
#1 0x00007f915561432f in ofi_memcpy (device=0, dest=0x7f9145412da0, src=0x7f9173200000, size=1)
at ./include/ofi_hmem.h:263
#2 0x00007f91556144eb in ofi_copy_from_hmem (iface=FI_HMEM_SYSTEM, device=0, dest=0x7f9145412da0, src=0x7f9173200000,
size=1) at ./include/ofi_hmem.h:405
#3 0x00007f9155614eb6 in ofi_copy_mr_iov (mr=0x0, iov=0x7ffd8b06c5f0, iov_count=1, offset=0, buf=0x7f9145412da0,
size=191, dir=0) at src/hmem.c:458
#4 0x00007f9155614f53 in ofi_copy_from_mr_iov (dest=0x7f9145412da0, size=192, mr=0x0, iov=0x7ffd8b06c5f0, iov_count=1,
iov_offset=0) at src/hmem.c:473
#5 0x00007f9155731e03 in smr_format_inline (cmd=0x7f9145412d60, mr=0x0, iov=0x7ffd8b06c5f0, count=1)
at prov/shm/src/smr_ep.c:277
#6 0x00007f9155732e20 in smr_do_inline (ep=0x20ce8f20, peer_smr=0x7f9145396000, id=1, peer_id=0, op=1, tag=1, data=0,
op_flags=131072, desc=0x0, iov=0x7ffd8b06c5f0, iov_count=1, total_len=1, context=0x0, cmd=0x7f9145412d60)
at prov/shm/src/smr_ep.c:647
#7 0x00007f915572b559 in smr_generic_inject (ep_fid=0x20ce8f20, buf=0x7f9173200000, len=1, dest_addr=1, tag=1, data=0,
op=1, op_flags=131072) at prov/shm/src/smr_msg.c:214
#8 0x00007f915572bb75 in smr_tinjectdata (ep_fid=0x20ce8f20, buf=0x7f9173200000, len=1, data=0, dest_addr=1, tag=1)
at prov/shm/src/smr_msg.c:394
#9 0x00007f91556c47fa in fi_tinjectdata (ep=0x20ce8f20, buf=0x7f9173200000, len=1, data=0, dest_addr=1, tag=1)
at ./include/rdma/fi_tagged.h:149
#10 0x00007f91556c6c0d in efa_rdm_msg_tinjectdata (ep_fid=0x20ce83c0, buf=0x7f9173200000, len=1, data=0, dest_addr=1,
tag=1) at prov/efa/src/rdm/efa_rdm_msg.c:594
#11 0x00007f9154103d5a in fi_tinjectdata (ep=0x20ce83c0, buf=0x7f9173200000, len=1, data=0, dest_addr=1, tag=1)
at /home/ec2-user/PortaFiducia/build/libraries/libfabric/v1.22.x/install/libfabric/include/rdma/fi_tagged.h:149
#12 0x00007f915410c12e in ompi_mtl_ofi_send_generic (ofi_cq_data=true, mode=MCA_PML_BASE_SEND_STANDARD,
convertor=0x7ffd8b06df60, tag=1, dest=1, comm=0x62e960 <ompi_mpi_comm_world>, mtl=0x7f9154335260 <ompi_mtl_ofi>)
at mtl_ofi.h:937
#13 ompi_mtl_ofi_send_true (mtl=0x7f9154335260 <ompi_mtl_ofi>, comm=0x62e960 <ompi_mpi_comm_world>, dest=1, tag=1,
convertor=0x7ffd8b06df60, mode=MCA_PML_BASE_SEND_STANDARD) at mtl_ofi_send_opt.c:38
#14 0x00007f9154985256 in mca_pml_cm_send (buf=0x7f9173200000, count=1, datatype=0x62df60 <ompi_mpi_char>, dst=1, tag=1,
sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x62e960 <ompi_mpi_comm_world>) at pml_cm.h:347
#15 0x00007f91a98cecbd in PMPI_Send (buf=0x7f9173200000, count=1, type=0x62df60 <ompi_mpi_char>, dest=1, tag=1,
comm=0x62e960 <ompi_mpi_comm_world>) at send.c:93
#16 0x00000000004029bc in main (argc=<optimized out>, argv=<optimized out>) at osu_latency.c:168
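To make the failure mode in the backtrace easier to follow, here is a toy model of it, assuming the buffer at src is CUDA device memory. This is not libfabric or Open MPI code: `MemIface`, `MemDesc`, `copy_iface` and `inject_is_safe` are hypothetical names used only for illustration. It relies on what the trace shows: the fi_tinjectdata frames carry no descriptor argument, mr is 0x0 further down, and the copy is done with iface=FI_HMEM_SYSTEM.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class MemIface(Enum):
    """Hypothetical stand-in for the memory interface of a buffer."""
    SYSTEM = "system"
    CUDA = "cuda"


@dataclass
class MemDesc:
    """Hypothetical stand-in for a memory descriptor passed with a send."""
    iface: MemIface


def copy_iface(desc: Optional[MemDesc]) -> MemIface:
    # Assumption drawn from the trace: with no descriptor (mr=0x0) the
    # copy routine falls back to treating the buffer as system memory.
    return MemIface.SYSTEM if desc is None else desc.iface


def inject_is_safe(actual: MemIface, desc: Optional[MemDesc]) -> bool:
    # The copy is only valid when the interface it assumes matches
    # where the buffer really lives.
    return copy_iface(desc) == actual


if __name__ == "__main__":
    # Device buffer sent through a path that carries no descriptor:
    # a host memcpy would be attempted on device memory.
    print(inject_is_safe(MemIface.CUDA, None))
    # Host buffer through the same path.
    print(inject_is_safe(MemIface.SYSTEM, None))
```

Under this model, a device buffer is only safe on a path that either carries a descriptor identifying it as device memory or avoids the inject call for non-host buffers. Which of these applies to the actual fix is not stated in this report.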
@jiaxiyan could you please test whether #13097 fixes the issue for you? With this patch I was able to run osu_latency on CUDA devices (though that was with UCX, not libfabric).
Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
main branch
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Built the main branch from source.
If you are building/installing from a git clone, please copy-n-paste the output from
git submodule status
.08e41ed 3rd-party/openpmix (v1.1.3-4067-g08e41ed5)
30cadc6746ebddd69ea42ca78b964398f782e4e3 3rd-party/prrte (psrvr-v2.0.0rc1-4839-g30cadc6746)
dfff67569fb72dbf8d73a1dcf74d091dad93f71b config/oac (dfff675)