Was testing v4.1.0rc4 with HAN, and I am hitting segfaults in osu_allgather with 2 or more ranks per node. Here's a tight reproducer (it happens irrespective of the MTL/BTL I use):
/shared/ompi/install/bin/mpirun --mca btl ^openib --mca mtl ^ofi -n 4 -N 2 --hostfile /home/ec2-user/hfile --mca coll_han_priority 100 --mca coll_adapt_priority 100 /shared/omb/install/libexec/osu-micro-benchmarks/mpi/collective/osu_allgather -x 0 -i 1
# OSU MPI Allgather Latency Test v5.6.3
# Size Avg Latency(us)
[ip-172-31-1-217:34415] *** Process received signal ***
[ip-172-31-1-217:34415] Signal: Segmentation fault (11)
[ip-172-31-1-217:34415] Signal code: Invalid permissions (2)
[ip-172-31-1-217:34415] Failing at address: 0x258c8f0
[ip-172-31-5-188:31522] *** Process received signal ***
[ip-172-31-5-188:31522] Signal: Segmentation fault (11)
[ip-172-31-5-188:31522] Signal code: Invalid permissions (2)
[ip-172-31-5-188:31522] Failing at address: 0xc78b20
[ip-172-31-1-217:34415] [ 0] /shared/libfabric/install/lib/libfabric.so.1(+0x136e44)[0x7f2eec884e44]
[ip-172-31-1-217:34415] [ 1] [ip-172-31-5-188:31522] [ 0] /shared/libfabric/install/lib/libfabric.so.1(+0x136e44)[0x7f4c82303e44]
[ip-172-31-5-188:31522] [ 1] /lib64/libpthread.so.0(+0x117e0)[0x7f4c8ea597e0]
[ip-172-31-5-188:31522] [ 2] [0xc78b20]
[ip-172-31-5-188:31522] *** End of error message ***
/lib64/libpthread.so.0(+0x117e0)[0x7f2ef8f0e7e0]
[ip-172-31-1-217:34415] [ 2] [0x258c8f0]
[ip-172-31-1-217:34415] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 31522 on node ip-172-31-5-188 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Stacktrace:
#0 0x00000000017158f0 in ?? ()
#1 0x00007f02f46ee450 in mca_coll_han_comm_create (comm=0x17130c0, han_module=0x17150a0) at coll_han_subcomms.c:231
#2 0x00007f02f46de54b in mca_coll_han_allreduce_intra (sbuf=0x1, rbuf=0x7ffe236a83e0, count=4, dtype=0x7f030f6ca540 <ompi_mpi_int>, op=0x60a940 <ompi_mpi_op_max>, comm=0x17130c0, module=0x17150a0) at coll_han_allreduce.c:100
#3 0x00007f02f46eacd8 in mca_coll_han_allreduce_intra_dynamic (sbuf=0x1, rbuf=0x7ffe236a83e0, count=4, dtype=0x7f030f6ca540 <ompi_mpi_int>, op=0x60a940 <ompi_mpi_op_max>, comm=0x17130c0, module=0x1714ca0) at coll_han_dynamic.c:628
#4 0x00007f02f46ed2a4 in mca_coll_han_topo_init (comm=0x60a720 <ompi_mpi_comm_world>, han_module=0x14fa290, num_topo_level=2) at coll_han_topo.c:114
#5 0x00007f02f46e3b6e in mca_coll_han_allgather_intra (sbuf=0x7f030052e000, scount=1, sdtype=0x609d20 <ompi_mpi_char>, rbuf=0x7f02f17ff000, rcount=1, rdtype=0x609d20 <ompi_mpi_char>, comm=0x60a720 <ompi_mpi_comm_world>, module=0x14fa290)
at coll_han_allgather.c:84
#6 0x00007f02f46ea5b8 in mca_coll_han_allgather_intra_dynamic (sbuf=0x7f030052e000, scount=1, sdtype=0x609d20 <ompi_mpi_char>, rbuf=0x7f02f17ff000, rcount=1, rdtype=0x609d20 <ompi_mpi_char>, comm=0x60a720 <ompi_mpi_comm_world>,
module=0x14fa290) at coll_han_dynamic.c:394
#7 0x00007f030f38d7cd in PMPI_Allgather (sendbuf=0x7f030052e000, sendcount=1, sendtype=0x609d20 <ompi_mpi_char>, recvbuf=0x7f02f17ff000, recvcount=1, recvtype=0x609d20 <ompi_mpi_char>, comm=0x60a720 <ompi_mpi_comm_world>)
at pallgather.c:125
#8 0x0000000000401704 in main (argc=<optimized out>, argv=<optimized out>) at osu_allgather.c:97
9f228c9dab seems to have introduced this. Things look good with c614c54818. Same tests run to completion without HAN.
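For reference, a run "without HAN" can simply leave off the coll_han_priority/coll_adapt_priority overrides (or exclude the components outright with --mca coll ^han,adapt), and the offending commit in that range can be pinned down with a plain git bisect, assuming a wrapper script (placeholder name below) that rebuilds Open MPI and reruns the reproducer:

git bisect start 9f228c9dab c614c54818
git bisect run ./rebuild_and_run_osu_allgather.sh   # non-zero exit when the allgather segfaults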
Opened #8250 and #8251 to fix a typo in mca_coll_han_comm_create_new that caused the segfault. The reason this worked before 9f228c9 is that the selection logic was broken before that commit, so HAN was not actually being used...
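Not the actual change in #8250/#8251, but for context on the failure mode: frame #0 in the backtrace is a bare data-looking address, which is the classic signature of dispatching a collective through a function pointer that was never filled in for that communicator. A made-up sketch of that class of typo (hypothetical names, not the Open MPI sources):

typedef int (*allreduce_fn)(const void *sbuf, void *rbuf, int count);

struct coll_table {
    allreduce_fn allreduce;     /* only valid once the communicator's coll table is set up */
};

struct comm {
    struct coll_table *coll;
};

/* The sub-communicator setup needs an allreduce over the *parent* communicator;
 * a typo that dispatches through the half-initialized child instead calls
 * whatever bytes sit behind child->coll->allreduce, and the process dies with
 * "Invalid permissions" at a raw address, much like the trace above. */
static int create_subcomm(struct comm *parent, struct comm *child, int *vals, int n)
{
    return child->coll->allreduce(vals, vals, n);   /* should be parent->coll->allreduce */
}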