I was testing v4.1.0rc4 with HAN and am hitting segfaults in osu_allgather with 2 or more ranks per node. Here's a tight reproducer (this happens regardless of which MTL/BTL I use):
/shared/ompi/install/bin/mpirun --mca btl ^openib --mca mtl ^ofi -n 4 -N 2 --hostfile /home/ec2-user/hfile --mca coll_han_priority 100 --mca coll_adapt_priority 100 /shared/omb/install/libexec/osu-micro-benchmarks/mpi/collective/osu_allgather -x 0 -i 1
# OSU MPI Allgather Latency Test v5.6.3
# Size Avg Latency(us)
[ip-172-31-1-217:34415] *** Process received signal ***
[ip-172-31-1-217:34415] Signal: Segmentation fault (11)
[ip-172-31-1-217:34415] Signal code: Invalid permissions (2)
[ip-172-31-1-217:34415] Failing at address: 0x258c8f0
[ip-172-31-5-188:31522] *** Process received signal ***
[ip-172-31-5-188:31522] Signal: Segmentation fault (11)
[ip-172-31-5-188:31522] Signal code: Invalid permissions (2)
[ip-172-31-5-188:31522] Failing at address: 0xc78b20
[ip-172-31-1-217:34415] [ 0] /shared/libfabric/install/lib/libfabric.so.1(+0x136e44)[0x7f2eec884e44]
[ip-172-31-1-217:34415] [ 1] [ip-172-31-5-188:31522] [ 0] /shared/libfabric/install/lib/libfabric.so.1(+0x136e44)[0x7f4c82303e44]
[ip-172-31-5-188:31522] [ 1] /lib64/libpthread.so.0(+0x117e0)[0x7f4c8ea597e0]
[ip-172-31-5-188:31522] [ 2] [0xc78b20]
[ip-172-31-5-188:31522] *** End of error message ***
/lib64/libpthread.so.0(+0x117e0)[0x7f2ef8f0e7e0]
[ip-172-31-1-217:34415] [ 2] [0x258c8f0]
[ip-172-31-1-217:34415] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 31522 on node ip-172-31-5-188 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
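For reference, a minimal standalone sketch that should exercise the same path without building OMB (untested; it assumes, based on the backtrace below, that HAN's allgather triggers the subcommunicator/topology setup on the first collective call, so a single small MPI_Allgather per rank ought to be enough):

```c
/* Minimal sketch of a standalone reproducer (assumption: HAN's allgather
 * runs its topology init on the first call, per the backtrace, so one
 * small MPI_Allgather should hit the same code path). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* One char per rank, mirroring osu_allgather's smallest message size. */
    char sendbuf = (char)rank;
    char *recvbuf = malloc(size);

    MPI_Allgather(&sendbuf, 1, MPI_CHAR, recvbuf, 1, MPI_CHAR, MPI_COMM_WORLD);

    if (rank == 0)
        printf("allgather completed on %d ranks\n", size);

    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```

Launched with the same mpirun command line and coll_han_priority/coll_adapt_priority MCA settings as above.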
Stacktrace:
#0 0x00000000017158f0 in ?? ()
#1 0x00007f02f46ee450 in mca_coll_han_comm_create (comm=0x17130c0, han_module=0x17150a0) at coll_han_subcomms.c:231
#2 0x00007f02f46de54b in mca_coll_han_allreduce_intra (sbuf=0x1, rbuf=0x7ffe236a83e0, count=4, dtype=0x7f030f6ca540 <ompi_mpi_int>, op=0x60a940 <ompi_mpi_op_max>, comm=0x17130c0, module=0x17150a0) at coll_han_allreduce.c:100
#3 0x00007f02f46eacd8 in mca_coll_han_allreduce_intra_dynamic (sbuf=0x1, rbuf=0x7ffe236a83e0, count=4, dtype=0x7f030f6ca540 <ompi_mpi_int>, op=0x60a940 <ompi_mpi_op_max>, comm=0x17130c0, module=0x1714ca0) at coll_han_dynamic.c:628
#4 0x00007f02f46ed2a4 in mca_coll_han_topo_init (comm=0x60a720 <ompi_mpi_comm_world>, han_module=0x14fa290, num_topo_level=2) at coll_han_topo.c:114
#5 0x00007f02f46e3b6e in mca_coll_han_allgather_intra (sbuf=0x7f030052e000, scount=1, sdtype=0x609d20 <ompi_mpi_char>, rbuf=0x7f02f17ff000, rcount=1, rdtype=0x609d20 <ompi_mpi_char>, comm=0x60a720 <ompi_mpi_comm_world>, module=0x14fa290)
at coll_han_allgather.c:84
#6 0x00007f02f46ea5b8 in mca_coll_han_allgather_intra_dynamic (sbuf=0x7f030052e000, scount=1, sdtype=0x609d20 <ompi_mpi_char>, rbuf=0x7f02f17ff000, rcount=1, rdtype=0x609d20 <ompi_mpi_char>, comm=0x60a720 <ompi_mpi_comm_world>,
module=0x14fa290) at coll_han_dynamic.c:394
#7 0x00007f030f38d7cd in PMPI_Allgather (sendbuf=0x7f030052e000, sendcount=1, sendtype=0x609d20 <ompi_mpi_char>, recvbuf=0x7f02f17ff000, recvcount=1, recvtype=0x609d20 <ompi_mpi_char>, comm=0x60a720 <ompi_mpi_comm_world>)
at pallgather.c:125
#8 0x0000000000401704 in main (argc=<optimized out>, argv=<optimized out>) at osu_allgather.c:97
9f228c9dab seems to have introduced this. Things look good with c614c54818. The same tests run to completion without HAN.