
allgather segfaults with v4.1.x #8248


Closed
rajachan opened this issue Nov 24, 2020 · 2 comments

@rajachan
Member

rajachan commented Nov 24, 2020

I was testing v4.1.0rc4 with HAN and am hitting segfaults in osu_allgather with 2 or more ranks per node. Here's a tight reproducer (this happens regardless of which MTL/BTL I use):

 /shared/ompi/install/bin/mpirun   --mca btl ^openib --mca mtl ^ofi   -n 4 -N 2 --hostfile /home/ec2-user/hfile --mca coll_han_priority 100 --mca coll_adapt_priority 100  /shared/omb/install/libexec/osu-micro-benchmarks/mpi/collective/osu_allgather -x 0 -i 1


# OSU MPI Allgather Latency Test v5.6.3
# Size       Avg Latency(us)
[ip-172-31-1-217:34415] *** Process received signal ***
[ip-172-31-1-217:34415] Signal: Segmentation fault (11)
[ip-172-31-1-217:34415] Signal code: Invalid permissions (2)
[ip-172-31-1-217:34415] Failing at address: 0x258c8f0
[ip-172-31-5-188:31522] *** Process received signal ***
[ip-172-31-5-188:31522] Signal: Segmentation fault (11)
[ip-172-31-5-188:31522] Signal code: Invalid permissions (2)
[ip-172-31-5-188:31522] Failing at address: 0xc78b20
[ip-172-31-1-217:34415] [ 0] /shared/libfabric/install/lib/libfabric.so.1(+0x136e44)[0x7f2eec884e44]
[ip-172-31-1-217:34415] [ 1] [ip-172-31-5-188:31522] [ 0] /shared/libfabric/install/lib/libfabric.so.1(+0x136e44)[0x7f4c82303e44]
[ip-172-31-5-188:31522] [ 1] /lib64/libpthread.so.0(+0x117e0)[0x7f4c8ea597e0]
[ip-172-31-5-188:31522] [ 2] [0xc78b20]
[ip-172-31-5-188:31522] *** End of error message ***
/lib64/libpthread.so.0(+0x117e0)[0x7f2ef8f0e7e0]
[ip-172-31-1-217:34415] [ 2] [0x258c8f0]
[ip-172-31-1-217:34415] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 31522 on node ip-172-31-5-188 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Stacktrace:

#0  0x00000000017158f0 in ?? ()
#1  0x00007f02f46ee450 in mca_coll_han_comm_create (comm=0x17130c0, han_module=0x17150a0) at coll_han_subcomms.c:231
#2  0x00007f02f46de54b in mca_coll_han_allreduce_intra (sbuf=0x1, rbuf=0x7ffe236a83e0, count=4, dtype=0x7f030f6ca540 <ompi_mpi_int>, op=0x60a940 <ompi_mpi_op_max>, comm=0x17130c0, module=0x17150a0) at coll_han_allreduce.c:100
#3  0x00007f02f46eacd8 in mca_coll_han_allreduce_intra_dynamic (sbuf=0x1, rbuf=0x7ffe236a83e0, count=4, dtype=0x7f030f6ca540 <ompi_mpi_int>, op=0x60a940 <ompi_mpi_op_max>, comm=0x17130c0, module=0x1714ca0) at coll_han_dynamic.c:628
#4  0x00007f02f46ed2a4 in mca_coll_han_topo_init (comm=0x60a720 <ompi_mpi_comm_world>, han_module=0x14fa290, num_topo_level=2) at coll_han_topo.c:114
#5  0x00007f02f46e3b6e in mca_coll_han_allgather_intra (sbuf=0x7f030052e000, scount=1, sdtype=0x609d20 <ompi_mpi_char>, rbuf=0x7f02f17ff000, rcount=1, rdtype=0x609d20 <ompi_mpi_char>, comm=0x60a720 <ompi_mpi_comm_world>, module=0x14fa290)
    at coll_han_allgather.c:84
#6  0x00007f02f46ea5b8 in mca_coll_han_allgather_intra_dynamic (sbuf=0x7f030052e000, scount=1, sdtype=0x609d20 <ompi_mpi_char>, rbuf=0x7f02f17ff000, rcount=1, rdtype=0x609d20 <ompi_mpi_char>, comm=0x60a720 <ompi_mpi_comm_world>,
    module=0x14fa290) at coll_han_dynamic.c:394
#7  0x00007f030f38d7cd in PMPI_Allgather (sendbuf=0x7f030052e000, sendcount=1, sendtype=0x609d20 <ompi_mpi_char>, recvbuf=0x7f02f17ff000, recvcount=1, recvtype=0x609d20 <ompi_mpi_char>, comm=0x60a720 <ompi_mpi_comm_world>)
    at pallgather.c:125
#8  0x0000000000401704 in main (argc=<optimized out>, argv=<optimized out>) at osu_allgather.c:97

9f228c9dab seems to have introduced this; things look good with c614c54818. The same tests run to completion without HAN.

@devreal
Contributor

devreal commented Nov 24, 2020

Opened #8250 and #8251 to fix a typo in mca_coll_han_comm_create_new that caused the segfault. This only worked before 9f228c9 because the selection was broken until then.

@jsquyres
Member

Fixed.
