I'm using master and running a testcase between two InfiniBand machines as
% mpirun -mca pml ob1 -mca btl openib,vader,self -mca osc rdma -host mpi03,mpi04 ./x
and can hit the bug using pml yalla as well. It needs to span two hosts to fail.
Here's a gist of the testcase:
https://gist.github.com/markalle/ccbd729df75188378d538767c0321f4e
It boils down to
MPI_Accumulate(sbuf, 100, MPI_INT, to, 0, 100, MPI_INT, MPI_MAX, win);
MPI_Barrier(MPI_COMM_WORLD);
MPI_Put(sbuf+2000, 100, mydt, to, 2000, 100, mydt, win);
MPI_Barrier(MPI_COMM_WORLD);
where mydt in the MPI_Put is non-contiguous.
The initiator of the Put and Accumulate ends up going from opal_progress() into handle_wc(), where it handles a completion callback for the Accumulate request, and hangs. The bottom of its stack trace looks like this:
#0 ompi_osc_rdma_lock_btl_fop (module=0xb07840, peer=0xbd70f0, offset=16)
at ../../../../../../opensrc/ompi/ompi/mca/osc/rdma/osc_rdma_lock.h:63
#1 ompi_osc_rdma_lock_btl_op (module=0xb07840, peer=0xbd70f0, offset=16)
at ../../../../../../opensrc/ompi/ompi/mca/osc/rdma/osc_rdma_lock.h:88
#2 ompi_osc_rdma_lock_release_exclusive (module=0xb07840, peer=0xbd70f0,
offset=16)
at ../../../../../../opensrc/ompi/ompi/mca/osc/rdma/osc_rdma_lock.h:342
#3 0x00007fffe95c7805 in ompi_osc_rdma_acc_put_complete (btl=0x74f7b0,
endpoint=0x7fffe40125d0, local_address=0x7ffff0020020,
local_handle=0xc4e5d8, context=0xc63d90, data=0x0, status=0)
at ../../../../../../opensrc/ompi/ompi/mca/osc/rdma/osc_rdma_accumulate.c:113
But I'm getting lost beyond that.