hang with Put;Accumulate;Barrier osc=rdma #3616

@markalle

Description

I'm using master and running a testcase between two InfiniBand machines as
% mpirun -mca pml ob1 -mca btl openib,vader,self -mca osc rdma -host mpi03,mpi04 ./x
and I can hit the bug with pml yalla as well. It needs to span two hosts to fail.

Here's a gist of the testcase:
https://gist.github.com/markalle/ccbd729df75188378d538767c0321f4e
It boils down to

        MPI_Accumulate(sbuf, 100, MPI_INT, to, 0, 100, MPI_INT, MPI_MAX, win);
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Put(sbuf+2000, 100, mydt, to, 2000, 100, mydt, win);
        MPI_Barrier(MPI_COMM_WORLD);

where mydt in the MPI_Put is non-contiguous.

The initiator of the Put and Accumulate goes from opal_progress() into handle_wc(), where it handles a completion callback for the Accumulate request and then hangs. The bottom of its stack trace looks like this:

#0  ompi_osc_rdma_lock_btl_fop (module=0xb07840, peer=0xbd70f0, offset=16)
    at ../../../../../../opensrc/ompi/ompi/mca/osc/rdma/osc_rdma_lock.h:63
#1  ompi_osc_rdma_lock_btl_op (module=0xb07840, peer=0xbd70f0, offset=16)
    at ../../../../../../opensrc/ompi/ompi/mca/osc/rdma/osc_rdma_lock.h:88
#2  ompi_osc_rdma_lock_release_exclusive (module=0xb07840, peer=0xbd70f0, 
    offset=16)
    at ../../../../../../opensrc/ompi/ompi/mca/osc/rdma/osc_rdma_lock.h:342
#3  0x00007fffe95c7805 in ompi_osc_rdma_acc_put_complete (btl=0x74f7b0, 
    endpoint=0x7fffe40125d0, local_address=0x7ffff0020020, 
    local_handle=0xc4e5d8, context=0xc63d90, data=0x0, status=0)
    at ../../../../../../opensrc/ompi/ompi/mca/osc/rdma/osc_rdma_accumulate.c:113

But I'm getting lost beyond that.
