Regression of osc_rdma_acc_single_intrinsic #6536

@devreal

There appears to be a regression in Open MPI's implementation of atomic operations when the MCA parameter osc_rdma_acc_single_intrinsic is set to true: the attached test case produces seemingly random results, whereas without the MCA parameter the results are as expected.

Example runs:

$ mpirun -n 2 -N 1 ./mpi_fetch_op_local_remote
result:1000
$ mpirun -n 2 -N 1 -mca osc_rdma_acc_single_intrinsic true ./mpi_fetch_op_local_remote
result:1015
mpi_fetch_op_local_remote: mpi_fetch_op_local_remote.c:98: main: Assertion `sum == 1000*(comm_size-1)' failed.

mpi_fetch_op_local_remote.tar.gz

Built with:

$ mpicc mpi_fetch_op_local_remote.c -o mpi_fetch_op_local_remote

I just tested this with the 4.0.0 release. With 3.1.2, setting the parameter works as expected.
The problem was observed on both a Cray XC40 and an InfiniBand-based cluster. Interestingly, the issue only appears if the local rank also performs atomic updates (subtracting and re-adding a value); otherwise everything is fine.

I first reported this on the users mailing list but lost track of it: https://www.mail-archive.com/[email protected]/msg32834.html
