Description
There seems to be a regression in Open MPI's implementation of atomic operations when the MCA parameter osc_rdma_acc_single_intrinsic is set to true: with the parameter set, the attached test case produces seemingly random results, whereas without it the results are as expected.
Example runs:
$ mpirun -n 2 -N 1 ./mpi_fetch_op_local_remote
result:1000
$ mpirun -n 2 -N 1 -mca osc_rdma_acc_single_intrinsic true ./mpi_fetch_op_local_remote
result:1015
mpi_fetch_op_local_remote: mpi_fetch_op_local_remote.c:98: main: Assertion `sum == 1000*(comm_size-1)' failed.
mpi_fetch_op_local_remote.tar.gz
Built with:
$ mpicc mpi_fetch_op_local_remote.c -o mpi_fetch_op_local_remote
I just tested this with the 4.0.0 release. Setting the parameter with 3.1.2 works as expected.
This problem was observed on both a Cray XC40 and an IB-based cluster. Interestingly, the issue only appears if the local rank also performs atomic updates (subtracting and re-adding a value); otherwise everything is fine.
I first reported this on the user ML but lost track of it: https://www.mail-archive.com/[email protected]/msg32834.html