I have ported some of my benchmarks from MPI RMA to OpenSHMEM and found that Open MPI's OSHMEM implementation does not progress atomic operations in at least one case.
The reproducer (`test_oshmem_fetch_inc.c`) hangs on my local system (shared memory only) when run with 2 PEs using Open MPI 4.0.x (git-f96994b12f) and UCX 1.6.x (git-736d503).
The origin is stuck in `shmem_ulong_atomic_fetch_inc`:

    (gdb) bt
    #0  0x00007fffd1bc6f40 in opal_progress@plt ()
        from ~/opt/openmpi-4.0.x/lib/openmpi/mca_atomic_ucx.so
    #1  0x00007fffd1bc7dbd in mca_atomic_ucx_fadd ()
        from ~/opt/openmpi-4.0.x/lib/openmpi/mca_atomic_ucx.so
    #2  0x00007ffff7b6c186 in shmem_ulong_atomic_fetch_inc ()
        from ~/opt/openmpi-4.0.x/lib/liboshmem.so.40
    #3  0x0000555555554a18 in main () at test_oshmem_fetch_inc.c:16

The target keeps polling its local value:

    (gdb) bt
    #0  0x00007fffe77a21ea in ucp_atomic_fetch_nb (ep=0x7fffe3e3a000,
        opcode=UCP_ATOMIC_FETCH_OP_FADD, value=0, result=0x7fffffffd330,
        op_size=8, remote_addr=4278190288, rkey=0x555555941580,
        cb=0x7fffe733d8f0 <opal_common_ucx_empty_complete_cb>)
        at ../../../src/ucp/rma/amo_send.c:120
    #1  0x00007fffd1bc7c39 in mca_atomic_ucx_fadd ()
        from ~/opt/openmpi-4.0.x/lib/openmpi/mca_atomic_ucx.so
    #2  0x00007ffff7b68d13 in shmem_ulong_atomic_fetch ()
        from ~/opt/openmpi-4.0.x/lib/liboshmem.so.40
    #3  0x0000555555554a39 in main () at test_oshmem_fetch_inc.c:19

If PE 0 enters the barrier directly (without polling its local value), the test runs successfully and the value reaches the expected `NUM_REPS`.
Potentially related to #6816