SHMEM: missing progress when target is polling locally #6924

Closed
devreal opened this issue Aug 23, 2019 · 1 comment

devreal commented Aug 23, 2019

I have ported some of my benchmarks from MPI RMA to OpenSHMEM and found that Open MPI's SHMEM implementation does not progress atomic operations in at least one case.

The following reproducer hangs on my local system (shared memory only) when run with 2 PEs using Open MPI 4.0.x (git-f96994b12f) and UCX 1.6.x (git-736d503):

#include <shmem.h>
#include <stdio.h>
#include <stdint.h>
#define NUM_REPS 1000

int main()
{
  shmem_init();

  uint64_t *ptr = shmem_malloc(sizeof(*ptr));
  *ptr = 0;
  shmem_barrier_all();

  if (shmem_my_pe() > 0) {
    for (int i = 0; i < NUM_REPS; ++i) {
      shmem_atomic_fetch_inc(ptr, 0);  /* origin: atomic increment on PE 0 */
    }
  } else {
    while (shmem_atomic_fetch(ptr, 0) != NUM_REPS) { ; }  /* target: poll own copy; assumes 2 PEs */
  }

  shmem_barrier_all();

  printf("%lu\n", (unsigned long)shmem_atomic_fetch(ptr, 0));

  shmem_free(ptr);
  shmem_finalize();
}

The origin (PE 1) is stuck in shmem_ulong_atomic_fetch_inc, spinning in opal_progress:

(gdb) bt
#0  0x00007fffd1bc6f40 in opal_progress@plt ()
   from ~/opt/openmpi-4.0.x/lib/openmpi/mca_atomic_ucx.so
#1  0x00007fffd1bc7dbd in mca_atomic_ucx_fadd ()
   from ~/opt/openmpi-4.0.x/lib/openmpi/mca_atomic_ucx.so
#2  0x00007ffff7b6c186 in shmem_ulong_atomic_fetch_inc ()
   from ~/opt/openmpi-4.0.x/lib/liboshmem.so.40
#3  0x0000555555554a18 in main () at test_oshmem_fetch_inc.c:16

The target (PE 0) keeps polling its local value through shmem_ulong_atomic_fetch, which is issued as a UCX fetch-add of 0:

(gdb) bt
#0  0x00007fffe77a21ea in ucp_atomic_fetch_nb (ep=0x7fffe3e3a000, 
    opcode=UCP_ATOMIC_FETCH_OP_FADD, value=0, result=0x7fffffffd330, 
    op_size=8, remote_addr=4278190288, rkey=0x555555941580, 
    cb=0x7fffe733d8f0 <opal_common_ucx_empty_complete_cb>)
    at ../../../src/ucp/rma/amo_send.c:120
#1  0x00007fffd1bc7c39 in mca_atomic_ucx_fadd ()
   from ~/opt/openmpi-4.0.x/lib/openmpi/mca_atomic_ucx.so
#2  0x00007ffff7b68d13 in shmem_ulong_atomic_fetch ()
   from ~/opt/openmpi-4.0.x/lib/liboshmem.so.40
#3  0x0000555555554a39 in main () at test_oshmem_fetch_inc.c:19

If PE 0 enters the barrier directly, without polling the local value, the test completes successfully and the value reaches the expected NUM_REPS.

Potentially related to #6816

devreal commented Mar 11, 2021

I believe this was fixed by #7632. Closing.

@devreal devreal closed this as completed Mar 11, 2021