Background information
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
The 3.1.2 tarball available here: https://www.open-mpi.org/software/ompi/v3.1/
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
From the tarball, configured with:
Please describe the system on which you are running
Details of the problem
I'm using OSU benchmarks 5.3 for an intra-node accumulate latency benchmark that works fine with 3.1.0 but fails with 3.1.2.
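To make the pattern concrete, here is a minimal sketch of what the benchmark's lock/accumulate/flush loop boils down to. This is my own reduction, not the benchmark's actual code; the window size and single iteration are placeholders, and it assumes exactly two ranks:

/* Minimal sketch of the passive-target accumulate pattern
 * (not the benchmark's exact code). Assumes 2 ranks. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    double *buf;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Window creation: this is also where the inter-node run
     * later fails (Win_allocate -> non-blocking allreduce). */
    MPI_Win_allocate(sizeof(double), sizeof(double), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &buf, &win);
    *buf = 1.0;

    if (rank == 0) {
        /* The MPI_Win_lock that never returns in the first backtrace
         * below (lock_type=2 corresponds to MPI_LOCK_SHARED). */
        MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, win);
        MPI_Accumulate(buf, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE,
                       MPI_SUM, win);
        MPI_Win_flush(1, win);
        MPI_Win_unlock(1, win);
    }
    /* Rank 1 sits in this barrier in the second backtrace below. */
    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}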
If I use --mca osc ucx explicitly, I see a hang.
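The invocation is along these lines (two ranks on one node; everything apart from --mca osc ucx is illustrative):

mpirun -np 2 --mca osc ucx ./osu_acc_latency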
These are the backtraces:

(gdb) bt
#0  0x00002ab3c3587976 in ucs_callbackq_leave (cbq=0x1b47290) at ../../../src/ucs/datastruct/callbackq.c:64
#1  0x00002ab3c3588b16 in ucs_callbackq_slow_proxy (arg=0x1b47290) at ../../../src/ucs/datastruct/callbackq.c:391
#2  0x00002ab3c2d42b9c in ucs_callbackq_dispatch (cbq=0x1b47290)
    at /home/akvenkatesh/ucx/build/../src/ucs/datastruct/callbackq.h:208
#3  0x00002ab3c2d47d70 in uct_worker_progress (worker=0x1b47290) at /home/akvenkatesh/ucx/build/../src/uct/api/uct.h:1644
#4  ucp_worker_progress (worker=0x1b5ef80) at ../../../src/ucp/core/ucp_worker.c:1381
#5  0x00002ab3c2d4bd0e in ucp_rma_wait (worker=0x1b5ef80, user_req=0x312a8e0, op_name=0x2ab3c2dbde9a "atomic_fadd64")
    at ../../../src/ucp/rma/rma.inl:49
#6  0x00002ab3c2d4e885 in ucp_atomic_fetch_b (ep=0x2ab3e2c0f0e0, opcode=UCP_ATOMIC_FETCH_OP_FADD, value=1, result=0x7ffde77b99f8,
    size=8, remote_addr=53173648, rkey=0x3121b40, op_name=0x2ab3c2dbde9a "atomic_fadd64") at ../../../src/ucp/rma/amo_basic.c:263
#7  0x00002ab3c2d4ec55 in ucp_atomic_fadd64_inner (result=0x7ffde77b99f8, rkey=0x3121b40, remote_addr=53173648, add=1,
    ep=0x2ab3e2c0f0e0) at ../../../src/ucp/rma/amo_basic.c:292
#8  ucp_atomic_fadd64 (ep=0x2ab3e2c0f0e0, add=1, remote_addr=53173648, rkey=0x3121b40, result=0x7ffde77b99f8)
    at ../../../src/ucp/rma/amo_basic.c:288
#9  0x00002ab3cc634ece in start_shared (module=0x1ef6980, target=1) at ../../../../../ompi/mca/osc/ucx/osc_ucx_passive_target.c:28
#10 0x00002ab3cc63545f in ompi_osc_ucx_lock (lock_type=2, target=1, assert=0, win=0x1b38310)
    at ../../../../../ompi/mca/osc/ucx/osc_ucx_passive_target.c:145
#11 0x00002ab3afd6b639 in PMPI_Win_lock (lock_type=2, rank=1, assert=0, win=0x1b38310) at pwin_lock.c:66
#12 0x0000000000401bf6 in run_acc_with_flush (rank=0, type=WIN_ALLOCATE) at ../../../mpi/one-sided/osu_acc_latency.c:207
#13 0x00000000004018d8 in main (argc=1, argv=0x7ffde77b9c88) at ../../../mpi/one-sided/osu_acc_latency.c:128

(gdb) bt
#0  progress_callback () at ../../../../../ompi/mca/osc/ucx/osc_ucx_component.c:107
#1  0x00002b66e1b65562 in opal_progress () at ../../opal/runtime/opal_progress.c:228
#2  0x00002b66e0edc326 in ompi_request_wait_completion (req=0x32cdde0) at ../../ompi/request/request.h:413
#3  0x00002b66e0edc360 in ompi_request_default_wait (req_ptr=0x7fffa3250a80, status=0x7fffa3250a60)
    at ../../ompi/request/req_wait.c:42
#4  0x00002b66e0f82efd in ompi_coll_base_sendrecv_zero (dest=0, stag=-16, source=0, rtag=-16, comm=0x607800 <ompi_mpi_comm_world>)
    at ../../../../ompi/mca/coll/base/coll_base_barrier.c:64
#5  0x00002b66e0f835b8 in ompi_coll_base_barrier_intra_two_procs (comm=0x607800 <ompi_mpi_comm_world>, module=0x2f1ad30)
    at ../../../../ompi/mca/coll/base/coll_base_barrier.c:300
#6  0x00002b66fd4f7a3d in ompi_coll_tuned_barrier_intra_dec_fixed (comm=0x607800 <ompi_mpi_comm_world>, module=0x2f1ad30)
    at ../../../../../ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:196
#7  0x00002b66e0efdf5e in PMPI_Barrier (comm=0x607800 <ompi_mpi_comm_world>) at pbarrier.c:63
#8  0x0000000000401e2d in run_acc_with_flush (rank=1, type=WIN_ALLOCATE) at ../../../mpi/one-sided/osu_acc_latency.c:219
#9  0x00000000004018d8 in main (argc=1, argv=0x7fffa3250cc8) at ../../../mpi/one-sided/osu_acc_latency.c:128
If I disable openib, the intra-node case works and I get this:
The same doesn't work for the inter-node case, which fails with the following non-blocking allreduce error stemming from Win_allocate:
Changing the allocation to Win_create or Win_create_dynamic doesn't change the problem, because all of these window-creation paths eventually go through ompi_comm_dup.
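For reference, the Win_create variant I tried looks roughly like this (a sketch with placeholder sizes, not my exact code); the communicator duplication happens inside the window-creation call either way:

/* Sketch: same window, created with MPI_Win_create instead of
 * MPI_Win_allocate; both still dup the communicator internally. */
#include <mpi.h>

static MPI_Win create_win(double **buf)
{
    MPI_Win win;
    MPI_Alloc_mem(sizeof(double), MPI_INFO_NULL, buf);
    MPI_Win_create(*buf, sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    return win;
}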
What am I missing?